WO2021155667A1 - Model training method and apparatus, and clustering system - Google Patents
- Classifications: H04L67/10 (protocols in which an application is distributed across nodes in the network); G06N3/08 (neural networks; learning methods).
- This application relates to the technical field of Artificial Intelligence (AI), and in particular to a model training method, device and cluster system.
- the overall structure of HPC can be divided into the following main parts: external network, master node, compute node, storage, computation network, and management network.
- the computing resources of the computing nodes include single-core central processing units (central processing unit, CPU), multi-core CPUs, or multi-CPUs.
- the computing resources of a single computing node are mainly CPU-based, and the hardware capabilities are limited.
- the above-mentioned HPC uses deep learning to train AI models with low efficiency.
- the embodiments of the present application provide a model training method, device, and cluster system, which use computing nodes with GPU cards to improve the hardware capabilities of the cluster system, thereby improving the efficiency of model training.
- an embodiment of the present application provides a cluster system, including: a control node, at least one computing node, and a storage node. The control node establishes a connection with each of the at least one computing node and is configured to allocate computing resources for the task of training a target model; each computing node includes at least one central processing unit (CPU) and at least one graphics processing unit (GPU) and is configured to use the computing resources to train the target model; the storage node establishes a network connection with each of the at least one computing node and is configured to store data required for training the target model.
- any two computing nodes in the at least one computing node establish a network connection based on InfiniBand interconnection; inside a computing node, the CPUs and GPUs are connected through Peripheral Component Interconnect Express (PCIe), and the GPUs are interconnected through NVLink.
- an embodiment of the present application provides a model training method, which is applicable to a cluster system including a control node, at least one computing node, and a storage node.
- the method includes: the control node receives a first request sent by an application program interface (API) server, where the first request is obtained by the API server according to the resource information, required for training the target model, that is sent by a first user through a client on a first terminal; the control node allocates a target resource to the target model; and the control node sends a second request to a target computing node, so that the target computing node uses the target resource to train the target model.
- the resource information includes at least one of the following information: the number of target computing nodes, the number of GPUs occupied on a target computing node when training the target model, and the number of CPUs occupied on a target computing node when training the target model.
- the above method further includes: the control node receives a management request sent by a second terminal device, where the management request is used to request management of the computing nodes in the cluster system, and the control node manages the computing nodes in the cluster system according to the management request.
- the control node managing the computing nodes in the cluster system according to the management request includes: the control node calls a cluster open application program interface (Open API) to authenticate a second user; if the second user passes the authentication, the control node manages the computing nodes in the cluster system according to the management request.
- the management request carries the access key identifier of the second user and a first key, where the first key is generated by the second terminal device using a preset authentication mechanism.
- the control node calling the cluster Open API to authenticate the second user includes: the control node calls the cluster Open API and uses the preset authentication mechanism to generate a second key; if the first key is the same as the second key, the control node determines the management authority of the second user and sends authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
- the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster; or, the cluster Open API includes a machine management API, and the management request is used to request that any one of the at least one computing node perform any one of the following operations: going online, going offline, restarting, reinstalling, repairing, and shielding.
- an embodiment of the present application provides a model training method, which is applicable to a cluster system including a control node, at least one computing node, and a storage node.
- the method includes: a target computing node receives a second request sent by the control node, where the second request is sent after the control node receives a first request sent by the application program interface (API) server and allocates a target resource to the target model.
- the first request is obtained by the API server according to the resource information required for training the target model sent by the first user through the client on the first terminal.
- the target computing node is included in the at least one computing node; the target computing node uses the target resource to train the target model and sends the trained target model to the storage node.
- the resource information includes at least one of the following information: the number of target computing nodes, the number of GPUs occupied on a target computing node when training the target model, and the number of CPUs occupied on a target computing node when training the target model.
- the above method further includes: the target computing node receives a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resource on the target computing node when training the target model.
- the target computing node sends a query response to the first terminal device, where the query response carries usage status information of the target resource, so that the first terminal device displays the usage status of the target resource according to the usage status information.
- an embodiment of the present application provides a model training device, including:
- the receiving unit is configured to receive a first request sent by an application program interface (API) server, where the first request carries the resource information required for training the target model and is obtained by the API server according to the resource information sent by the first user through the client on the first terminal;
- a processing unit configured to allocate target resources to the target model according to the resource information
- the sending unit is configured to send a second request to a target computing node, so that the target computing node uses the target resource to train a target model.
- the resource information includes at least one of the following information: the number of target computing nodes, the number of GPUs occupied on a target computing node when training the target model, and the number of CPUs occupied on a target computing node when training the target model.
- the receiving unit is further configured to receive a management request sent by a second terminal device, where the management request is used to request management of computing nodes in the cluster system;
- the processing unit is further configured to manage the computing nodes in the cluster system according to the management request.
- when managing the computing nodes in the cluster system according to the management request, the processing unit calls the cluster open application program interface (Open API) to authenticate the second user; if the second user passes the authentication, the processing unit manages the computing nodes in the cluster system according to the management request.
- the management request carries the access key identifier of the second user and the first key, and the first key is generated by the second terminal device using a preset authentication mechanism
- the processing unit is configured to call the cluster Open API and use the preset authentication mechanism to generate a second key, and if the first key and the second key are the same, determine the management authority of the second user; the sending unit is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
- the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster; or, the cluster Open API includes a machine management API, and the management request is used to request that any one of the at least one computing node perform any one of the following operations: going online, going offline, restarting, reinstalling, repairing, and shielding.
- an embodiment of the present application provides a model training device, including:
- the receiving unit is configured to receive a second request sent by the control node.
- the second request is sent after the control node receives the first request sent by the application program interface API server and allocates target resources to the target model.
- the first request is obtained by the API server according to the resource information required for training the target model sent by the first user through the client on the first terminal, and the target computing node is included in the at least one computing node;
- a processing unit configured to use the target resource to train the target model
- the sending unit is used to send the trained target model to the storage node.
- the resource information includes at least one of the following information: the number of target computing nodes, the number of GPUs occupied on a target computing node when training the target model, and the number of CPUs occupied on a target computing node when training the target model.
- the receiving unit is further configured to receive a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resource on the target computing node when training the target model;
- the sending unit is further configured to send a query response to the first terminal device, where the query response carries the usage status information of the target resource, so that the first terminal device displays the usage status of the target resource according to the usage status information.
- an electronic device including:
- at least one processor;
- a memory communicatively connected with the at least one processor; wherein,
- the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method in the second aspect or any possible implementation of the second aspect.
- an electronic device including:
- at least one processor;
- a memory communicatively connected with the at least one processor; wherein,
- the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method in the third aspect or any possible implementation of the third aspect.
- the embodiments of the present application provide a computer program product containing instructions that, when run on an electronic device, cause the electronic device to execute the method in the foregoing second aspect or the various possible implementations of the second aspect.
- embodiments of the present application provide a computer program product containing instructions that, when run on an electronic device, cause the electronic device to execute the method in the foregoing third aspect or the various possible implementations of the third aspect.
- the embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions.
- the non-transitory computer-readable storage medium stores instructions that, when run on an electronic device, cause the electronic device to perform the method in the foregoing second aspect or the various possible implementation manners of the second aspect.
- embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions.
- the non-transitory computer-readable storage medium stores instructions that, when run on an electronic device, cause the electronic device to execute the method in the foregoing third aspect or the various possible implementation manners of the third aspect.
- an embodiment of the present application provides a cluster system, including: a control node and at least one computing node, where the control node establishes a network connection with each of the at least one computing node based on the Transmission Control Protocol (TCP), and the computing resources of each computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
- An embodiment in the above application has the following advantages or beneficial effects: the control node and the at least one computing node are interconnected through a network, and GPUs are introduced as computing resources in the computing nodes, which greatly improves the hardware capabilities of the cluster system and thereby improves the efficiency of model training.
- using the HDFS file system to temporarily store the user's execution environment and the final running results avoids the disadvantage of the data set used to train the model occupying too much storage space on the computing nodes, and also avoids the insecurity of placing the trained model on the computing nodes.
- FIG. 1 is a schematic structural diagram of a cluster system provided by an embodiment of the present application.
- FIG. 2 is a schematic diagram of the underlying framework of a cluster system provided by an embodiment of the present application.
- FIG. 3 is a schematic diagram of network optimization of a cluster system provided by an embodiment of the present application.
- FIG. 4 is a schematic diagram of system-level performance constraint analysis of a cluster system provided by an embodiment of the present application.
- FIG. 5 is a schematic diagram of memory monitoring of computing nodes of a cluster system provided by an embodiment of the present application.
- FIG. 6 is a flowchart of a model training method provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of the system architecture of HGCP in the model training method provided by an embodiment of the present application.
- FIG. 8 is a schematic diagram of the process of submitting tasks in the model training method provided by an embodiment of the present application.
- FIG. 9 is a schematic diagram of the slurm OPEN API in the model training method provided by an embodiment of the present application.
- FIG. 10 is a schematic diagram of an authentication process in the model training method provided by an embodiment of the present application.
- FIG. 11 is a schematic diagram of the deployment of the api server in the model training method provided by an embodiment of the present application.
- FIG. 12 is a working schematic diagram of the super-management platform in the model training method provided by an embodiment of the present application.
- FIG. 13 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
- FIG. 14 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
- FIG. 15 is a block diagram of an electronic device used to implement the model training method of an embodiment of the present application.
- the overall structure can be divided into the following main parts: the external network, the master node, the compute nodes, storage, the computation network, and the management network.
- the computing resources of the computing nodes include single-core central processing units (central processing unit, CPU), multi-core CPUs, or multi-CPUs.
- the computing resources of a single computing node are mainly CPU-based, and the hardware capabilities are limited.
- the above-mentioned HPC uses deep learning to train AI models with low efficiency.
- a high-performance computing cluster is a branch of computer science that aims to solve complex scientific or numerical calculations. It is a loosely coupled collection of computing nodes composed of multiple nodes (servers), which provides users with services such as high-performance computing, network request response, and professional applications (including parallel computing, databases, and the web). However, how to manage the computing nodes of a large-scale computing cluster and how to schedule training tasks are thorny issues.
- the embodiments of the present application provide a model training method, device, and cluster system.
- in terms of hardware, the hardware capabilities of the cluster system are greatly improved, thereby improving the efficiency of model training; in terms of software, the slurm framework is optimized, and a client, a super-management platform, and the like are introduced, making the cluster system more convenient to use.
- the embodiments of the present application will be described in detail from two aspects of hardware capability improvement and software capability improvement.
- Fig. 1 is a schematic structural diagram of a cluster system provided by an embodiment of the present application.
- the cluster system provided by the embodiment of the present application includes: a control node, at least one computing node, and a storage node. The control node establishes a connection with each of the at least one computing node, for example, a network connection based on the Transmission Control Protocol (TCP); the computing resources of each computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
- the storage node establishes a network connection with each computing node in the at least one computing node to store data required for training the target model.
- the storage node is, for example, a distributed file system (Hadoop Distributed File System, HDFS).
- the data required by the target model includes the client, sample data set, etc.
- the target model is also stored in the storage node.
- the client is used to submit resource information and the like to the API server, so that the API server integrates the resource information, obtains the first request, and submits it to the control node.
- the API server is not shown in the figure; in actual implementation, the API server and the control node can be integrated or set independently. R&D personnel can log in to the cluster system through the first terminal device to submit the first request for requesting model training, and the administrator can log in to the cluster system through the second terminal device to perform operations such as creating a cluster, deleting a cluster, bringing machines online, taking machines offline, and shielding machines, where the machines are computing nodes.
- the first terminal device and the second terminal device may be the same device or different terminal devices, which is not limited in the embodiment of the present application.
- the computing resources of each computing node include CPU and GPU.
- the computing node is, for example, an all-in-one machine for AI model training, with 3 CPUs and 8 GPUs, where the CPU and GPU can be flexibly set.
- the computing resources included in the computing node may also be a Field-Programmable Gate Array (FPGA), etc., which is not limited in the embodiment of the present application.
- the HDFS file system is used to temporarily store the user's execution environment and the final running results. It avoids the disadvantage of the data set used to train the model occupying too much storage space on the computing nodes, and also avoids the insecurity of placing the trained model on the computing nodes.
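- as an illustration of this staging pattern (not part of the original application), the sketch below moves data through HDFS with the standard `hdfs dfs` command-line tool invoked from Python; the paths and file names are hypothetical.

```python
import subprocess

def hdfs_put(local_path: str, hdfs_path: str) -> None:
    """Upload a local file or directory to HDFS via the standard
    `hdfs dfs -put` CLI (assumed installed and configured)."""
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_path],
                   check=True)

def hdfs_get(hdfs_path: str, local_path: str) -> None:
    """Download a file or directory from HDFS to local disk."""
    subprocess.run(["hdfs", "dfs", "-get", hdfs_path, local_path],
                   check=True)

# Hypothetical usage: stage the execution environment before training and
# persist the trained model afterwards, keeping compute-node disks clean.
# hdfs_put("client.tar.gz", "/user/alice/env/client.tar.gz")
# hdfs_put("model_final.bin", "/user/alice/output/model_final.bin")
```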
- the number of control nodes in the embodiment of the present application is not limited to one.
- for example, the embodiment of the present application may set one master control node and one standby control node; when the master control node fails, the standby control node can be started.
- the control node and the at least one computing node are interconnected through a network, and GPUs are introduced as computing resources in the computing nodes, thereby greatly improving the hardware capabilities of the cluster system and improving the efficiency of model training.
- using the HDFS file system to temporarily store the user's execution environment and the final running results avoids the disadvantage of the data set used to train the model occupying too much storage space on the computing nodes, and also avoids the insecurity of placing the trained model on the computing nodes.
- for ease of description, the existing cluster system is referred to as a high-performance computing (HPC) system, and the cluster system provided in the embodiments of the present application is referred to as a High Performance GPU Cluster Platform (HGCP).
- FIG. 2 is a schematic diagram of the underlying framework of the cluster system provided by an embodiment of the present application.
- the cluster system provided by the embodiment of the present application includes six layers of chip, system design, performance optimization, cluster, framework, and application from bottom to top.
- the chip layer includes various computing resources, such as CPUs, GPUs, FPGAs, application-specific integrated circuits (ASICs), and other AI chips.
- the system design layer includes cloud and edge AI all-in-one machines, high-performance storage pools, high-speed interconnection architecture, etc.
- the performance optimization layer includes computation optimization, input/output (IO) optimization, and communication optimization.
- the cluster layer includes Kubernetes (K8s) cloud-native capabilities, intelligent scheduling, automatic scaling, and the like.
- the framework layer includes deep learning frameworks, such as PaddlePaddle, TensorFlow (TF), and Torch.
- the application layer includes video, image, natural language understanding, search, recommendation or advertisement, etc.
- the cluster system provided by this application is based on the slurm open source Linux cluster resource management system, which has good scalability and high fault tolerance.
- the HGCP provided by the embodiments of the present application also has complete training-task lifecycle management, machine management, and fault monitoring capabilities, with a very high degree of automation.
- the inherent functions of slurm include resource management functions and rich job scheduling functions, such as simple first-in-first-out (FIFO), job priority calculation, resource preemption and other functions.
- the cluster system provided by the embodiment of the present application also supports the allocation of general computing resources such as GPU, network bandwidth and even memory.
- in order to enable the high-speed circulation of AI training tasks in the cluster system, HGCP builds an efficient task scheduling system in the upper layer. It takes full account of the amount of high-priority resources of each business and the training tasks actually running and pending in the cluster, pools all resources, sets a logical high-priority quota for each business, and specifies the GPU usage ratio for single-node and multi-node tasks, which reduces the impact of resource fragmentation, effectively reduces cluster resource idleness, improves the usage efficiency of GPU cluster resources, and reduces operating costs.
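- purely as an illustration of such a quota policy (the application does not specify one), a minimal admission check might look like the sketch below; the over-quota rule and all names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class BusinessQuota:
    """Logical high-priority GPU quota for one business line (illustrative)."""
    quota_gpus: int        # GPUs guaranteed to this business
    used_gpus: int = 0     # GPUs currently held by its running tasks

def try_admit(quotas: dict, business: str, requested_gpus: int,
              idle_gpus_in_pool: int) -> bool:
    """Admit a task if the business is within quota, or opportunistically
    when the pooled cluster still has idle GPUs (simple over-quota rule)."""
    q = quotas[business]
    within_quota = q.used_gpus + requested_gpus <= q.quota_gpus
    if within_quota or requested_gpus <= idle_gpus_in_pool:
        q.used_gpus += requested_gpus
        return True
    return False  # task stays pending in the queue
```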
- any two computing nodes in the at least one computing node establish a network connection based on InfiniBand interconnection; inside a computing node, the CPUs and GPUs are connected through Peripheral Component Interconnect Express (PCIe), and the GPUs are interconnected through NVLink.
- Fig. 3 is a schematic diagram of network optimization of a cluster system provided by an embodiment of the present application.
- Each computing node includes a CPU node (node) and a GPU box (BOX).
- the CPU node includes CPU1 and CPU2.
- the GPU BOX contains three Non-Volatile Memory Express (NVMe) drives, referred to as hard disks; in addition, the GPU BOX also includes 8 GPUs (numbered in the figure), a network interface controller (NIC), a PCIe switch (PCIE SW), and the like.
- the solid arrows in the figure show the PCIe connections, and the dashed arrows show the NVLink connections.
- for simplicity, the GPU part of the first computing node only shows the PCIe connections, and the GPU part of the second computing node only shows the NVLink connections; in practice, the GPU part of each computing node includes both PCIe and NVLink connections.
- the cluster system uses InfiniBand (IB), a new I/O bus technology based on full-duplex, switched serial transmission, which replaces the MPI communication method commonly used in existing cluster systems and simplifies and speeds up the connections between computing nodes.
- the CPUs and GPUs in a computing node are connected through PCIe, and the GPUs are interconnected through NVLink, which greatly improves communication between the GPU cards in the computing node.
- the bandwidths and latencies of PCIe, NVLink, and Ethernet/Remote Direct Memory Access (RDMA) networks vary widely, so the optimal resource combination needs to be allocated.
- the HGCP provided in the embodiments of this application adopts topology-aware scheduling to optimize communication bandwidth.
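- one way to make scheduling topology-aware (an illustrative sketch, not the application's actual algorithm) is to score candidate GPU placements by the quality of the links between them and pick the most tightly coupled set; the scores below are arbitrary example values.

```python
# Relative link-quality scores; illustrative values only.
LINK_SCORE = {"nvlink": 3, "pcie": 2, "ib": 1}

def link_type(gpu_a: str, gpu_b: str, topology: dict) -> str:
    """`topology` maps an unordered GPU pair to its best interconnect."""
    return topology[frozenset((gpu_a, gpu_b))]

def placement_score(gpus: list, topology: dict) -> int:
    """Sum pairwise link scores; higher means tighter coupling."""
    return sum(LINK_SCORE[link_type(a, b, topology)]
               for i, a in enumerate(gpus) for b in gpus[i + 1:])

def best_placement(candidates: list, topology: dict) -> list:
    """Pick the candidate GPU set with the highest aggregate score."""
    return max(candidates, key=lambda gpus: placement_score(gpus, topology))
```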
- cluster utilization is the core evaluation indicator: increasing utilization is equivalent to reducing the cost of use. At the same time, performance profiling helps business training programs locate bottlenecks and achieve good optimization results.
- the existing HPC has no system-level fine-grained performance analysis tools. To perform performance analysis, the usual method is to analyze a single node after consulting with the business. Human intervention is required from startup to collection to data analysis, training must be started in coordination with the business, and only specific problems can be analyzed case by case; the efficiency is low, and the approach is not suitable for large-scale promotion.
- the HGCP uses a deep learning system performance profiler (Dperf) to perform performance analysis on HGCP.
- Dperf is a common system-level one-stop performance analysis and bottleneck positioning system for deep learning training.
- this tool uniformly captures the traffic information of key data paths on each computing node, such as NET, IO, host-to-device (H2D), and peer-to-peer (P2P), together with the utilization information of key computing resources such as the CPU, Double Data Rate memory (DDR), and Graphics Double Data Rate memory (GDDR), and displays them on the same time axis, which makes it convenient for the business to locate program bottlenecks and perform targeted optimization.
- the Dperf training tool is combined with the cluster task scheduling to automatically monitor the tasks of the GPU training cluster.
- the Dperf provided in the embodiments of the present application has the advantages of low overhead, multi-dimensionality, easy scalability, fine granularity, and visualization. For example, refer to FIG. 4.
- Fig. 4 is a schematic diagram of system-level performance constraint analysis of a cluster system provided by an embodiment of the present application. Please refer to Figure 4.
- the entire process of deep learning training involves environment preparation, data reading, data preprocessing, forward training, backward training, and parameter update.
- data storage is constrained by the CPU, main memory, and hard disk IO, while the training process is affected by factors such as the uplink and downlink links and the GPU memory.
- the Dperf system-level performance analysis tool analyzes which hardware constrains the program. For example, if data reading and preprocessing take a long time and the system has spare CPU and disk resources, more data-processing processes can be opened to increase the data-processing speed; if the training program waits a long time for training data, data processing and training can be executed asynchronously to reduce the waiting time.
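- the rule of thumb above can be captured in a few lines; the following sketch and its thresholds are illustrative assumptions, not values from the application.

```python
def suggest_optimization(read_ms: float, train_ms: float, wait_ms: float,
                         cpu_util: float, disk_util: float) -> str:
    """Toy decision rule mirroring the analysis above; thresholds are
    illustrative only."""
    if read_ms > train_ms and cpu_util < 0.5 and disk_util < 0.5:
        return "spawn more data-preprocessing workers"
    if wait_ms > 0.2 * train_ms:
        return "overlap data processing with training (async pipeline)"
    return "training is compute-bound; profile GPU kernels next"
```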
- the current HPC computing nodes are limited by the number of GPU cards, communication, power consumption, heat dissipation, and other issues; their computing power density is low, and they cannot meet the needs of model training tasks.
- the HGCP provided in the embodiments of the present application uses computing nodes with GPUs, which have high computing density and high heat dissipation efficiency, and supports the systemization of hardware modules, the standardization of interconnection interfaces, and the flexibility of interconnection topology, leading the hardware development direction of AI computing and effectively supporting cluster AI training tasks.
- the current HPC lacks real-time fine-grained monitoring of each computing node and computing task, so the utilization information of key resources such as the CPU, DDR, GPU, and GDDR cannot be uniformly captured and displayed on the same axis; users and administrators can only log in to the physical node to view the operating status of the machine, or are passively informed of fault information by the business, which greatly affects the efficiency of cluster operation.
- a monitoring platform and a hardware monitoring plug-in are deployed in the HGCP cluster to monitor and collect data in real time.
- the key performance data, such as CPU, GPU, memory, network, and storage, of functional components such as the control nodes and computing nodes of the HGCP cluster can be displayed visually in a graphical manner, so as to understand the operating status of the hardware environment, discover failures that may be hidden in the HGCP in time, and then provide solutions at the first opportunity.
- FIG. 5 is a schematic diagram of memory monitoring of a computing node of a cluster system provided in an embodiment of the present application. Please refer to Figure 5. From 14:40 to 15:40, the memory occupation of a computing node is shown in the waveform in the figure.
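- a node-side collector for such a memory chart could be as simple as the sketch below; it assumes the third-party psutil package and a hypothetical push_to_monitor sender, since the application does not describe the plug-in's internals.

```python
import time

import psutil  # third-party package, assumed installed on each node

def push_to_monitor(sample: dict) -> None:
    print(sample)  # placeholder: a real plug-in would POST to the platform

def sample_memory(interval_s: float = 60.0) -> None:
    """Periodically sample node memory usage, as a hardware-monitoring
    plug-in on each computing node might do (simplified sketch)."""
    while True:
        mem = psutil.virtual_memory()
        push_to_monitor({
            "ts": int(time.time()),
            "used_bytes": mem.used,
            "percent": mem.percent,
        })
        time.sleep(interval_s)
```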
- the HGCP provided in the embodiments of this application streamlines operation and maintenance from the beginning of its construction, realizing process orientation, process standardization, and standard automation. At the same time, operation and maintenance automation cannot solve all problems and should not be automation for its own sake: 20% of repetitive tasks consume 80% of time and energy, so concentrating on automating that 20% of repetitive tasks already achieves a good state.
- the cluster automated operation and maintenance tool is designed to manage a large number of computing nodes and has a single graphical user interface.
- the HGCP cluster provided by the embodiment of the present application performs machine management through a super-management platform system.
- Fig. 6 is a flowchart of a model training method provided by an embodiment of the present application. This embodiment explains in detail the model training method described in the embodiment of the present application from the perspective of the interaction between the control node and the computing node.
- the present embodiment includes:
- the client on the first terminal sends resource information required for training the target model to the API server.
- the control node receives the first request sent by the application program interface API server.
- the first request is obtained by the API server according to the resource information required by the training target model sent by the first user through the client on the first terminal.
- the current HPC does not have a user client, so it is cumbersome for the first user to train a model with HPC.
- the first user usually refers to a person who trains a model, such as a researcher; the model may be any of various AI models, such as a face recognition model or a face detection model, which is not limited in the embodiment of the present application.
- the model training method provided by the embodiment of the present application encapsulates the HGCP in advance to obtain a client, which is stored on HDFS for download by the first user.
- the first user downloads and installs the client on the first terminal device, and the client is used to submit training tasks to HGCP.
- the control node allocates target resources to the target model according to the resource information.
- the current HPC training-task management is coarse-grained. Although it supports multi-tenancy, that is, it can be used by multiple first users at the same time with different first users training different target models, the usage requirements of different first users have peaks and troughs. Most existing slurm-based HPCs use the FIFO queuing mechanism by default, with no priority limits and no support for overcommitting quotas, which leaves some first users' resources idle while other first users have no resources available.
- the computing resources of HGCP include CPU, GPU, memory, FPGA, etc.
- the configuration interface is displayed on the display interface of the first terminal device for the first user to configure the number of computing nodes required for training the target model and, for each computing node, which CPUs, GPUs, and so on are to be used.
- the first terminal device generates the resource information required for training the target model according to the configuration input by the user and sends the resource information to the API server; the API server integrates the resource information, generates a first request, and sends it to the control node.
- the control node allocates computing resources for the target model according to the first request.
- for example, the resource information carried in the first request specifies 4 computing nodes and 16 GPUs, and the control node allocates 4 computing nodes to the target model. Assuming that there are 8 GPUs on each computing node, the 4 computing nodes may each provide 4 GPUs for the target model, or the 4 computing nodes may provide 6, 4, 4, and 2 GPUs respectively.
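- a greedy split such as the sketch below reproduces this behavior; it is an illustrative policy, not the allocator actually used by the application.

```python
def allocate_gpus(free_gpus: dict, nodes_wanted: int, gpus_wanted: int) -> dict:
    """Greedily split `gpus_wanted` GPUs over exactly `nodes_wanted` nodes;
    `free_gpus` maps node name -> number of idle GPUs."""
    plan, remaining = {}, gpus_wanted
    # Prefer the nodes with the most idle GPUs to limit fragmentation.
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        if len(plan) == nodes_wanted:
            break
        nodes_left = nodes_wanted - len(plan) - 1
        # Keep at least one GPU in reserve for each node still to be chosen.
        take = min(free, remaining - nodes_left)
        if take > 0:
            plan[node] = take
            remaining -= take
    if remaining != 0 or len(plan) != nodes_wanted:
        raise RuntimeError("insufficient free GPUs for this request")
    return plan

# allocate_gpus({"node1": 8, "node2": 8, "node3": 8, "node4": 8}, 4, 16)
# -> {"node1": 8, "node2": 6, "node3": 1, "node4": 1}
```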
- the control node sends a second request to the target computing node.
- the target computing node is a computing node containing the target resource.
- after the control node allocates the target resource to the target model, it sends a second request to the computing node containing the target resource to trigger the target computing node to train the target model.
- the target computing node uses the target resource to train a target model.
- the target computing node stores the trained target model in the storage node.
- continuing with the example in step 102 above, assuming that the target computing nodes providing the 16 GPUs are computing node 1, computing node 2, computing node 3, and computing node 4, the four computing nodes serve as target computing nodes and train the target model in a distributed manner. After training is completed, each stores its trained part in the storage node, for example in HDFS.
- after receiving the first request sent by the API server, the control node allocates target resources to the target model according to the first request and sends a second request to the target computing node containing the target resources, to trigger the target computing node to perform model training and store the trained model in the HDFS system.
- software improvements generally include system architecture improvements and slurm open application programming interface (Application Programming Interface, API) improvements. The two improvements will be described in detail below.
- FIG. 7 is a schematic diagram of the system architecture of HGCP in the model training method provided by an embodiment of the present application.
- the HGCP system provided by the embodiment of the present application realizes complete isolation of users and resources.
- the first user downloads and installs the client from the HDFS system and sends the resource information required for training the target model to the API server through the client, so that the API server integrates the resource information to obtain the first request and submits the first request for training the model to the control node.
- while the target task is running on the target computing node, the first user can send a query request to the target computing node through the first terminal device.
- the query request is used to request display of the usage status of the target resource on the target computing node when training the target model.
- after receiving the query request, the target computing node obtains the running status of the task of training the target model and the data generated during the running process. Then, the target computing node sends a query response to the first terminal device, where the query response carries the usage status information of the target resource, so that the first terminal device displays the usage status of the target resource according to the usage status information.
- after the target model is trained, it is maintained on the HDFS system, and the first user or other users can download the final result from the HDFS system.
- each module in FIG. 7 will be described in detail below.
- because the client is stored on the HDFS system, the first user can download it anywhere and send resource information to the API server through the client, so that the API server integrates the resource information to obtain the first request and sends the first request to the control node; one first request can be regarded as one task.
- the resource information carried in the first request includes at least one of the following: the number of target computing nodes, the number of GPUs occupied on a target computing node when training the target model, the number of CPUs occupied on a target computing node when training the target model, the path of the HDFS system, and the HDFS user name or password.
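- to make the flow concrete, a first-request payload might resemble the sketch below; the field names, URL, and response shape are hypothetical, since the application does not define the wire format.

```python
import requests  # assumed available; any HTTP client would do

# Hypothetical payload mirroring the resource information listed above.
job_request = {
    "job_name": "resnet50-train",
    "num_nodes": 4,            # number of target computing nodes
    "gpus_per_node": 4,        # GPUs occupied on each node
    "cpus_per_node": 2,        # CPUs occupied on each node
    "hdfs_path": "hdfs://cluster/user/alice/job1",
    "hdfs_user": "alice",
    "hdfs_password": "***",
}

# The client submits asynchronously and polls for status later.
resp = requests.post("http://api-server.example/v1/jobs",  # hypothetical URL
                     json=job_request, timeout=10)
resp.raise_for_status()
job_id = resp.json()["job_id"]  # hypothetical response field
```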
- the background corresponding to the client uses the slurm OPEN API described in the embodiments of this application to perform tasks such as submitting, viewing, terminating, and obtaining training data, and the job submission adopts an asynchronous submission mode.
- FIG. 8 is a schematic diagram of the process of submitting tasks in the model training method provided by the embodiment of the present application.
- the first user submits the job to the upper layer through the client on the first terminal device, the API service (server) performs request authentication, and the job is stored in the database after the authentication is passed.
- the job manager running on the control node obtains the submitted job from the database and submits the job to HGCP.
- the job synchronization controller (Job SyncController) running on the computing node synchronizes the running status of the job to the monitor server and the slurm resource management system.
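- since the cluster is built on slurm, such a controller could read job states with the standard squeue command, as in the sketch below; the diffing and push logic are assumptions.

```python
import subprocess

def poll_slurm_jobs() -> dict:
    """Map job id -> state via slurm's `squeue`; `-h` drops the header and
    `-o "%i %T"` prints the job id and state (standard squeue options)."""
    out = subprocess.run(["squeue", "-h", "-o", "%i %T"],
                         capture_output=True, text=True, check=True).stdout
    states = {}
    for line in out.splitlines():
        job_id, state = line.split()
        states[job_id] = state  # e.g. RUNNING, PENDING, COMPLETING
    return states

# A Job SyncController loop might diff these states against the database
# and push any changes to the monitor server (sketch only).
```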
- the HDFS system is a system used to temporarily store the user execution environment and store the final trained model, where the user execution environment is the aforementioned client.
- the embodiment of the present application is not limited to the HDFS system; in other feasible implementation manners, the storage may also be a file system private to the first user.
- the resource scheduler is a module on the control node, which is used to allocate target resources to the target model according to the first request.
- the granularity of resource allocation is the GPU instead of the computing node: if a model training task of the first user cannot use up all the GPUs on the target computing node, the remaining GPUs on the target computing node can be allocated to other training tasks.
- the scheduler can support mixed scheduling of CPUs and GPUs at the same time. For example, when the first user submits a training task whose required resource is a GPU, if not all the GPUs are used up, other users can also submit training tasks that use the remaining GPUs.
- resources are divided at the granularity of computing nodes and GPUs, and one training task can be run on different GPUs of different computing nodes.
- FIG. 9 is a schematic diagram of the slurm OPEN API in the model training method provided by the embodiment of the present application.
- the architecture includes:
- Third-party platforms refer to deep learning platforms, such as Paddle Cloud;
- Cluster component refers to the slurm cluster client
- API server refers to the unified entrance of slurm OPEN API, responsible for route analysis and request processing, etc.;
- Authentication refers to the slurm cluster authentication service module
- Database refers to the XDB data platform, which stores data such as user permissions, job information, and queue quota (quota);
- Job manager: used for job management control, responsible for job queuing and submission control;
- job synchronization controller (job sync Controller) is responsible for synchronizing data such as job status, GPU utilization, GPU slot, node rank, and time;
- Queue synchronization controller responsible for pushing queue update events to third-party platforms (new queue, queue Quota update, etc.);
- Node Monitoring Service Deployed on each computing node, providing running data of training jobs on that computing node.
- Open API interface authentication is mainly used to authenticate the identity of a request and judge the legitimacy of the current request. Common methods include token authentication and AK/SK authentication; for interface access security, this application uses the AK/SK authentication method.
- the control node receives a management request sent by a second user using a second terminal device, where the management request carries the access key identifier of the second user and a first key, and the first key is generated by the second terminal device using a preset authentication mechanism.
- the control node calls the cluster open application program interface (Open API) to authenticate the second user: the control node calls the cluster Open API and uses the preset authentication mechanism to generate a second key.
- if the first key is the same as the second key, the control node determines the management authority of the second user and, according to the management authority, sends the second terminal device a data stream for updating the management platform graphical interface, so that the second terminal device updates and displays the management platform graphical interface and the second user can manage the cluster system through the updated management platform graphical interface.
- the access key identifier (access key ID, AK) is used to identify the second user.
- the first key is, for example, a secret access key (SK), which is held by the second user.
- after receiving the management request sent by the second user, the control node uses the same preset authentication mechanism to generate an authentication string, referred to below as the second key. The control node then compares the first key in the management request with the generated second key; if the two keys are the same, it determines the management authority of the second user and performs the related operations. If the two keys are not the same, the control node ignores the operation and returns an error code to the second terminal device.
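- the application does not name the preset authentication mechanism; HMAC-SHA256 over a canonical request string is a common choice for AK/SK schemes, so the sketch below uses it purely as an assumed example.

```python
import hashlib
import hmac

SECRETS = {"AK123": b"sk-demo-secret"}  # access key id -> SK (demo data only)

def sign(sk: bytes, canonical_request: bytes) -> str:
    """Client and server both derive a key from the request content."""
    return hmac.new(sk, canonical_request, hashlib.sha256).hexdigest()

def authenticate(ak: str, first_key: str, canonical_request: bytes) -> bool:
    """Server side: regenerate the second key and compare in constant time."""
    sk = SECRETS.get(ak)
    if sk is None:
        return False
    second_key = sign(sk, canonical_request)
    return hmac.compare_digest(first_key, second_key)

# Client side: first_key = sign(sk, canonical_request) is sent together
# with the AK in the management request.
```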
- FIG. 10 is a schematic diagram of the authentication process in the model training method provided by the embodiment of the present application.
- the second user sends the AK/SK to the authentication service on the control node through the client on the second terminal device, and the authentication service returns a token to the second terminal device. After that, the client on the second terminal device sends a management request and the token to the API service on the control node, and the API service sends a management response to the second terminal device according to the management request and the token.
- the second user is an administrator; administrators can be divided into multiple levels, such as cluster administrators, department administrators, and ordinary users. For example, see Table 2.
- FIG. 11 is a schematic diagram of the deployment of api server in the model training method provided by the embodiment of the present application.
- the api server is deployed on three servers: server1, server2, and server3.
- server1 deploys the job_manager, the job_sync_controller, and 4 apiserver instances;
- server2 and server3 each deploy 1 nginx instance;
- api server is bound to nginx
- nginx is bound to BGW.
- a super-management platform is set up for HGCP for machine management, cluster management, etc., mainly to provide the following main features for administrators and users:
- the HGCP super-management platform system runs on a LINUX server, and uses MySQL database to store statistics, monitoring, configuration, logs and other data.
- the back-end integrates general functions into modules and is developed using Hypertext Preprocessor (PHP), Python, Ansible, and Shell; it operates on the database data and the computing nodes through the super-management platform API interface.
- the front-end display page is for cluster administrators and ordinary users, simplifying operations as much as possible and improving efficiency;
- the HGCP provided in the embodiments of this application is equipped with multiple control nodes to ensure the disaster tolerance and service continuity of the management system. These control nodes use Ansible to remotely manage the cluster computing nodes to perform environment configuration, upgrade adjustments, system inspections, and so on.
- the control node receives a management request sent by a second user using a second terminal device, where the management request is used to request management of the computing nodes in the cluster system and is obtained by the second terminal device according to the user's operation on the management platform graphical interface. The control node then calls the cluster Open API to authenticate the second user; if the second user passes the authentication, the control node manages the computing nodes in the cluster system according to the management request.
- FIG. 12 is a working schematic diagram of the super-management platform in the model training method provided by the embodiment of the present application.
- the cluster Open API includes cluster management API and machine management API.
- the screen of the second terminal device displays the management platform graphical interface of the super-management platform.
- the cluster administrator performs cluster operations through the super-management platform graphical interface, which calls down to the cluster management API or the machine management API.
- when the cluster management API is called, the management request is used to request the creation or deletion of a cluster.
- the cluster information in the database is configured, and the underlying cluster management module (cluster_manager) detects that there is a new operation task in the database and starts to perform related operations;
- when the machine management API is called, the management request is used to request that any one of the at least one computing node perform any one of the following operations: going online, going offline, restarting, reinstalling, repairing, and shielding.
- the node information in the database is configured, and the node management module (node_manager) detects that there is a new operation task in the database and starts to perform the related operations.
- Operations for clusters include:
- for cluster security, the administrator must first take all machines in the cluster offline before deleting the cluster. During the deletion process, the parameters are first verified, including whether the cluster exists, whether there are still running machines in the cluster, and whether the parameters are legal; then the cluster task is written into the cluster operation task table (cluster_task), the task operation (task_op) is set to uninstall, and the task status (task_status) is set to pending; finally, the cluster manager completes the real offline operation.
- the basic information list of the cluster includes the cluster_info table and the cluster_task table.
- the cluster_info table contains the information of clusters that are running online, and the cluster_task table contains the information of clusters with operations in progress. If the two tables refer to the same cluster and an offline operation is in progress, the status is based on the status in the cluster_task table.
- the cluster_info table contains the online clusters, and the node_info table aggregates the required information.
- the cluster machine list includes node_info table and node_task table.
- the node_info table obtains the list of online machines
- the node_task table obtains the list of machines with operations in progress.
- going online is an operation, and the effect of this operation is to expand the capacity of the cluster system.
- the parameters are verified, including verifying the existence of the cluster, and verifying the validity of the online parameters.
- then, the online task is written into the node operation task table (node_task), the task operation (task_op) is set to install, the task status (task_status) is set to pending, and the information about the node to be brought online is written into the node information table (node_info) and marked as installing; finally, the node manager completes the actual online operation and updates the task and info tables.
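- the write-task/poll-worker pattern described above can be sketched as follows; sqlite3 stands in for the real database, and the column names follow the tables named in the text, though the exact schema is an assumption.

```python
import sqlite3  # stand-in for the real database

def submit_online_task(db: sqlite3.Connection, node: str, cluster: str) -> None:
    """Record an 'install' task for a node; a node manager process polls
    node_task and performs the actual online operation (sketch only)."""
    db.execute(
        "INSERT INTO node_task (node, cluster, task_op, task_status) "
        "VALUES (?, ?, 'install', 'pending')",
        (node, cluster),
    )
    db.execute(
        "UPDATE node_info SET status = 'installing' WHERE node = ?",
        (node,),
    )
    db.commit()
```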
- when a machine is taken offline, it is automatically marked as unschedulable first, and then the offline process is executed.
- the parameters are first verified, including verifying the existence of the cluster, and verifying the validity of the offline parameters. After that, query the node information (node_info) table.
- the parameters are first verified, including verifying whether the machine has joined the cluster and verifying the validity of the parameters. After that, the cluster information (cluster_info) table is queried to obtain the cluster apiserver address, and the apiserver interface is called to complete the state shielding.
- the parameters are first verified, including verifying whether the machine has joined the cluster and verifying the validity of the parameters. After that, the cluster information (cluster_info) table is queried to obtain the cluster apiserver address, and the apiserver interface is called to complete the machine attribution label change.
- FIG. 13 is a schematic structural diagram of a model training device provided by an embodiment of the application.
- the device can be integrated in an electronic device or realized by an electronic device, and the electronic device can be a terminal device or a server.
- the model training apparatus 100 may include:
- the receiving unit 11 is configured to receive a first request sent by an application program interface (API) server, where the first request is obtained by the API server according to the resource information required for training the target model sent by the first user through the client on the first terminal;
- the processing unit 12 is configured to allocate target resources to the target model according to the resource information
- the sending unit 13 is configured to send a second request to a target computing node, so that the target computing node uses the target resource to train a target model.
- the resource information includes at least one of the following information: the number of target computing nodes, the number of GPUs occupied on a target computing node when training the target model, and the number of CPUs occupied on a target computing node when training the target model.
- the receiving unit 11 is further configured to receive a management request sent by a second terminal device, and the management request is used to request management of computing nodes in the cluster system;
- the processing unit 12 is further configured to manage the computing nodes in the cluster system according to the management request.
- when managing the computing nodes in the cluster system according to the management request, the processing unit 12 calls the cluster open application program interface (Open API) to authenticate the second user; after the second user passes the authentication, the processing unit 12 manages the computing nodes in the cluster system according to the management request.
- the management request carries the access key identifier of the second user and the first key, and the first key is generated by the second terminal device using a preset authentication mechanism;
- the processing unit 12 is configured to call the cluster Open API and use the preset authentication mechanism to generate a second key, and if the first key and the second key are the same, determine the management authority of the second user;
- the sending unit 13 is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
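- The application does not specify what the preset authentication mechanism is; assuming, purely for illustration, an HMAC-SHA256 signature over the request body, the first-key/second-key comparison could look like the following sketch (all names are hypothetical).

```python
# Hedged sketch of the access-key authentication above, assuming the
# preset mechanism is HMAC-SHA256; the key store is a toy placeholder.
import hashlib
import hmac

SECRET_KEYS = {"AK123": b"shared-secret-for-ak123"}  # access key id -> secret


def authenticate(access_key_id: str, first_key: str, request_body: bytes) -> bool:
    secret = SECRET_KEYS.get(access_key_id)
    if secret is None:
        return False
    # The control node regenerates the "second key" with the same mechanism...
    second_key = hmac.new(secret, request_body, hashlib.sha256).hexdigest()
    # ...and grants management authority only if the two keys match.
    return hmac.compare_digest(first_key, second_key)
```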
- the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster;
- the cluster Open API includes a machine management API, and the management request is used to request any one of the at least one computing node to perform any of the following operations: online, offline, restart, reinstall, maintenance, shield.
- the device provided in the embodiment of the present application can be used in the method executed by the control node in the above embodiment, and its implementation principle and technical effect are similar, and will not be repeated here.
- FIG. 14 is a schematic structural diagram of a model training device provided by an embodiment of the application.
- the device can be integrated into an electronic device or implemented by an electronic device, and the electronic device can be a terminal device or a server.
- the model training device 200 may include:
- the receiving unit 21 is configured to receive a second request sent by the control node.
- the second request is sent after the control node receives the first request sent by the application program interface API server and allocates target resources to the target model.
- the first request is obtained by the API server according to the resource information required for training the target model sent by the first user through the client on the first terminal, and the target computing node is included in the at least one computing node;
- the processing unit 22 is configured to use the target resource to train the target model;
- the sending unit 23 is used to send the trained target model to the storage node.
- the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
- the receiving unit 21 is further configured to receive a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resource on the target computing node while the target resource is used to train the target model;
- the sending unit 23 is further configured to send a query response to the first terminal device, where the query response carries the usage status information of the target resource, so that the first terminal device displays the usage status of the target resource according to the usage status information.
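- A hedged sketch of the target computing node side (receiving unit 21, processing unit 22, sending unit 23) follows; the training callback and the storage client are placeholders standing in for the actual training stack and storage-node protocol, which this application does not fix.

```python
# Hedged sketch: consume the second request, train with the allocated
# target resources, push the trained model to the storage node, and
# answer usage-status queries from the first terminal device.
import json


class TargetComputingNode:
    def __init__(self, storage_client):
        self.storage = storage_client  # placeholder storage-node client
        self.usage = {"gpu_util": 0.0, "cpu_util": 0.0}

    def handle_second_request(self, target_resource: dict, train_fn):
        # train_fn stands in for the real training loop pinned to the
        # allocated GPUs/CPUs; it returns serialized model bytes.
        model_bytes = train_fn(target_resource)
        self.storage.put("trained_target_model.bin", model_bytes)

    def handle_query_request(self) -> str:
        # Query response carrying the usage status information.
        return json.dumps(self.usage)
```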
- the device provided in the embodiment of the present application can be used in the method executed by the target computing node in the above embodiment, and its implementation principles and technical effects are similar, and will not be repeated here.
- Fig. 15 is a block diagram of an electronic device used to implement the model training method of an embodiment of the present application.
- Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
- Electronic devices can also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
- the components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the application described and/or required herein.
- the electronic device includes: one or more processors 31, memory 32, and interfaces for connecting various components, including a high-speed interface and a low-speed interface.
- the various components are connected to each other using different buses, and can be installed on a common motherboard or installed in other ways as needed.
- the processor may process instructions executed in the electronic device, including instructions stored in or on the memory to display graphical information of the GUI on an external input/output device (such as a display device coupled to an interface).
- if necessary, multiple processors and/or multiple buses can be used together with multiple memories.
- multiple electronic devices can be connected, and each device provides part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system).
- One processor 31 is taken as an example in FIG. 15.
- the memory 32 is a non-transitory computer-readable storage medium provided by this application.
- the memory stores instructions executable by at least one processor, so that the at least one processor executes the model training method provided in this application.
- the non-transitory computer-readable storage medium of the present application stores computer instructions, and the computer instructions are used to make a computer execute the model training method provided in the present application.
- the memory 32 can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the model training method in the embodiments of the present application (for example, the receiving unit 11, the processing unit 12, and the sending unit 13 shown in FIG. 13, and the receiving unit 21, the processing unit 22, and the sending unit 23 shown in FIG. 14).
- the processor 31 executes various functional applications and data processing of the server by running non-transitory software programs, instructions, and modules stored in the memory 32, that is, implementing the method of model training in the foregoing method embodiment.
- the memory 32 may include a program storage area and a data storage area.
- the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the electronic device for model training.
- the memory 32 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
- the memory 32 may optionally include memories remotely provided with respect to the processor 31, and these remote memories may be connected to an electronic device for model training via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
- the electronic device for the model training method may further include: an input device 33 and an output device 34.
- the processor 31, the memory 32, the input device 33, and the output device 34 may be connected by a bus or in other ways. In FIG. 15, the connection by a bus is taken as an example.
- the input device 33 can receive input digital or character information and generate key signal inputs related to the user settings and function control of the electronic device for model training, such as a touch screen, keypad, mouse, trackpad, touchpad, pointing stick, one or more mouse buttons, trackball, joystick, and other input devices.
- the output device 34 may include a display device, an auxiliary lighting device (for example, LED), a tactile feedback device (for example, a vibration motor), and the like.
- the display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.
- Various implementations of the systems and techniques described herein can be realized in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit the data and instructions to the storage system, the at least one input device, and the at least one output device.
- the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or apparatus (for example, magnetic disks, optical disks, memories, or programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals.
- the term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
- to provide interaction with a user, the systems and techniques described here can be implemented on a computer that has: a display device for displaying information to the user (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer.
- Other types of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
- the systems and technologies described herein can be implemented in a computing system that includes back-end components (for example, as a data server), middleware components (for example, an application server), or front-end components (for example, a user computer with a graphical user interface or a web browser through which the user can interact with an implementation of the systems and technologies described herein), or any combination of such back-end, middleware, or front-end components.
- the components of the system can be connected to each other through any form or medium of digital data communication (for example, a communication network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.
- the computer system can include clients and servers.
- the client and server are generally remote from each other and usually interact through a communication network.
- the relationship between the client and the server is generated by computer programs that run on the corresponding computers and have a client-server relationship with each other.
- An embodiment of the present application also provides a cluster system, including: a control node and at least one computing node, wherein the control node establishes a network connection with each computing node of the at least one computing node based on the transmission control protocol TCP;
- the computing resources of the computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
- with computing nodes equipped with GPU cards, the hardware capability of the cluster system is greatly improved, thereby improving the efficiency of model training; in terms of software, the slurm framework is optimized, and the client, the super management platform, and the like are introduced, making the cluster system more convenient to use.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Debugging And Monitoring (AREA)
Abstract
Description
Operation | Operation method | Average operation time |
---|---|---|
New machine environment configuration | Automation script | 20 min |
Bringing a machine online into the cluster | Manual operation | 30 min |
Cluster resource queue adjustment | Manual operation | 10 min |
Machine breakdown repair | Maintenance personnel access | 1 day |
Fault information statistics | Manual operation | 1 hour |
Cluster environment upgrade | Manual operation | 1 day |
Claims (23)
- 1. A cluster system, comprising: a control node, at least one computing node, and a storage node; wherein the control node establishes a connection with each computing node of the at least one computing node and is configured to allocate computing resources for a task of training a target model; the computing node includes at least one central processing unit (CPU) and at least one graphics processing unit (GPU) and is configured to use the computing resources to train the target model; and the storage node establishes a network connection with each computing node of the at least one computing node and is configured to store data required for training the target model.
- 2. The system according to claim 1, wherein any two computing nodes in the at least one computing node establish a network connection based on the infinite bandwidth Infiniband technology, the CPU and the GPU inside a computing node are connected through the high-speed peripheral component interconnection PCIE, and the GPUs inside a computing node are connected through NV link.
- 3. A model training method, applicable to a cluster system of a control node, at least one computing node, and a storage node, the method comprising: receiving, by the control node, a first request sent by an application program interface (API) server, wherein the first request is obtained by the API server according to the resource information required for training a target model sent by a first user through a client on a first terminal; allocating, by the control node, target resources to the target model according to the resource information; and sending, by the control node, a second request to a target computing node, so that the target computing node uses the target resources to train the target model.
- 4. The method according to claim 3, wherein the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
- 5. The method according to claim 3 or 4, further comprising: receiving, by the control node, a management request sent by a second terminal device, wherein the management request is used to request management of the computing nodes in the cluster system; and managing, by the control node, the computing nodes in the cluster system according to the management request.
- 6. The method according to claim 5, wherein managing, by the control node, the computing nodes in the cluster system according to the management request comprises: calling, by the control node, a cluster open application program interface (Open API) to authenticate a second user; and if the second user passes the authentication, managing, by the control node, the computing nodes in the cluster system according to the management request.
- 7. The method according to claim 6, wherein the management request carries an access key identifier of the second user and a first key, the first key is generated by the second terminal device using a preset authentication mechanism, and calling, by the control node, the cluster Open API to authenticate the second user comprises: calling, by the control node, the cluster Open API to generate a second key using the preset authentication mechanism; if the first key and the second key are the same, determining, by the control node, the management authority of the second user; and sending, by the control node, authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
- 8. The method according to claim 6, wherein the cluster Open API includes a cluster management API and the management request is used to request creation or deletion of a cluster; or the cluster Open API includes a machine management API and the management request is used to request any one of the at least one computing node to perform any one of the following operations: online, offline, restart, reinstall, maintenance, or shield.
- 9. A model training method, applicable to a cluster system of a control node, at least one computing node, and a storage node, the method comprising: receiving, by a target computing node, a second request sent by the control node, wherein the second request is sent after the control node receives a first request sent by an application program interface (API) server and allocates target resources to a target model, the first request is obtained by the API server according to the resource information required for training the target model sent by a first user through a client on a first terminal, and the target computing node is included in the at least one computing node; training, by the target computing node, the target model using the target resources; and sending, by the target computing node, the trained target model to the storage node.
- 10. The method according to claim 9, wherein the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
- 11. The method according to claim 9 or 10, further comprising: receiving, by the target computing node, a query request sent by the first terminal device, wherein the query request is used to request display of the usage status of the target resources on the target computing node while the target resources train the target model; and sending, by the target computing node, a query response to the first terminal device, wherein the query response carries the usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
- 12. A model training apparatus, comprising: a receiving unit, configured to receive a first request sent by an application program interface (API) server, wherein the first request carries the resource information required for training a target model and is obtained by the API server according to the resource information required for training the target model sent by a first user through a client on a first terminal; a processing unit, configured to allocate target resources to the target model according to the resource information; and a sending unit, configured to send a second request to a target computing node, so that the target computing node uses the target resources to train the target model.
- 13. The apparatus according to claim 12, wherein the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
- 14. The apparatus according to claim 12 or 13, wherein the receiving unit is further configured to receive a management request sent by a second terminal device, the management request being used to request management of the computing nodes in the cluster system; and the processing unit is further configured to manage the computing nodes in the cluster system according to the management request.
- 15. The apparatus according to claim 14, wherein, when managing the computing nodes in the cluster system according to the management request, the processing unit calls a cluster open application program interface (Open API) to authenticate a second user, and if the second user passes the authentication, manages the computing nodes in the cluster system according to the management request.
- 16. The apparatus according to claim 15, wherein the management request carries an access key identifier of the second user and a first key, the first key being generated by the second terminal device using a preset authentication mechanism; the processing unit is configured to call the cluster Open API to generate a second key using the preset authentication mechanism, and if the first key and the second key are the same, determine the management authority of the second user; and the sending unit is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
- 17. The apparatus according to claim 15, wherein the cluster Open API includes a cluster management API and the management request is used to request creation or deletion of a cluster; or the cluster Open API includes a machine management API and the management request is used to request any one of the at least one computing node to perform any one of the following operations: online, offline, restart, reinstall, maintenance, or shield.
- 18. A model training apparatus, comprising: a receiving unit, configured to receive a second request sent by a control node, wherein the second request is sent after the control node receives a first request sent by an application program interface (API) server and allocates target resources to a target model, the first request is obtained by the API server according to the resource information required for training the target model sent by a first user through a client on a first terminal, and the target node is included in at least one computing node; a processing unit, configured to train the target model using the target resources; and a sending unit, configured to send the trained target model to a storage node.
- 19. The apparatus according to claim 18, wherein the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
- 20. The apparatus according to claim 18 or 19, wherein the receiving unit is further configured to receive a query request sent by the first terminal device, the query request being used to request display of the usage status of the target resources on the target computing node while the target resources train the target model; and the sending unit is further configured to send a query response to the first terminal device, the query response carrying the usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
- 21. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method according to any one of claims 3-8, or to enable the at least one processor to execute the method according to any one of claims 9-11.
- 22. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method according to any one of claims 3-8, or the computer instructions are used to cause the computer to execute the method according to any one of claims 9-11.
- 23. A cluster system, comprising: a control node and at least one computing node, wherein the control node establishes a network connection with each computing node of the at least one computing node based on the transmission control protocol (TCP); and the computing resources of the computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010080825.4A CN111327692A (en) | 2020-02-05 | 2020-02-05 | Model training method and device and cluster system |
CN202010080825.4 | 2020-02-05 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021155667A1 true WO2021155667A1 (en) | 2021-08-12 |
Family
ID=71172573
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/117723 WO2021155667A1 (en) | 2020-02-05 | 2020-09-25 | Model training method and apparatus, and clustering system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111327692A (en) |
WO (1) | WO2021155667A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111327692A (en) * | 2020-02-05 | 2020-06-23 | 北京百度网讯科技有限公司 | Model training method and device and cluster system |
CN111984744B (en) * | 2020-08-13 | 2021-03-19 | 北京陌陌信息技术有限公司 | Information processing method based on remote communication and artificial intelligence and cloud service platform |
CN112087506B (en) * | 2020-09-01 | 2023-02-07 | 北京火山引擎科技有限公司 | Cluster node management method and device and computer storage medium |
CN112241321A (en) * | 2020-09-24 | 2021-01-19 | 北京影谱科技股份有限公司 | Computing power scheduling method and device based on Kubernetes |
CN113033098B (en) * | 2021-03-26 | 2022-05-17 | 山东科技大学 | Ocean target detection deep learning model training method based on AdaRW algorithm |
CN113159284A (en) * | 2021-03-31 | 2021-07-23 | 华为技术有限公司 | Model training method and device |
CN114584455B (en) * | 2022-03-04 | 2023-06-30 | 吉林大学 | Small and medium-sized high-performance cluster monitoring system based on enterprise WeChat |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107480717A (en) * | 2017-08-16 | 2017-12-15 | 北京奇虎科技有限公司 | Train job processing method and system, computing device, computer-readable storage medium |
CN107766148A (en) * | 2017-08-31 | 2018-03-06 | 北京百度网讯科技有限公司 | A kind of isomeric group and task processing method and device |
CN108564164A (en) * | 2018-01-08 | 2018-09-21 | 中山大学 | A kind of parallelization deep learning method based on SPARK platforms |
US20180314926A1 (en) * | 2017-04-28 | 2018-11-01 | Intel Corporation | Smart memory handling and data management for machine learning networks |
CN109086134A (en) * | 2018-07-19 | 2018-12-25 | 郑州云海信息技术有限公司 | A kind of operation method and device of deep learning operation |
CN109409738A (en) * | 2018-10-25 | 2019-03-01 | 平安科技(深圳)有限公司 | Method, the electronic device of deep learning are carried out based on block platform chain |
CN110018817A (en) * | 2018-01-05 | 2019-07-16 | 中兴通讯股份有限公司 | The distributed operation method and device of data, storage medium and processor |
CN110413294A (en) * | 2019-08-06 | 2019-11-05 | 中国工商银行股份有限公司 | Service delivery system, method, apparatus and equipment |
CN111327692A (en) * | 2020-02-05 | 2020-06-23 | 北京百度网讯科技有限公司 | Model training method and device and cluster system |
- 2020
- 2020-02-05 CN CN202010080825.4A patent/CN111327692A/en active Pending
- 2020-09-25 WO PCT/CN2020/117723 patent/WO2021155667A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN111327692A (en) | 2020-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021155667A1 (en) | Model training method and apparatus, and clustering system | |
US11150952B2 (en) | Accelerating and maintaining large-scale cloud deployment | |
JP7421511B2 (en) | Methods and apparatus, electronic devices, readable storage media and computer programs for deploying applications | |
US9606824B2 (en) | Administering virtual machines in a distributed computing environment | |
US10255097B2 (en) | Administering virtual machines in a distributed computing environment | |
JP7170768B2 (en) | Development machine operation task processing method, electronic device, computer readable storage medium and computer program | |
US9503515B2 (en) | Administering virtual machines in a distributed computing environment | |
US9612857B2 (en) | Administering virtual machines in a distributed computing environment | |
US9703587B2 (en) | Administering virtual machines in a distributed computing environment | |
US8977752B2 (en) | Event-based dynamic resource provisioning | |
WO2013135016A1 (en) | Version construction system and method | |
CN114579250A (en) | Method, device and storage medium for constructing virtual cluster | |
CN110019059B (en) | Timing synchronization method and device | |
Li et al. | Improving spark performance with zero-copy buffer management and RDMA | |
Liu et al. | BSPCloud: A hybrid distributed-memory and shared-memory programming model | |
WO2021174791A1 (en) | Task migration method and apparatus, and electronic device and storage medium | |
Zhou et al. | Software-defined streaming-based code scheduling for transparent computing | |
CN115242786B (en) | Multi-mode big data job scheduling system and method based on container cluster | |
Lascu et al. | IBM zEnterprise EC12 technical guide | |
Li et al. | Collaborative Management System Driven by Task Flow in Supercomputing Environment | |
Zou et al. | Structural finite element method based on cloud computing | |
KR20210043523A (en) | Data mining system, method, apparatus, electronic device and storage medium | |
CN117742891A (en) | Virtual machine creation method, device and equipment with vDPA equipment and storage medium | |
CN118260036A (en) | Method, system and medium for processing Flink operation | |
CN112783610A (en) | Saltstack-based Ceph deployment host node |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20917276; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20917276; Country of ref document: EP; Kind code of ref document: A1 |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/03/2023) |