CN117891618A - Resource task processing method and device of artificial intelligence model training platform - Google Patents


Publication number
CN117891618A
CN117891618A (application CN202410298768.5A)
Authority
CN
China
Prior art keywords: model training, task, unit, resource, allocable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410298768.5A
Other languages
Chinese (zh)
Inventors: 刘浩; 郑东; 赵五岳; 赵拯; 彭观海; 徐宇杰; 庄庆云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Universal Ubiquitous Technology Co ltd
Original Assignee
Universal Ubiquitous Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universal Ubiquitous Technology Co ltd filed Critical Universal Ubiquitous Technology Co ltd
Priority to CN202410298768.5A
Publication of CN117891618A
Legal status: Pending


Abstract

An embodiment of the application provides a resource task processing method and device for an artificial intelligence model training platform. The method comprises the following steps: determining allocatable units of a distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and preset graphics card parameters; dividing the allocatable units into groups according to the physical machine server to which they belong to obtain a plurality of allocatable unit groups, and constructing a corresponding virtual cluster from at least one allocatable unit group; and receiving a model training request sent by a user, and performing computation according to the computing power demand in the model training request and the idle cloud computing service computing power units in the virtual clusters of the distributed model training platform to obtain a model training result. The application can effectively improve platform resource utilization during distributed artificial intelligence model training.

Description

Resource task processing method and device of artificial intelligence model training platform
Technical Field
The application relates to the field of artificial intelligence, and in particular to a resource task processing method and device for an artificial intelligence model training platform.
Background
With the development of artificial intelligence (AI) technology, the computing power and data demands of training AI models keep growing, and single-machine training can no longer meet them; a distributed training platform has become an indispensable technology for training current models. Such a platform enables training resources to be shared and allocates sufficient computing power according to the demands of user tasks. However, as the number of users grows and more tasks are submitted, problems such as resource fragmentation, low resource utilization, and even resource deadlock easily arise.
In the prior art, the resource deadlock problem is generally solved by batch task scheduling (Gang Scheduling): resources are allocated and a task is started only when all of the task's required resources can be satisfied, which essentially eliminates deadlock. Resource fragmentation and low resource utilization mainly stem from several causes. On one hand, fragmented resource allocation: computing resources mainly comprise CPUs, memory, and computing cards, and when submitting a task a user can specify the required number of computing cards, number of CPU cores, and memory size. Overly flexible allocation easily fragments the computing resources. For example, if 3 teams share one physical cluster with allocatable resource quotas of 40%, 30%, and 30%, then as tasks proceed it may happen that although 40% of the resources remain allocatable in total, fragmentation scatters that 40% across multiple physical servers, so a multi-card training task cannot be started. On another hand, unreasonable scheduling: with two 8-card servers, for example, tasks occupying only 1 card may be scheduled onto both machines, so that single-machine 8-card distributed training cannot be initiated and resource utilization stays low. A further aspect is that existing resource allocation tends to be exclusive, i.e., a card can only be allocated to one task, without fully taking the characteristics of different tasks into account.
Disclosure of Invention
Aiming at the problems in the prior art, the application provides a resource task processing method and device for an artificial intelligence model training platform, which can effectively improve platform resource utilization during distributed artificial intelligence model training.
In order to solve at least one of the above problems, the application provides the following technical solutions:
In a first aspect, the present application provides a resource task processing method for an artificial intelligence model training platform, including:
determining allocatable units of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and preset graphics card parameters;
dividing the allocatable units into groups according to the physical machine server to which they belong to obtain a plurality of allocatable unit groups, and constructing a corresponding virtual cluster according to at least one allocatable unit group;
and receiving a model training request sent by a user, and performing computation according to the computing power demand in the model training request and an idle cloud computing service computing power unit in a virtual cluster of the distributed model training platform to obtain a model training result.
Further, the determining the allocatable unit according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and the preset graphics card parameters includes:
determining the CPU parameters of the allocatable unit according to the total number of CPU cores of the physical machine server, the reserved minimum number of cores, and the number of cloud computing service computing power units corresponding to the CPUs of the physical machine server;
determining the memory parameters of the allocatable unit according to the total memory of the physical machine server, the reserved minimum memory, and the number of cloud computing service computing power units corresponding to the memory of the physical machine server;
and determining the allocatable unit according to the CPU parameters of the allocatable unit, the memory parameters of the allocatable unit, and the preset graphics card parameters.
Further, the dividing groups according to the physical machine server to which the allocatable units belong to obtain a plurality of allocatable unit groups, and constructing a corresponding virtual cluster according to at least one allocatable unit group, includes:
organizing the allocatable units on physical machine servers with the same identifier into at least one allocatable unit group, and constructing a corresponding virtual cluster according to the allocatable unit groups and a preset computing resource allocation policy;
and monitoring the resource utilization of each allocatable unit group in the virtual cluster through a preset task scheduler, and releasing the computing resources of an allocatable unit group once its resource utilization falls below a preset threshold, to obtain an updated virtual cluster.
Further, the receiving a model training request sent by a user and performing computation according to the computing power demand in the model training request and an idle cloud computing service computing power unit in a virtual cluster of the distributed model training platform to obtain a model training result includes:
receiving a model training request sent by a user, and determining the corresponding allocatable unit groups, the number of tasks allocated to the parent group of each allocatable unit group, and the number of remaining allocatable units according to the resource demand in the model training request and the idle cloud computing service computing power units of the distributed model training platform;
and allocating the training task corresponding to the model training request to the allocatable unit group with the largest number of allocated tasks among the allocatable unit groups whose number of remaining allocatable units satisfies the training task, and performing computation to obtain a model training result.
Further, the receiving a model training request sent by a user and performing computation according to the computing power demand in the model training request and an idle cloud computing service computing power unit in a virtual cluster of the distributed model training platform to obtain a model training result further includes:
receiving a model training request sent by a user, and, if the model training request corresponds to a data processing task, determining a corresponding estimated resource occupancy according to the network structure and batch processing information of the data processing task;
and determining a corresponding allocatable unit group according to the estimated resource occupancy, and allocating the data processing task to the allocatable unit group for computation to obtain a model training result, wherein a plurality of data processing tasks share the same allocatable unit group.
Further, the receiving a model training request sent by a user and performing computation according to the computing power demand in the model training request and an idle cloud computing service computing power unit in a virtual cluster of the distributed model training platform to obtain a model training result further includes:
receiving a model training request sent by a user, and, if the model training request corresponds to an algorithm development task, selecting a virtual cluster constructed from a designated low-computing-power allocatable unit group to perform computation according to the task demand in the model training request to obtain a model training result;
and monitoring, over a set time period, whether the peak utilization of the virtual cluster's CPU and cloud computing service computing power unit is below a preset minimum threshold, and if so, stopping the algorithm development task and releasing the computing resources of the virtual cluster.
Further, the receiving a model training request sent by a user and performing computation according to the computing power demand in the model training request and an idle cloud computing service computing power unit in a virtual cluster of the distributed model training platform to obtain a model training result further includes:
receiving a model training request sent by a user, and, if the model training request corresponds to an algorithm training task, performing computation through the virtual cluster according to the task demand in the model training request to obtain a model training result;
and monitoring, over a set time period, whether the peak utilization of the virtual cluster's CPU and cloud computing service computing power unit is below a preset minimum threshold, and if so, stopping the algorithm training task and releasing the computing resources of the virtual cluster.
In a second aspect, the present application provides a resource task processing device for an artificial intelligence model training platform, including:
an allocatable unit determining module, configured to determine the allocatable units of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and preset graphics card parameters;
a virtual cluster construction module, configured to divide groups according to the physical machine server to which the allocatable units belong to obtain a plurality of allocatable unit groups, and construct a corresponding virtual cluster according to at least one allocatable unit group;
and a model training module, configured to receive a model training request sent by a user, and perform computation according to the computing power demand in the model training request and the idle cloud computing service computing power units in the virtual clusters of the distributed model training platform to obtain a model training result.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the resource task processing method of the artificial intelligence model training platform when the program is executed.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the resource task processing method of the artificial intelligence model training platform.
In a fifth aspect, the present application provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the resource task processing method of the artificial intelligence model training platform.
According to the above technical solution, the application provides a resource task processing method and device for an artificial intelligence model training platform: allocatable units of the distributed model training platform are determined according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and preset graphics card parameters; the allocatable units are divided into groups according to the physical machine server to which they belong to obtain a plurality of allocatable unit groups, and corresponding virtual clusters are constructed according to at least one allocatable unit group; and a model training request sent by a user is received, and computation is performed according to the computing power demand in the model training request and the idle cloud computing service computing power units in the virtual clusters of the distributed model training platform to obtain a model training result, so that platform resource utilization during distributed artificial intelligence model training can be effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a first flow chart of a resource task processing method of an artificial intelligence model training platform according to an embodiment of the present application;
FIG. 2 is a second flow chart of the resource task processing method of the artificial intelligence model training platform according to an embodiment of the present application;
FIG. 3 is a third flow chart of the resource task processing method of the artificial intelligence model training platform according to an embodiment of the present application;
FIG. 4 is a fourth flow chart of the resource task processing method of the artificial intelligence model training platform according to an embodiment of the present application;
FIG. 5 is a fifth flow chart of the resource task processing method of the artificial intelligence model training platform according to an embodiment of the present application;
FIG. 6 is a sixth flow chart of the resource task processing method of the artificial intelligence model training platform according to an embodiment of the present application;
FIG. 7 is a seventh flow chart of the resource task processing method of the artificial intelligence model training platform according to an embodiment of the present application;
FIG. 8 is a block diagram of a resource task processing device of an artificial intelligence model training platform in an embodiment of the application;
Fig. 9 is a schematic structural diagram of an electronic device in an embodiment of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The acquisition, storage, use, and processing of data in the technical solution of the application all comply with the relevant provisions of national laws and regulations.
Considering the problems of resource fragmentation and low resource utilization in the prior art, the application provides a resource task processing method and device for an artificial intelligence model training platform: allocatable units of the distributed model training platform are determined according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and preset graphics card parameters; the allocatable units are divided into groups according to the physical machine server to which they belong to obtain a plurality of allocatable unit groups, and corresponding virtual clusters are constructed according to at least one allocatable unit group; and a model training request sent by a user is received, and computation is performed according to the computing power demand in the model training request and the idle cloud computing service computing power units in the virtual clusters of the distributed model training platform to obtain a model training result, so that platform resource utilization during distributed artificial intelligence model training can be effectively improved.
In order to effectively improve platform resource utilization during distributed artificial intelligence model training, the application provides an embodiment of a resource task processing method of an artificial intelligence model training platform. Referring to FIG. 1, the method specifically includes the following contents:
Step S101: determining allocatable units of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and preset graphics card parameters;
Optionally, in this embodiment, in order to reduce resource fragmentation at the source, the allocatable unit is constrained: each allocatable unit corresponds to exactly one computing power unit (computing card) of the cloud computing service.
It can be understood that the allocatable unit in this embodiment consists of CPU, memory, and graphics card, where the number of graphics cards is fixed at 1, and the CPU and memory are determined by the configuration of the physical machine; when physical machines are assembled, machines with the same class of computing card share the same configuration. The allocatable unit is calculated as follows:

allocatable-unit CPU cores = ⌊(total CPU cores − reserved cores) / number of computing cards⌋
allocatable-unit memory = ⌊(total memory − reserved memory) / number of computing cards⌋

where ⌊·⌋ denotes rounding down, and the reserved cores and reserved memory are resources kept back to ensure normal operation of the system.
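The division above can be sketched in Python (a minimal illustration; the class and field names are assumptions, not part of the application):

```python
from dataclasses import dataclass


@dataclass
class AllocatableUnit:
    """One computing card plus its share of CPU cores and memory (GB)."""
    cpu_cores: int
    memory_gb: int
    cards: int = 1  # the number of graphics cards is fixed at 1


def build_allocatable_unit(total_cores: int, reserved_cores: int,
                           total_memory_gb: int, reserved_memory_gb: int,
                           num_cards: int) -> AllocatableUnit:
    # Integer division implements the floor in the formula above; the
    # reserved cores/memory are kept back for normal system operation.
    return AllocatableUnit(
        cpu_cores=(total_cores - reserved_cores) // num_cards,
        memory_gb=(total_memory_gb - reserved_memory_gb) // num_cards,
    )
```

For example, a server with 128 cores, 512 GB of memory, and 8 cards that reserves 8 cores and 32 GB yields allocatable units of 15 cores and 60 GB each.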
Optionally, in this embodiment, when determining the allocatable unit of the distributed model training platform, the CPU parameters, the memory parameters and the preset graphics card parameters of the physical machine server need to be comprehensively considered first. Such an allocatable unit is typically a hardware unit that contains a graphics card, and associated CPU and memory resources. First, the hardware parameters of each physical machine server may be collected, including:
CPU parameters: including model number, core number, clock frequency, etc.
Memory parameters: including total memory capacity, memory type, memory frequency, etc.
Graphics card parameters: including graphics card model, memory capacity, computing performance, etc.
Such an allocatable unit is a basic computational unit in a distributed model training platform that will be dynamically allocated and managed across the platform to meet the computational requirements of different tasks. The design considers the whole hardware configuration of the physical machine server, and ensures that the computing resources can be fully utilized in the training process.
Step S102: dividing the allocatable units into groups according to the physical machine server to which they belong to obtain a plurality of allocatable unit groups, and constructing a corresponding virtual cluster according to at least one allocatable unit group;
Optionally, in this embodiment, because cross-machine communication is slower, multi-card distributed training on a single machine is more efficient than multi-machine multi-card training. The allocatable units are therefore grouped by the physical machine on which they reside: units in the same group are on the same physical machine, one physical machine may be divided into several allocatable unit groups, and cards on the same physical machine may thus belong to different virtual clusters.
Optionally, during cluster operation the allocatable unit groups are dynamic. For example, after an 8-card allocatable unit group allocates 2 cards to a task, the remaining 6 cards form a new allocatable unit group, and the 8-card group is the parent group of the new allocatable unit group.
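The dynamic splitting of an allocatable unit group described above might be modeled as follows (a sketch; `UnitGroup` and its fields are hypothetical names):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnitGroup:
    """A group of allocatable units residing on one physical machine."""
    host: str
    free_cards: int
    parent: Optional["UnitGroup"] = None
    allocated_tasks: int = 0

    def allocate(self, cards: int) -> "UnitGroup":
        # Splitting off `cards` for a task leaves the remainder as a new
        # group whose parent is the group the cards were taken from.
        if cards > self.free_cards:
            raise ValueError("not enough free cards in this group")
        self.allocated_tasks += 1
        return UnitGroup(host=self.host,
                         free_cards=self.free_cards - cards,
                         parent=self)
```

For example, allocating 2 cards from an 8-card group yields a 6-card child group whose parent is the original 8-card group.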
Optionally, in this embodiment, the physical cluster may be divided into multiple virtual clusters, which are allocated to different teams. This achieves resource isolation while keeping unified management of the physical cluster, so that multiple clusters do not need to be built for multiple teams. Compared with simply allocating resources by proportion, this approach keeps the allocated resources as physically close together as possible and reduces cross-machine communication. Each virtual cluster is made up of one or more allocatable unit groups and can be allocated to a single team or shared by multiple teams.
Optionally, in this embodiment, the allocatable units may be divided into groups according to the physical machine server to which they belong, obtaining a plurality of allocatable unit groups. Each allocatable unit group consists of several allocatable units on the same physical machine and forms a resource group (i.e., an allocatable unit group). Corresponding virtual clusters can then be constructed from these allocatable unit groups.
Specifically, for each physical machine server, the allocatable units on it are divided into different groups. The allocatable units within a group may share the same hardware resources, e.g., the same CPU, memory, and preset graphics card parameters; such organization helps manage and utilize the resources of each physical machine more effectively. By grouping the allocatable units on all physical machines, a plurality of allocatable unit groups is obtained. The allocatable units within a group have similar hardware characteristics, while hardware configurations may differ between groups. A corresponding virtual cluster is built for each allocatable unit group.
It will be appreciated that a virtual cluster is a logically organized unit that may contain allocatable units from different physical machines. Such virtual clusters can better accommodate the needs of different tasks, providing more flexible resource management. For each virtual cluster, relevant parameters such as cluster capacity, computing resource allocation policies, etc. are configured. These configurations can be adjusted according to task load and performance requirements to meet the requirements of different application scenarios.
Meanwhile, in the virtual cluster operation process, a dynamic adjustment mechanism is realized. For example, when the task load in a certain virtual cluster is light, some of the allocatable units in that cluster may be dynamically moved to other clusters to better utilize the resources.
The virtual cluster construction and management mode enables the system to be more flexible and adaptive, and flexible resource allocation and management can be carried out according to the characteristics and requirements of different tasks. This is important for performance optimization and task scheduling of the distributed model training platform.
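As a rough sketch of the grouping and virtual-cluster construction steps (the host names and dict-based representation are illustrative assumptions):

```python
from collections import defaultdict


def group_units_by_host(units):
    """Partition allocatable units by the physical server they belong to,
    yielding one allocatable unit group per host."""
    groups = defaultdict(list)
    for unit in units:
        groups[unit["host"]].append(unit)
    return dict(groups)


def build_virtual_cluster(groups, hosts):
    """A virtual cluster is a logical collection of allocatable unit
    groups, here selected by host name."""
    return {host: groups[host] for host in hosts if host in groups}
```

Grouping by host keeps each group's units on one machine, so a virtual cluster built from whole groups keeps allocated resources physically close and minimizes cross-machine communication.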
Step S103: receiving a model training request sent by a user, and performing computation according to the computing power demand in the model training request and an idle cloud computing service computing power unit in a virtual cluster of the distributed model training platform to obtain a model training result.
Optionally, in this embodiment, a model training request is received from the user; the request may include information such as the model architecture, data set, and training hyperparameters, which is used to allocate appropriate computing resources. The task demands in the model training request are analyzed, including the required computing resources and training data, for example the type and number of graphics cards the model requires and other associated hardware and software configurations. Then, the idle computing resources on the distributed model training platform are queried; this may involve examining the state of the allocatable unit groups to determine which virtual clusters have free resources that meet the user's task demands. Finally, resources are allocated dynamically according to the task demands and the platform's idle computing resources: an appropriate virtual cluster is selected, a corresponding number of allocatable units is allocated, and sufficient computing resources are ensured to complete the model training task.
Optionally, in this embodiment, when a new model training task arrives, resources may be searched for allocation according to the following rules:
(1) Search for the allocatable unit group whose number of idle cards (i.e., idle cloud computing service computing power units of the distributed model training platform) is closest to the task demand; if there is only one such group, allocate the task to it. If there are multiple candidate allocatable unit groups, proceed to the next rule.
(2) Compare the number of tasks allocated to the parent group of each candidate allocatable unit group, and allocate the task to the group whose parent has more tasks.
In this way, the contiguity of idle allocatable unit groups is preserved, and allocatable unit groups with more resources are retained as far as possible to satisfy tasks with larger resource demands.
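Rules (1) and (2) amount to a tightest-fit search with a parent-task tiebreak; a minimal sketch (the dict keys are assumed names):

```python
def pick_group(groups, demand):
    """Rule (1): among groups with enough idle cards, prefer the one
    whose idle-card count is closest to the demand (tightest fit).
    Rule (2): on a tie, prefer the group whose parent group already
    carries more allocated tasks."""
    candidates = [g for g in groups if g["free_cards"] >= demand]
    if not candidates:
        return None  # no single group can satisfy the demand
    best_fit = min(g["free_cards"] for g in candidates)
    tied = [g for g in candidates if g["free_cards"] == best_fit]
    return max(tied, key=lambda g: g.get("parent_tasks", 0))
```

The tightest fit avoids carving small tasks out of large idle groups, so larger contiguous groups stay available for multi-card jobs.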
For example, existing AI training platforms generally divide tasks into algorithm development and algorithm training and manage the resources of these two task types separately. However, there is another common task in deep learning: data processing. The three task types are characterized as follows:
Algorithm development: resource occupation is uncertain; the resources are idle most of the time, and occupancy rises during debugging.
Data processing: resource occupation is stable, and the computing card utilization is lower than that of algorithm training.
Algorithm training: resource occupation is stable, and the computing card utilization is high.
For these task characteristics, the following task scheduling management policies are formulated:
(1) For algorithm development tasks, a public virtual cluster is established, composed of computing cards with lower computing power. Tasks are allocated exclusively so that development efficiency is not affected. The CPU and computing card utilization of each task is monitored in real time: if the peak utilization of the CPU and computing card stays below a threshold for a certain period (e.g., 1 hour), an alarm is triggered to remind the task initiator; if the peak utilization is still below the threshold for a further period (e.g., 1 hour) after the alarm, the task is stopped and its resources are released.
(2) Data processing tasks share virtual clusters with training tasks, and task allocation uses the shared mode. When initiating a data processing task, the user fills in the network structure and batch size, and the platform records the relationship between network structure, batch size, and the task's average resource occupancy. When a new data processing task arrives, if the platform currently has no running data processing task, the task is started directly and the allocated unit group is marked as shared; otherwise, the resource occupancy is predicted from the recorded relationship, and a resource group whose remaining resources meet the demand is searched for among the running data processing tasks. If such a group exists, the task is allocated within it; if not, a new resource group is allocated and marked as shared. In this way, computing resources can be utilized more efficiently.
(3) Algorithm training tasks are allocated exclusively according to the rules above. The CPU and computing card utilization of each task is monitored in real time: if the utilization stays below a threshold for a certain period (e.g., 1 hour), an alarm is triggered to remind the task initiator; if the peak utilization is still below the threshold for a further period (e.g., 1 hour) after the alarm, the task is stopped and its resources are released, preventing inefficient tasks from occupying resources for a long time.
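The alarm-then-stop monitoring used by policies (1) and (3) can be sketched as a small state machine (the threshold value is an assumption; the application only gives 1 hour as an example window):

```python
import time

ALERT_THRESHOLD = 0.10  # assumed: peak utilization below 10% counts as idle
GRACE_SECONDS = 3600    # 1 hour, as in the policies above


def check_task(task, now=None):
    """Return 'ok', 'alert', or 'stop' for a task based on its peak
    CPU/computing-card utilization. `task` is a dict with keys
    'peak_util' (0..1), 'low_since' (timestamp or None), 'alerted'."""
    now = now if now is not None else time.time()
    if task["peak_util"] >= ALERT_THRESHOLD:
        # Utilization recovered: reset the idle clock and alarm state.
        task["low_since"] = None
        task["alerted"] = False
        return "ok"
    if task["low_since"] is None:
        task["low_since"] = now  # start timing the idle period
        return "ok"
    elapsed = now - task["low_since"]
    if not task["alerted"] and elapsed >= GRACE_SECONDS:
        task["alerted"] = True
        task["low_since"] = now  # restart the clock after the alarm
        return "alert"
    if task["alerted"] and elapsed >= GRACE_SECONDS:
        return "stop"  # still idle one window after the alarm
    return "ok"
```

A task thus gets one full window of low utilization before the alarm, and a second full window after it, before being stopped and having its resources released.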
As can be seen from the above description, the resource task processing method of the artificial intelligent model training platform provided by the embodiment of the application can determine the allocable units of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server and the preset graphics card parameters; dividing groups according to the physical machine server to which the allocable units belong to obtain a plurality of allocable unit groups, and constructing corresponding virtual clusters according to at least one allocable unit group; and receiving a model training request sent by a user, and performing calculation processing according to the calculation power demand in the model training request and idle cloud computing service calculation power units in the virtual cluster of the distributed model training platform to obtain a model training result, so that the platform resource utilization rate during the training of the distributed artificial intelligent model can be effectively improved.
In an embodiment of the method for processing resource tasks of the artificial intelligence model training platform of the present application, referring to fig. 2, the method may further specifically include the following:
Step S201: determining corresponding parameters of the CPU of the assignable unit according to the total core number of the CPU of the physical machine server, the reserved minimum core number and the number of cloud computing service computing units corresponding to the CPU of the physical machine server;
step S202: determining corresponding memory parameters of the assignable units according to the total memory number of the physical machine server, the reserved minimum memory and the number of cloud computing service computing units corresponding to the memory of the physical machine server;
step S203: and determining the allocable unit according to the allocable unit CPU parameter, the allocable unit memory parameter and a preset display card parameter.
Optionally, in this embodiment, in order to reduce resource fragmentation at the source, the allocable unit is constrained so that each allocable unit corresponds to exactly one computing unit (computing card) of the cloud computing service.
It can be understood that the allocable unit in this embodiment consists of a CPU, memory, and a graphics card, where the number of graphics cards is fixed at 1, and the CPU and memory are determined according to the configuration of the physical machine; physical machines equipped with the same class of computing card share the same configuration. The allocable unit is calculated as follows: allocable-unit CPU cores = ⌊(total CPU cores − reserved cores) / number of computing cards⌋, and allocable-unit memory = ⌊(total memory − reserved memory) / number of computing cards⌋, where ⌊·⌋ denotes rounding down, and the reserved cores and reserved memory are resources set aside to ensure the normal operation of the system.
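The per-unit calculation described in steps S201 to S203 can be sketched as a small helper; the function name and the example figures in the test are illustrative assumptions:

```python
def allocable_unit(total_cores, reserved_cores, total_mem_gb,
                   reserved_mem_gb, num_cards):
    """Compute the CPU and memory of one allocable unit on a physical machine.

    Each allocable unit is pinned to exactly one computing card; the reserved
    cores/memory keep the host system running and are excluded before the
    remainder is divided evenly (rounding down) across the cards.
    """
    return {
        "cards": 1,
        "cpu_cores": (total_cores - reserved_cores) // num_cards,
        "memory_gb": (total_mem_gb - reserved_mem_gb) // num_cards,
    }
```

For example, a server with 64 cores (4 reserved), 512 GB of memory (32 GB reserved), and 8 computing cards would yield allocable units of 1 card, 7 cores, and 60 GB each.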
In an embodiment of the method for processing resource tasks of the artificial intelligence model training platform of the present application, referring to fig. 3, the method may further specifically include the following:
Step S301: organizing the allocatable units on the physical machine server with the same identifier into at least one allocatable unit group, and constructing a corresponding virtual cluster according to the allocatable unit group and a preset computing resource allocation strategy;
Step S302: and monitoring the resource utilization rate of each allocatable unit group in the virtual cluster through a preset task scheduler, and releasing the computing resource of the allocatable unit group after the resource utilization rate of the allocatable unit group is lower than a preset threshold value to obtain the updated virtual cluster.
Optionally, in this embodiment, the allocable units on physical machine servers having the same identifier may be organized into at least one allocable unit group; that is, allocable units from the same physical machine are organized together to form a resource group. A corresponding virtual cluster is then constructed based on the organized allocable unit groups and a preset computing resource allocation strategy. For example, the nature of the task, the hardware configuration, and the task scheduling priority may be considered in determining the composition of the virtual cluster.
Optionally, this embodiment may further design and implement a preset task scheduler responsible for monitoring the resource utilization of each allocable unit group in the virtual cluster. This scheduler needs to know the status of each group, including computing resource utilization and task execution status. The task scheduler periodically monitors the resource utilization of every unit group in the virtual cluster, covering both groups to which tasks have been allocated and groups to which no task has been allocated. Specifically, taking the monitoring of the resources occupied by a task as an example, consider a unit group that originally has 8 cards: 2 cards are allocated to a task, and the remaining 6 cards form a new allocable unit group. The 2 allocated cards are monitored, and if their utilization is low, the resources are released. The resources occupied by the 2-card task can be called an allocated unit group, which makes the two kinds of resources easy to distinguish. This can be achieved by collecting information on the operating status, task load, and resource consumption of the allocable units within each group.
When the resource utilization rate of a certain allocatable unit group is lower than a preset threshold value, the task scheduler takes corresponding measures. This may include freeing the group of computing resources, marking them as idle, for reassignment when a higher priority task arrives in the future.
And the task scheduler obtains the updated virtual cluster state by releasing the resources. This may result in a recombination or reassignment of the groups of allocatable units in the virtual cluster to accommodate new task demands and system states.
Therefore, the embodiment realizes flexible allocation of computing resources by monitoring and managing the resource utilization rate of each allocatable unit group in the virtual cluster. The dynamic adjustment mechanism can better cope with the change of task load, and ensure that the system efficiently utilizes computing resources.
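The split-and-release behaviour described in this embodiment (for example, the 8-card unit group from which 2 cards are allocated and later reclaimed) can be sketched as follows. The `UnitGroup` class, the 10% threshold, and the function names are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnitGroup:
    host: str
    cards: int                          # number of allocable units (cards)
    allocated_to: Optional[str] = None  # task id if this group is allocated

def allocate(group, task_id, cards_needed):
    """Split cards out of an allocable group for a task; the remaining
    cards form a new allocable unit group on the same host."""
    assert cards_needed <= group.cards, "not enough cards in this group"
    allocated = UnitGroup(group.host, cards_needed, allocated_to=task_id)
    remainder = UnitGroup(group.host, group.cards - cards_needed)
    return allocated, remainder

def release_if_idle(allocated_group, utilization, threshold=0.10):
    """Release an allocated unit group whose monitored utilization fell
    below the threshold, marking its cards idle for reassignment."""
    if utilization < threshold:
        allocated_group.allocated_to = None
        return True
    return False
```

Under this sketch, `allocate(UnitGroup("srv1", 8), "task-1", 2)` yields a 2-card allocated group and a 6-card allocable group, and the scheduler may later call `release_if_idle` on the allocated group when its utilization stays low.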
In an embodiment of the method for processing resource tasks of the artificial intelligence model training platform of the present application, referring to fig. 4, the method may further specifically include the following:
Step S401: receiving a model training request sent by a user, and determining a corresponding assignable unit group, the number of assigned tasks of a father group of each assignable unit group and the number of remaining assignable units according to the resource requirements in the model training request and idle cloud computing service computing units of the distributed model training platform;
step S402: and distributing the training task corresponding to the model training request to the distributable unit group with the largest distributed task number in the distributable unit groups with the residual distributable unit numbers meeting the training task, and performing calculation processing to obtain a model training result.
Optionally, in this embodiment, a model training request may be received from the user side, where the request includes information such as the model architecture, the data set, and training hyperparameters. The task requirements in the model training request are then analyzed, including the required computing resources, training data, and the like; this may cover graphics card type, quantity, and other related hardware and software configurations.
The present embodiment may then query the free computing resources on the distributed model training platform, including the status of each allocable unit group, the number of allocated tasks, and the number of remaining allocable units. Based on the task requirements and the available computing resources, a corresponding allocable unit group is determined; factors such as hardware configuration and task priority may also need to be considered. The training task is then assigned to a group that meets its requirements: among the groups whose remaining allocable units satisfy the training task, the group with the largest number of allocated tasks is selected, which ensures better utilization of resources.
Finally, the present embodiment initiates model training tasks on the selected allocatable cell groups. This may involve configuring a training environment, loading models and data on each allocatable unit, and then starting the training process. In the model training process, the state and performance of the task are monitored in real time. This includes tracking information on the progress of computation, resource utilization, etc. of each allocable unit group.
Therefore, the embodiment can make reasonable decisions according to the demands of users, the states of the system and the availability of resources. This helps ensure that the system fully utilizes computing resources while meeting model training requirements of different users.
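The selection rule of steps S401 and S402 (among the groups whose remaining allocable units satisfy the task, choose the one with the most allocated tasks) can be sketched as a simple best-fit pass; the dictionary keys used here are hypothetical:

```python
def select_group(groups, units_needed):
    """Pick the allocable unit group for a training task.

    `groups` is a list of dicts with keys:
      'id', 'free_units' (remaining allocable units), and
      'allocated_tasks' (tasks already placed on the group).
    Among feasible groups, the one with the most already-allocated tasks
    wins, which packs work onto busy servers and keeps others whole for
    larger future tasks. Returns None if no group fits.
    """
    feasible = [g for g in groups if g["free_units"] >= units_needed]
    if not feasible:
        return None
    return max(feasible, key=lambda g: g["allocated_tasks"])
```

Packing onto the busiest feasible group is the anti-fragmentation choice: groups with no tasks stay empty as long as possible and remain available for requests that need many units at once.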
In an embodiment of the method for processing resource tasks of the artificial intelligence model training platform of the present application, referring to fig. 5, the method may further specifically include the following:
Step S501: receiving a model training request sent by a user, and if the model training request corresponds to a data processing task, determining a corresponding estimated resource occupancy rate according to a network structure and batch processing information of the data processing task;
step S502: and determining a corresponding allocable unit group according to the estimated resource occupancy rate, and allocating the data processing tasks to the allocable unit group for calculation processing to obtain a model training result, wherein a plurality of data processing tasks share the same allocable unit group.
Optionally, in this embodiment, the data processing task request may be received from the user, where the data processing task request may include related information such as a network structure of the task, batch information, a data set, and the like. And estimating the occupancy rate of the task to the computing resource according to the network structure and batch processing information of the data processing task. This may include evaluating the type, number, and memory requirements of graphics cards required.
The present embodiment may then query the free computing resources on the distributed model training platform, including the status of each allocable unit group, the number of allocated tasks, and the number of remaining allocable units, and determine a corresponding allocable unit group according to the estimated resource occupancy. Hardware configuration, task priority, and the like may also need to be considered.
Finally, the present embodiment may assign the data processing tasks to the selected assignable unit groups. Ensuring that the task is able to perform the computational process while meeting the resource requirements.
Therefore, the embodiment can reasonably allocate and manage according to the nature of the task, the resource requirement and the system state. This helps ensure that the system efficiently utilizes computing resources while meeting the data processing requirements of different users.
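The shared placement strategy for data processing tasks described above (predict occupancy from recorded network-structure/batch-size observations, reuse a running shared group with enough headroom, otherwise open a new shared group) can be sketched as follows. The class name, the running-average update, and the normalization of group capacity to 1.0 are illustrative assumptions:

```python
class SharedScheduler:
    """Place data-processing tasks on shared unit groups using recorded
    (network structure, batch size) -> average occupancy observations."""

    def __init__(self):
        self.history = {}        # (network, batch) -> avg occupancy in [0, 1]
        self.shared_groups = []  # each: {"load": float}, capacity = 1.0

    def record(self, network, batch_size, occupancy):
        # Keep a simple running average of observed occupancy per task shape.
        key = (network, batch_size)
        prev = self.history.get(key)
        self.history[key] = occupancy if prev is None else (prev + occupancy) / 2

    def predict(self, network, batch_size, default=0.5):
        # Unknown task shapes fall back to a conservative default estimate.
        return self.history.get((network, batch_size), default)

    def place(self, network, batch_size):
        need = self.predict(network, batch_size)
        for group in self.shared_groups:   # reuse a shared group with headroom
            if group["load"] + need <= 1.0:
                group["load"] += need
                return group
        group = {"load": need}             # otherwise open a new shared group
        self.shared_groups.append(group)
        return group
```

Each completed task would feed its measured average occupancy back through `record`, so the predictions improve as the platform accumulates history.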
In an embodiment of the method for processing resource tasks of the artificial intelligence model training platform of the present application, referring to fig. 6, the method may further specifically include the following:
Step S601: receiving a model training request sent by a user, and if the model training request corresponds to an algorithm development task, selecting a virtual cluster constructed by a set low-calculation-force assignable unit group to perform calculation processing according to task requirements in the model training request to obtain a model training result;
Step S602: and monitoring whether the CPU parameters of the virtual cluster and the highest utilization rate of the cloud computing service computing power unit are lower than a preset lowest threshold according to a set time period, if so, stopping the algorithm development task, and releasing the computing resources of the virtual cluster.
Alternatively, in this embodiment, in view of the low demand of the task for computing resources, the allocable unit group with low computing power is selected to construct the virtual cluster. Such a choice may balance resource utilization efficiency and cost. And distributing the algorithm development task to the constructed virtual cluster. And ensuring that each allocable unit group can meet the calculation requirement of the task, and starting an algorithm development process.
Then, this embodiment may monitor the CPU parameters of the virtual cluster and the highest utilization of the cloud computing service computing power units over a set time period, for example by collecting and analyzing the performance metrics of each allocable unit group in real time, and judge whether the monitored utilization of the virtual cluster is below a preset minimum threshold. This threshold is set, for example, according to system performance and task requirements. If the monitoring result shows that the resource utilization of the virtual cluster is below the preset minimum threshold, the cluster is largely idle: the current algorithm development task can be stopped and the computing resources of the virtual cluster released, so that other, more urgent tasks can obtain more computing resources.
In an embodiment of the method for processing resource tasks of the artificial intelligence model training platform of the present application, referring to fig. 7, the method may further specifically include the following:
step S701: receiving a model training request sent by a user, and if the model training request corresponds to an algorithm training task, calculating according to task requirements in the model training request through the virtual cluster to obtain a model training result;
step S702: and monitoring whether the CPU parameters of the virtual cluster and the highest utilization rate of the cloud computing service computing power unit are lower than a preset lowest threshold according to a set time period, if so, stopping the algorithm training task, and releasing the computing resources of the virtual cluster.
Alternatively, in this embodiment, the algorithm training task may be assigned to the virtual cluster. And ensuring that each allocable unit group can meet the calculation requirement of a task, and starting an algorithm training process. And monitoring CPU parameters of the physical machine servers in the virtual clusters and the highest utilization rate of the cloud computing service computing units according to the set time period. This may be achieved by collecting and analyzing the performance metrics of each of the allocatable cell groups in real time.
And then, judging whether the monitored virtual cluster resource utilization rate is lower than a preset minimum threshold value or not. This threshold may be set according to system performance and task requirements. If the monitoring result shows that the resource utilization rate of the virtual cluster is lower than the preset minimum threshold, the system has higher idleness, and the current algorithm training task can be stopped. The computing resources of the virtual cluster are released so that other more urgent tasks can get more computing resources.
Therefore, the embodiment dynamically adjusts the resource allocation according to the nature of the task and the current system state so as to improve the overall resource utilization efficiency.
In order to effectively improve the platform resource utilization rate during the training of the distributed artificial intelligence model, the application provides an embodiment of a resource task processing device of an artificial intelligence model training platform for realizing all or part of the content of a resource task processing method of the artificial intelligence model training platform, referring to fig. 8, the resource task processing device of the artificial intelligence model training platform specifically comprises the following contents:
The allocatable unit determining module 10 is configured to determine an allocatable unit of the distributed model training platform according to the physical machine server CPU parameter, the physical machine server memory parameter, and the preset graphic card parameter;
The virtual cluster construction module 20 is configured to obtain a plurality of allocable unit groups according to group division of the physical machine servers to which the allocable units belong, and construct a corresponding virtual cluster according to at least one allocable unit group;
The model training module 30 is configured to receive a model training request sent by a user, and perform calculation processing according to a calculation power requirement in the model training request and an idle cloud computing service calculation power unit in a virtual cluster of the distributed model training platform, so as to obtain a model training result.
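A minimal sketch of how the three modules of fig. 8 might be composed; the class name is hypothetical and the modules are passed in as plain callables rather than the concrete components of the device:

```python
class ResourceTaskProcessor:
    """Illustrative composition of the three modules described above."""

    def __init__(self, unit_determiner, cluster_builder, trainer):
        self.unit_determiner = unit_determiner  # module 10: servers -> units
        self.cluster_builder = cluster_builder  # module 20: units -> clusters
        self.trainer = trainer                  # module 30: clusters + request

    def handle(self, servers, request):
        units = self.unit_determiner(servers)
        clusters = self.cluster_builder(units)
        return self.trainer(clusters, request)
```

The point of the sketch is only the data flow: server parameters are turned into allocable units, units are grouped into virtual clusters, and the training request is computed against the clusters' idle computing power units.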
As can be seen from the above description, the resource task processing device of the artificial intelligent model training platform provided by the embodiment of the application can determine the allocatable units of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server and the preset graphics card parameters; dividing groups according to the physical machine server to which the allocable units belong to obtain a plurality of allocable unit groups, and constructing corresponding virtual clusters according to at least one allocable unit group; and receiving a model training request sent by a user, and performing calculation processing according to the calculation power demand in the model training request and idle cloud computing service calculation power units in the virtual cluster of the distributed model training platform to obtain a model training result, so that the platform resource utilization rate during the training of the distributed artificial intelligent model can be effectively improved.
In order to effectively improve the platform resource utilization rate during the training of the distributed artificial intelligent model from the hardware level, the application provides an embodiment of an electronic device for realizing all or part of the content in the resource task processing method of the artificial intelligent model training platform, wherein the electronic device specifically comprises the following contents:
A processor (processor), a memory (memory), a communication interface (Communications Interface), and a bus, wherein the processor, the memory, and the communication interface complete communication with each other through the bus; the communication interface is used for realizing information transmission between the resource task processing device of the artificial intelligent model training platform and related equipment such as a core service system, user terminals, and related databases. The logic controller may be a desktop computer, a tablet computer, a mobile terminal, or the like, and this embodiment is not limited thereto. In this embodiment, the logic controller may be implemented with reference to the embodiments of the resource task processing method and the resource task processing device of the artificial intelligent model training platform, the contents of which are incorporated herein and not repeated here.
It is understood that the user terminal may include a smart phone, a tablet electronic device, a network set top box, a portable computer, a desktop computer, a Personal Digital Assistant (PDA), a vehicle-mounted device, a smart wearable device, etc. Wherein, intelligent wearing equipment can include intelligent glasses, intelligent wrist-watch, intelligent bracelet etc..
In practical application, part of the resource task processing method of the artificial intelligent model training platform can be executed on the electronic device side as described above, or all operations can be completed in the client device. Specifically, the selection may be made according to the processing capability of the client device, and restrictions of the use scenario of the user. The application is not limited in this regard. If all operations are performed in the client device, the client device may further include a processor.
The client device may have a communication module (i.e. a communication unit) and may be connected to a remote server in a communication manner, so as to implement data transmission with the server. The server may include a server on the side of the task scheduling center, and in other implementations may include a server of an intermediate platform, such as a server of a third party server platform having a communication link with the task scheduling center server. The server may include a single computer device, a server cluster formed by a plurality of servers, or a server structure of a distributed device.
Fig. 9 is a schematic block diagram of a system configuration of an electronic device 9600 according to an embodiment of the present application. As shown in fig. 9, the electronic device 9600 may include a central processor 9100 and a memory 9140; the memory 9140 is coupled to the central processor 9100. Notably, this fig. 9 is exemplary; other types of structures may also be used in addition to or in place of the structures to implement telecommunications functions or other functions.
In one embodiment, the resource task processing method functionality of the artificial intelligence model training platform may be integrated into the central processor 9100. The central processor 9100 may be configured to perform the following control:
Step S101: determining an assignable unit of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server and the preset display card parameters;
Step S102: dividing groups according to the physical machine server to which the allocable units belong to obtain a plurality of allocable unit groups, and constructing corresponding virtual clusters according to at least one allocable unit group;
Step S103: and receiving a model training request sent by a user, and performing calculation processing according to the calculation power demand in the model training request and an idle cloud computing service calculation power unit in a virtual cluster of the distributed model training platform to obtain a model training result.
As can be seen from the above description, the electronic device provided by the embodiment of the present application determines the allocable units of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and the preset graphics card parameters; dividing groups according to the physical machine server to which the allocable units belong to obtain a plurality of allocable unit groups, and constructing corresponding virtual clusters according to at least one allocable unit group; and receiving a model training request sent by a user, and performing calculation processing according to the calculation power demand in the model training request and idle cloud computing service calculation power units in the virtual cluster of the distributed model training platform to obtain a model training result, so that the platform resource utilization rate during the training of the distributed artificial intelligent model can be effectively improved.
In another embodiment, the resource task processing device of the artificial intelligent model training platform may be configured separately from the central processor 9100, for example, the resource task processing device of the artificial intelligent model training platform may be configured as a chip connected to the central processor 9100, and the function of the resource task processing method of the artificial intelligent model training platform is implemented by the control of the central processor.
As shown in fig. 9, the electronic device 9600 may further include: a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. It is noted that the electronic device 9600 need not include all of the components shown in fig. 9; in addition, the electronic device 9600 may further include components not shown in fig. 9, and reference may be made to the related art.
As shown in fig. 9, the central processor 9100, sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, which central processor 9100 receives inputs and controls the operation of the various components of the electronic device 9600.
The memory 9140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or other suitable device. It may store information about failures, as well as programs for processing such information, and the central processor 9100 can execute the programs stored in the memory 9140 to realize information storage, processing, and the like.
The input unit 9120 provides input to the central processor 9100. The input unit 9120 is, for example, a key or a touch input device. The power supply 9170 is used to provide power to the electronic device 9600. The display 9160 is used for displaying display objects such as images and characters. The display may be, for example, but not limited to, an LCD display.
The memory 9140 may be a solid state memory such as a read only memory (ROM), a random access memory (RAM), a SIM card, or the like. It may also be a memory that holds information even when powered down, and that can be selectively erased and rewritten with further data; an example of such a memory is sometimes referred to as an EPROM or the like. The memory 9140 may also be some other type of device. The memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer) and may include an application/function storage portion 9142, which stores application programs and function programs or a flow for executing operations of the electronic device 9600 by the central processor 9100.
The memory 9140 may also include a data store 9143, the data store 9143 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver storage portion 9144 of the memory 9140 may include various drivers of the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, address book applications, etc.).
The communication module 9110 is a transmitter/receiver 9110 that transmits and receives signals via an antenna 9111. A communication module (transmitter/receiver) 9110 is coupled to the central processor 9100 to provide input signals and receive output signals, as in the case of conventional mobile communication terminals.
Based on different communication technologies, a plurality of communication modules 9110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, etc., may be provided in the same electronic device. The communication module (transmitter/receiver) 9110 is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and to receive audio input from the microphone 9132 to implement usual telecommunications functions. The audio processor 9130 can include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 9130 is also coupled to the central processor 9100 so that sound can be recorded locally through the microphone 9132 and sound stored locally can be played through the speaker 9131.
The embodiment of the present application further provides a computer readable storage medium capable of implementing all the steps of the resource task processing method of the artificial intelligent model training platform whose execution subject is a server or a client. The computer readable storage medium stores a computer program which, when executed by a processor, implements all the steps of that method; for example, the processor implements the following steps when executing the computer program:
Step S101: determining an assignable unit of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server and the preset display card parameters;
Step S102: dividing groups according to the physical machine server to which the allocable units belong to obtain a plurality of allocable unit groups, and constructing corresponding virtual clusters according to at least one allocable unit group;
Step S103: and receiving a model training request sent by a user, and performing calculation processing according to the calculation power demand in the model training request and an idle cloud computing service calculation power unit in a virtual cluster of the distributed model training platform to obtain a model training result.
As can be seen from the above description, the computer readable storage medium provided by the embodiments of the present application determines the allocable units of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and the preset graphics card parameters; dividing groups according to the physical machine server to which the allocable units belong to obtain a plurality of allocable unit groups, and constructing corresponding virtual clusters according to at least one allocable unit group; and receiving a model training request sent by a user, and performing calculation processing according to the calculation power demand in the model training request and idle cloud computing service calculation power units in the virtual cluster of the distributed model training platform to obtain a model training result, so that the platform resource utilization rate during the training of the distributed artificial intelligent model can be effectively improved.
The embodiment of the present application further provides a computer program product capable of implementing all the steps of the resource task processing method of the artificial intelligent model training platform whose execution subject is a server or a client. The steps of the method are implemented when the computer program/instructions are executed by a processor; for example, the computer program/instructions implement the following steps:
Step S101: determining an assignable unit of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server and the preset display card parameters;
Step S102: dividing the allocable units into groups according to the physical machine server to which they belong to obtain a plurality of allocable unit groups, and constructing a corresponding virtual cluster according to at least one allocable unit group;
Step S103: receiving a model training request sent by a user, and performing calculation processing according to the computing power demand in the model training request and an idle cloud computing service computing power unit in the virtual cluster of the distributed model training platform to obtain a model training result.
As can be seen from the above description, the computer program product provided by the embodiments of the present application determines the allocable units of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and the preset graphics card parameters; divides the allocable units into groups according to the physical machine server to which they belong, obtaining a plurality of allocable unit groups, and constructs a corresponding virtual cluster from at least one allocable unit group; and receives a model training request sent by a user and performs calculation processing according to the computing power demand in the model training request and the idle cloud computing service computing power units in the virtual cluster of the distributed model training platform to obtain a model training result, so that the platform resource utilization rate during distributed artificial intelligence model training can be effectively improved.
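Read together, steps S101 to S103 form a small pipeline: size the allocable units per server, group them by server into a virtual cluster, then serve training requests from idle units. The Python sketch below illustrates one way that pipeline could fit together; the class and function names and the even-division sizing rule are assumptions made for illustration, not details fixed by the patent:

```python
from dataclasses import dataclass

@dataclass
class AllocableUnit:
    server_id: str   # physical machine server the unit belongs to
    cpu_cores: int   # allocable unit CPU parameter
    memory_gb: int   # allocable unit memory parameter
    gpu: str         # preset graphics card parameter

def build_units(servers, units_per_server, gpu_spec):
    # Step S101: derive allocable units from each server's CPU and memory
    # parameters and the preset graphics card parameter.
    units = []
    for s in servers:
        cpu = (s["cores"] - s["reserved_cores"]) // units_per_server
        mem = (s["mem_gb"] - s["reserved_mem_gb"]) // units_per_server
        units += [AllocableUnit(s["id"], cpu, mem, gpu_spec)
                  for _ in range(units_per_server)]
    return units

def build_virtual_cluster(units):
    # Step S102: group the allocable units by their physical machine server;
    # a virtual cluster is built from one or more such groups.
    groups = {}
    for u in units:
        groups.setdefault(u.server_id, []).append(u)
    return groups

def handle_request(cluster, demand_units):
    # Step S103: serve a model training request from idle computing power
    # units; return the granted units, or None if no group has capacity.
    for server_id, group in cluster.items():
        if len(group) >= demand_units:
            granted = group[:demand_units]
            cluster[server_id] = group[demand_units:]
            return granted
    return None
```

A request that no single group can satisfy is simply rejected here; a real scheduler would queue it until the monitoring loop frees capacity.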
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principles and embodiments of the present invention have been described in detail with reference to specific examples, which are provided only to facilitate understanding of the method and core ideas of the present invention. Meanwhile, those skilled in the art may make variations to the specific embodiments and application scope in accordance with the ideas of the present invention; in view of the above, the contents of this description should not be construed as limiting the present invention.

Claims (10)

1. A method for processing resource tasks of an artificial intelligence model training platform, the method comprising:
Determining allocable units of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and preset graphics card parameters;
Dividing groups according to the physical machine server to which the allocable units belong to obtain a plurality of allocable unit groups, and constructing corresponding virtual clusters according to at least one allocable unit group;
And receiving a model training request sent by a user, and performing calculation processing according to the computing power demand in the model training request and an idle cloud computing service computing power unit in the virtual cluster of the distributed model training platform to obtain a model training result.
2. The method for processing resource tasks of an artificial intelligence model training platform according to claim 1, wherein determining the allocable units according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and the preset graphics card parameters comprises:
Determining the corresponding allocable unit CPU parameters according to the total number of CPU cores of the physical machine server, the reserved minimum number of cores, and the number of cloud computing service computing units corresponding to the CPU of the physical machine server;
Determining the corresponding allocable unit memory parameters according to the total memory of the physical machine server, the reserved minimum memory, and the number of cloud computing service computing units corresponding to the memory of the physical machine server;
And determining the allocable units according to the allocable unit CPU parameters, the allocable unit memory parameters, and the preset graphics card parameters.
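Claim 2, read as arithmetic, divides each server's capacity beyond a reserved minimum evenly across its cloud computing service computing units. A minimal sketch under that reading (the function and variable names are assumptions, as the claim gives no formula explicitly):

```python
def allocable_unit_params(total_cores, reserved_cores, cpu_unit_count,
                          total_mem_gb, reserved_mem_gb, mem_unit_count):
    # Allocable unit CPU parameter: the cores left after the reserved
    # minimum, split across the computing units mapped to this server's CPU.
    unit_cpu = (total_cores - reserved_cores) // cpu_unit_count
    # Allocable unit memory parameter: the same split applied to memory.
    unit_mem = (total_mem_gb - reserved_mem_gb) // mem_unit_count
    return unit_cpu, unit_mem
```

For a 64-core, 256 GB server that reserves 4 cores and 16 GB and maps four computing units to each resource, this yields 15 cores and 60 GB per allocable unit.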
3. The method for processing resource tasks of an artificial intelligence model training platform according to claim 1, wherein the grouping according to the physical machine server to which the allocable units belong to obtain a plurality of allocable unit groups, and constructing a corresponding virtual cluster according to at least one allocable unit group, includes:
Organizing the allocable units on physical machine servers having the same identifier into at least one allocable unit group, and constructing a corresponding virtual cluster according to the allocable unit group and a preset computing resource allocation strategy;
And monitoring the resource utilization rate of each allocable unit group in the virtual cluster through a preset task scheduler, and releasing the computing resources of an allocable unit group after its resource utilization rate falls below a preset threshold, obtaining the updated virtual cluster.
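The reclaim step of claim 3 can be sketched as a single pass that splits the cluster into groups to keep and groups whose resources are released. The utilization feed and the 10% default threshold below are assumptions for illustration:

```python
def release_idle_groups(cluster, utilization, threshold=0.10):
    """Split a virtual cluster into kept and freed allocable unit groups.

    cluster:      group id -> list of allocable units
    utilization:  group id -> observed resource utilization in [0, 1],
                  as reported by the preset task scheduler
    """
    kept = {gid: units for gid, units in cluster.items()
            if utilization.get(gid, 0.0) >= threshold}
    # Groups below the threshold have their computing resources released;
    # `kept` is the updated virtual cluster.
    freed = {gid: units for gid, units in cluster.items() if gid not in kept}
    return kept, freed
```

Groups with no utilization report default to 0.0 and are therefore released, which matches the claim's intent of reclaiming idle capacity.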
4. The method for processing resource tasks of an artificial intelligence model training platform according to claim 1, wherein receiving the model training request sent by the user and performing calculation processing according to the computing power demand in the model training request and an idle cloud computing service computing power unit in the virtual cluster of the distributed model training platform to obtain a model training result comprises:
Receiving a model training request sent by a user, and determining the corresponding allocable unit groups, the number of assigned tasks of the parent group of each allocable unit group, and the number of remaining allocable units according to the resource requirements in the model training request and the idle cloud computing service computing units of the distributed model training platform;
And allocating the training task corresponding to the model training request to the allocable unit group with the largest number of assigned tasks among the allocable unit groups whose number of remaining allocable units satisfies the training task, and performing calculation processing to obtain a model training result.
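Claim 4's placement rule — among groups with enough remaining units, prefer the one already carrying the most tasks — is a packing heuristic: it concentrates work on busy groups and keeps lightly used groups whole for future large jobs. A sketch of that rule (the dictionary layout is an assumption):

```python
def place_task(groups, demand_units):
    # groups: group id -> {"assigned_tasks": int, "free_units": int}
    # Eligible groups have enough remaining allocable units; among them,
    # pick the group with the most already-assigned tasks.
    eligible = [(gid, g) for gid, g in groups.items()
                if g["free_units"] >= demand_units]
    if not eligible:
        return None  # no group can satisfy the training task
    gid, g = max(eligible, key=lambda kv: kv[1]["assigned_tasks"])
    g["assigned_tasks"] += 1
    g["free_units"] -= demand_units
    return gid
```

With groups `a` (3 tasks, 2 free units) and `b` (1 task, 4 free units), a two-unit task lands on `a`, leaving `b`'s four units intact for a larger request.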
5. The method for processing resource tasks of an artificial intelligence model training platform according to claim 1, wherein receiving the model training request sent by the user and performing calculation processing according to the computing power demand in the model training request and an idle cloud computing service computing power unit in the virtual cluster of the distributed model training platform to obtain a model training result further comprises:
Receiving a model training request sent by a user, and, if the model training request corresponds to a data processing task, determining a corresponding estimated resource occupancy rate according to the network structure and batch processing information of the data processing task;
And determining a corresponding allocable unit group according to the estimated resource occupancy rate, and allocating the data processing tasks to the allocable unit group for calculation processing to obtain a model training result, wherein a plurality of data processing tasks share the same allocable unit group.
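Claim 5 estimates an occupancy rate from the network structure and batch information and then lets several data processing tasks share one allocable unit group. The patent does not specify the estimator, so the proportional formula below is purely an assumption used to make the sharing rule concrete:

```python
def estimate_occupancy(param_count, batch_size, unit_capacity):
    # Assumed proxy: occupancy grows with model size (network structure)
    # and batch size, capped at full occupancy of the unit group.
    return min(1.0, (param_count * batch_size) / unit_capacity)

def assign_shared(group_tasks, task, occupancy, limit=1.0):
    # Several data processing tasks may share the same allocable unit
    # group while their combined estimated occupancy stays under the limit.
    used = sum(o for _, o in group_tasks)
    if used + occupancy <= limit:
        group_tasks.append((task, occupancy))
        return True
    return False
```

The point of the estimate is admission control: a new task joins the shared group only if the group's summed occupancy stays within capacity; otherwise another group must be chosen.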
6. The method for processing resource tasks of an artificial intelligence model training platform according to claim 1, wherein receiving the model training request sent by the user and performing calculation processing according to the computing power demand in the model training request and an idle cloud computing service computing power unit in the virtual cluster of the distributed model training platform to obtain a model training result further comprises:
Receiving a model training request sent by a user, and, if the model training request corresponds to an algorithm development task, selecting a virtual cluster constructed from a designated low-computing-power allocable unit group to perform calculation processing according to the task requirements in the model training request to obtain a model training result;
And monitoring, according to a set time period, whether the CPU utilization of the virtual cluster and the highest utilization rate of its cloud computing service computing power units are lower than a preset minimum threshold; if so, stopping the algorithm development task and releasing the computing resources of the virtual cluster.
7. The method for processing resource tasks of an artificial intelligence model training platform according to claim 1, wherein receiving the model training request sent by the user and performing calculation processing according to the computing power demand in the model training request and an idle cloud computing service computing power unit in the virtual cluster of the distributed model training platform to obtain a model training result further comprises:
Receiving a model training request sent by a user, and, if the model training request corresponds to an algorithm training task, performing calculation through the virtual cluster according to the task requirements in the model training request to obtain a model training result;
And monitoring, according to a set time period, whether the CPU utilization of the virtual cluster and the highest utilization rate of its cloud computing service computing power units are lower than a preset minimum threshold; if so, stopping the algorithm training task and releasing the computing resources of the virtual cluster.
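Claims 6 and 7 share one reclamation rule: on a set period, check whether both the cluster's CPU utilization and the peak utilization of its computing power units have fallen below a preset minimum, and if so stop the task and release the cluster's resources. A sketch of that periodic check (the field names and the 5% default are assumptions):

```python
def monitor_and_reclaim(cluster_stats, min_threshold=0.05):
    # cluster_stats holds the latest sampled peaks for one virtual cluster.
    cpu_peak = cluster_stats["cpu_peak"]
    unit_peak = cluster_stats["compute_unit_peak"]
    # Both peaks must fall below the preset minimum before the task is
    # stopped and the cluster's computing resources are released.
    if cpu_peak < min_threshold and unit_peak < min_threshold:
        return "stop_and_release"
    return "keep_running"
```

Requiring both peaks to be low avoids killing a job that is, say, CPU-idle while its computing power units are still busy on a long kernel.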
8. A resource task processing device of an artificial intelligence model training platform, the device comprising:
The allocable unit determination module is used for determining the allocable units of the distributed model training platform according to the CPU parameters of the physical machine server, the memory parameters of the physical machine server, and the preset graphics card parameters;
The virtual cluster construction module is used for dividing groups according to the physical machine server to which the allocable units belong to obtain a plurality of allocable unit groups, and constructing a corresponding virtual cluster according to at least one allocable unit group;
The model training module is used for receiving a model training request sent by a user and performing calculation processing according to the computing power demand in the model training request and the idle cloud computing service computing power units in the virtual cluster of the distributed model training platform to obtain a model training result.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the resource task processing method of the artificial intelligence model training platform according to any one of claims 1 to 7 when executing the program.
10. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the resource task processing method of an artificial intelligence model training platform according to any one of claims 1 to 7.
CN202410298768.5A 2024-03-15 2024-03-15 Resource task processing method and device of artificial intelligent model training platform Pending CN117891618A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410298768.5A CN117891618A (en) 2024-03-15 2024-03-15 Resource task processing method and device of artificial intelligent model training platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410298768.5A CN117891618A (en) 2024-03-15 2024-03-15 Resource task processing method and device of artificial intelligent model training platform

Publications (1)

Publication Number Publication Date
CN117891618A 2024-04-16

Family

ID=90652165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410298768.5A Pending CN117891618A (en) 2024-03-15 2024-03-15 Resource task processing method and device of artificial intelligent model training platform

Country Status (1)

Country Link
CN (1) CN117891618A (en)

Similar Documents

Publication Publication Date Title
US11265369B2 (en) Methods and systems for intelligent distribution of workloads to multi-access edge compute nodes on a communication network
US7747834B2 (en) Memory manager for an embedded system
JP2021516395A (en) Resource configuration method, equipment, terminals, and storage media
CN110389816B (en) Method, apparatus and computer readable medium for resource scheduling
CN106104485A (en) Dynamic resource management for multi-process application
CN106104484A (en) The resource management of profile is used based on the resource that the XOR because of equipment is different because of user
CN108563500A (en) Method for scheduling task, cloud platform based on cloud platform and computer storage media
CN109840142A (en) Thread control method, device, electronic equipment and storage medium based on cloud monitoring
JP2007241394A (en) Division processing management device, division processing management system, arithmetic processing execution system and division processing management method
CN104243405A (en) Request processing method, device and system
CN109815007A (en) Thread control method, device, electronic equipment and storage medium based on cloud monitoring
CN106973114B (en) Access method, server and system
CN112925607A (en) System capacity expansion and contraction method and device and electronic equipment
CN102264109A (en) Method of distributing bandwidth for service and for terminal service execution, and equipment thereof
CN108647092A (en) Cloud storage method, cloud platform and computer readable storage medium
CN112600761A (en) Resource allocation method, device and storage medium
CN103713852A (en) Method for processing information, service platform and electronic devices
CN116450290A (en) Computer resource management method and device, cloud server and storage medium
CN117891618A (en) Resource task processing method and device of artificial intelligent model training platform
CN115840649A (en) Method and device for allocating partitioned capacity block type virtual resources, storage medium and terminal
CN115952054A (en) Simulation task resource management method, device, equipment and medium
CN114489978A (en) Resource scheduling method, device, equipment and storage medium
CN114003388A (en) Method and device for determining task parameters of big data computing engine
CN114035940A (en) Resource allocation method and device
CN206790524U A cloud service system based on performance requirements and ordering

Legal Events

Date Code Title Description
PB01 Publication