CN114816758B - Resource allocation method and device

Resource allocation method and device

Info

Publication number
CN114816758B
Authority
CN
China
Prior art keywords
storage space
training data
size
determining
data set
Prior art date
Legal status
Active
Application number
CN202210511762.2A
Other languages
Chinese (zh)
Other versions
CN114816758A
Inventor
李晓晨
叶方捷
谭荣
郑小裕
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210511762.2A
Publication of CN114816758A
Application granted
Publication of CN114816758B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5011 Allocation of resources to service a request, the resources being hardware resources other than CPUs, Servers and Terminals

Abstract

The disclosure provides a resource allocation method and device, relating to the field of artificial intelligence and, in particular, to cloud computing and machine learning. The scheme is as follows: task parameters of a model training task to be allocated to a container are acquired; resource configuration information of the container is determined according to those task parameters; and corresponding resources are configured for the container according to the resource configuration information. By taking the model training task into account, the resource configuration information the container needs to execute the task is determined accurately, and resources are allocated to the container on that basis, so that container resources are allocated automatically and with improved accuracy.

Description

Resource allocation method and device
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of cloud computing and machine learning, and more particularly to a resource allocation method and apparatus.
Background
With the development of the internet, cloud computing has received increasing attention. A container cloud platform is a container virtualization platform built on cloud computing, and performing machine learning tasks on such a platform has become the main mode of large-scale machine learning training.
In the related art, when a machine learning task is performed on a container cloud platform, a model training task is allocated to a container before the container is started, and the container's resource configuration information is set manually. Manual configuration, however, easily leads to inaccurate resource allocation, which causes the model training task to fail or wastes resources.
Disclosure of Invention
The disclosure provides a resource allocation method, a resource allocation device, an electronic device, a storage medium and a computer program product.
According to a first aspect of the present disclosure, there is provided a resource allocation method, including: acquiring task parameters of a model training task to be allocated to a container; determining resource configuration information of the container according to the task parameters of the model training task; and configuring corresponding resources for the container according to the resource configuration information.
According to a second aspect of the present disclosure, there is provided a resource allocation apparatus, including: the acquisition module is used for acquiring task parameters of a model training task to be allocated to a container; the determining module is used for determining the resource configuration information of the container according to the task parameters of the model training task; and the configuration module is used for configuring corresponding resources for the container according to the resource configuration information.
According to a third aspect of the present disclosure, there is provided an electronic device comprising a processor and a memory; the processor implements the resource allocation method provided in the first aspect by reading executable program code stored in the memory and running a program corresponding to that code.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the resource allocation method provided in the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising instructions which, when executed by a processor, implement the resource allocation method provided in the first aspect.
One embodiment in the above application has the following advantages or benefits:
with the resource allocation method and device provided by the disclosure, the resource configuration information required by the container to execute the model training task is determined accurately by taking the model training task into account, and resources are allocated to the container on that basis, so that container resources are allocated automatically and with improved accuracy.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic illustration according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic illustration according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic illustration according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic illustration according to an eighth embodiment of the present disclosure;
FIG. 9 is a schematic diagram according to a ninth embodiment of the present disclosure;
FIG. 10 is a schematic diagram according to a tenth embodiment of the present disclosure;
FIG. 11 is a block diagram of an electronic device for implementing a resource allocation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments are included to assist understanding; they are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the disclosure. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in fig. 1, the resource allocation method includes the following steps:
step 101, obtaining task parameters of a model training task to be allocated to a container.
It should be noted that the resource allocation method provided in the embodiments of the present disclosure is performed by a resource allocation apparatus. The apparatus may be implemented in software and/or hardware, and it may itself be an electronic device or be configured in one. The electronic device in the embodiments of the present disclosure may be a PC (Personal Computer), a mobile device, a tablet computer, a terminal device, a server, or the like, which is not specifically limited here.
It should be noted that, in this embodiment, the resource allocation apparatus is taken to be an electronic device, and the electronic device performs resource allocation management for containers in the container cloud platform.
In one embodiment of the present disclosure, the task parameters of the model training task may include a training data set, a machine learning algorithm and a preset training duration, and one or more of these task parameters may be selected according to the actual needs of the task.
In an embodiment of the present disclosure, when resources need to be allocated to a container, the model training task to be allocated to the container may be obtained, and the task parameters corresponding to it may be looked up from the stored correspondence between model training tasks and task parameters.
And 102, determining resource configuration information of the container according to the task parameters of the model training task.
The resource configuration information is the configuration of the resources the container requires when executing the model training task; it defines how those resources are to be selected from the resource pool, for example their size, type or location.
In one embodiment of the present disclosure, the resource configuration information of the container may include, but is not limited to, at least one of the container's storage space size, its processor core count, and the like.
And 103, configuring corresponding resources for the container according to the resource configuration information.
The resources are those selected for the container from the resource pool on the basis of the resource configuration information.
In one embodiment of the present disclosure, where the resource configuration information includes a storage space size, a storage space of that size may be configured for the container.
In another embodiment of the present disclosure, where the resource configuration information includes a processor core count, processors may be configured for the container according to that count.
In another embodiment of the present disclosure, where the resource configuration information includes both a storage space size and a processor core count, a storage space of that size may be configured for the container and processors configured according to that count.
With the resource allocation method provided by the disclosure, when resources are allocated to a container, the task parameters of the model training task to be allocated to the container are obtained, the resource configuration information of the container is determined according to those task parameters, and the corresponding resources are configured for the container according to that information. By taking the model training task into account, the resource configuration information the container needs to execute the task is determined accurately, and resources are allocated to the container on that basis, so that container resources are allocated automatically and with improved accuracy.
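To make these three steps concrete, here is a minimal Python sketch of the flow of fig. 1. It is an illustration only, not the disclosed implementation: the helpers estimate_storage_gb and estimate_core_count stand for the calculations detailed in the later embodiments, and container_api is a placeholder for whatever container platform interface is actually used; all three names are assumptions.

    from dataclasses import dataclass

    @dataclass
    class ResourceConfig:
        storage_gb: float  # storage space size to configure for the container
        cpu_cores: int     # number of processor cores to configure

    def allocate_resources(task):
        # Step 101: obtain the task parameters of the model training task.
        dataset = task["dataset"]
        algorithm = task["algorithm"]
        duration_s = task["preset_training_duration_s"]
        # Step 102: determine the resource configuration information from the task parameters.
        config = ResourceConfig(
            storage_gb=estimate_storage_gb(dataset, algorithm),             # hypothetical helper
            cpu_cores=estimate_core_count(dataset, algorithm, duration_s),  # hypothetical helper
        )
        # Step 103: configure the corresponding resources for the container
        # (container_api is a stand-in for the platform's real interface).
        container_api.create_container(memory=f"{config.storage_gb}Gi", cpus=config.cpu_cores)
        return config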
In an embodiment of the present disclosure, when the task parameters include a training data set and a machine learning algorithm, the storage space entry in the container's resource configuration information can be determined accurately, and the corresponding resources configured, as follows. As shown in fig. 2, the resource allocation method includes:
step 201, determining a first storage space size occupied by a training data set.
It should be noted that the training data set includes the training data required by the model training task, and the machine learning algorithm may be a configured supervised learning, unsupervised learning or reinforcement learning algorithm.
The training data set is obtained by sampling past real data.
In one embodiment of the present disclosure, the set of training data may include a plurality of training data.
As an exemplary implementation, the storage space occupied by a single piece of training data in the training data set may be calculated and then multiplied by the total number of pieces of training data, giving the storage space occupied by all training data in the set, which is the first storage space size occupied by the training data set.
Step 202, determining a size of a second storage space to be configured for the container according to the machine learning algorithm and the size of the first storage space.
Wherein the resource configuration information includes a size of the second storage space.
In an embodiment of the present disclosure, the storage space correspondence of the machine learning algorithm may be obtained, the storage space size corresponding to the first storage space size looked up in it, and the result used as the second storage space size.
The storage space correspondence records the minimum storage space sizes the machine learning algorithm requires when run on test data sets of different storage space sizes.
The test data set may be randomly generated or obtained through real data collection.
Typically there are a plurality of test data sets. For each test data set, an exemplary process for determining the minimum storage space size the machine learning algorithm requires when using that set is as follows:
Specifically, upper and lower bound values of the minimum storage space are first preset to give its initial value range: a sufficiently large value serves as the initial upper bound and zero as the initial lower bound. The average of the two bounds is then taken as the candidate size of the minimum storage space required by the machine learning algorithm on the test data set.
The machine learning task is started with this average as the storage space limit. If the task runs successfully, the average becomes the new upper bound of the minimum storage space; if the task fails for lack of storage space, the average becomes the new lower bound. The procedure is repeated until the difference between the upper and lower bound values shrinks into a preset range.
The preset range is the threshold on the difference between the upper and lower bound values of the minimum storage space.
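A minimal sketch of this bisection, assuming a hypothetical callable run_task(dataset, mem_limit_gb) that starts the machine learning task under the given storage limit and returns True if it runs successfully and False if it fails for lack of storage space; the bounds and tolerance are illustrative.

    def find_min_storage_gb(test_dataset, run_task,
                            upper_gb=256.0, lower_gb=0.0, tolerance_gb=0.5):
        """Bisect between the upper and lower bounds until their difference
        falls within the preset range (tolerance_gb)."""
        while upper_gb - lower_gb > tolerance_gb:
            mid_gb = (upper_gb + lower_gb) / 2
            if run_task(test_dataset, mem_limit_gb=mid_gb):
                upper_gb = mid_gb  # success: the average becomes the new upper bound
            else:
                lower_gb = mid_gb  # failure: the average becomes the new lower bound
        return upper_gb            # smallest limit known to succeed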
In an embodiment of the present disclosure, the storage space related parameters corresponding to the machine learning algorithm and a specified first calculation function may be obtained, and the first storage space size input into the first calculation function to obtain the second storage space size to be configured for the container.
And step 203, configuring a storage space for the container according to the second storage space size.
In an embodiment of the present disclosure, the obtained second storage space size is the storage space size of the container; that is, a storage space of that size can be configured for the container.
In the embodiment of the present disclosure, when the task parameters include a training data set, a machine learning algorithm and a preset training duration, the processor core count in the container's resource configuration information can be determined accurately, and the corresponding resources configured, as follows. As shown in fig. 3, the resource allocation method includes:
step 301, determining a first storage space size occupied by a training data set.
For a detailed description of this step, refer to step 201 above; it is not repeated here.
Step 302, determining the number of processor cores to be configured in the container according to the machine learning algorithm, the size of the first storage space and the preset training time.
The resource configuration information comprises the number of the processor cores.
In an embodiment of the disclosure, a second calculation function corresponding to the machine learning algorithm may be obtained, and the first storage space size and the preset training duration input into it to obtain the number of processor cores to be configured for the container.
The preset training duration is the duration set as required for the model training.
In another embodiment of the present disclosure, when the machine learning algorithm is determined to support multi-core accelerated computing, the resource occupation raw data corresponding to the algorithm may be obtained, and the processor core count whose recorded running time is closest to the preset training duration and whose storage space size is closest to the second storage space size is selected from it. The resource occupation raw data records the running times of the machine learning algorithm when model training is performed on test data sets of different storage space sizes with different processor core counts.
One way to determine whether the machine learning algorithm supports multi-core accelerated computing is the following: with the other configurations of the model training task fixed, the task is run with different processor core counts (for example 1, 2, 4, 8 and 16) and the running times are recorded. If the running time turns out to be unrelated to the core count, the machine learning algorithm is determined not to support multi-core accelerated computing.
In other embodiments, if the running time does depend on the processor core count, the machine learning algorithm is determined to support multi-core accelerated computing.
In some exemplary embodiments, the resource occupation raw data may be acquired by testing the machine learning algorithm on test data of different storage space sizes with processors of different core counts, thereby obtaining the algorithm's running time under each combination.
In some embodiments of the disclosure, in the case where the machine learning algorithm does not support multi-core accelerated computations, the number of processor cores to be configured for the container may be determined to be one.
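A hedged sketch covering both cases just described: the single-core fallback and the lookup in the resource occupation raw data. The record layout and field names are illustrative, and "closest" is read here as comparing first by running time and then by storage size, which is one simple interpretation of the description.

    def pick_core_count(records, preset_duration_s, second_storage_gb, supports_multicore):
        """Select, from the resource occupation raw data, the core count whose
        running time is closest to the preset training duration and whose
        storage space size is closest to the second storage space size."""
        if not supports_multicore:
            return 1  # the algorithm gains nothing from extra cores
        best = min(records, key=lambda r: (abs(r["runtime_s"] - preset_duration_s),
                                           abs(r["storage_gb"] - second_storage_gb)))
        return best["cores"]

    # Illustrative raw data: storage size, core count and measured running time.
    records = [{"storage_gb": 2, "cores": 4, "runtime_s": 3600},
               {"storage_gb": 4, "cores": 8, "runtime_s": 1900}]
    print(pick_core_count(records, 2000, 4, True))  # -> 8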
Step 303, configuring a processor for the container according to the processor core number.
In one embodiment of the disclosure, a corresponding number of processors may be configured for the container based on the derived number of processor cores.
Specifically, if the obtained processor core count is one, a single processor is configured for the container; if the count is greater than one, the corresponding plurality of processors is configured for the container.
In an embodiment of the present disclosure, in order to configure both the container's storage space and its processors accurately when the task parameters include a training data set, a machine learning algorithm and a preset training duration, one possible implementation of determining the resource configuration information from the task parameters is: determining the first storage space size occupied by the training data set; determining the second storage space size to be configured for the container according to the machine learning algorithm and the first storage space size, the resource configuration information including that second size; and determining the processor core count to be configured for the container according to the machine learning algorithm, the first storage space size and the preset training duration, the resource configuration information further including that count.
Correspondingly, one possible implementation of configuring the corresponding resources for the container is: configuring a storage space for the container according to the second storage space size, and configuring processors for the container according to the processor core count.
In this embodiment of the present disclosure, in order to determine accurately the first storage space size occupied by the training data set, a possible implementation of determining that size in step 201 or step 301, shown in fig. 4, may include:
step 401, determining a data type of training data in a training data set.
In one embodiment of the present disclosure, the data types of the training data may be divided into two categories: those for which the occupied storage space size is determinate and those for which it is indeterminate.
As an example, the data type of the training data occupying the memory space of the size determination may be a boolean type, an integer type, a floating point type, a double precision floating point type, or the like.
As another example, the data type of training data that occupies an indeterminate amount of storage space may be a string data type.
Step 402, determining a first storage space size occupied by the training data set according to the data type and the total number of the training data in the training data set.
In one embodiment of the present disclosure, when the first storage space size is calculated, training data fall into two categories, those whose occupied storage space size is determinate and those whose size is indeterminate, because of the difference in data types; the way the storage space size of a single piece of training data is obtained therefore differs between the two. In both cases, however, the first storage space size occupied by the training data set is obtained by multiplying the storage space size of a single piece of training data by the total number of pieces of training data in the set.
In an embodiment of the present disclosure, calculating the first storage space occupied by the training data set according to the data type yields a more accurate value, so that resources can be allocated more appropriately.
In this embodiment of the present disclosure, in order to determine accurately the first storage space size occupied by the training data set, a possible implementation of step 402, determining that size according to the data type and the total number of training data in the training data set, is shown in fig. 5 and includes:
step 501, under the condition that the data type is the first data type, obtaining the size of the storage space occupied by the single training data of the first data type.
Wherein, the size of the storage space occupied by the single training data of the first data type is fixed.
Step 502, determining a first storage space size occupied by the training data set according to the storage space size and the total number of the training data in the training data set.
In an embodiment of the present disclosure, the storage space size occupied by a single piece of training data is multiplied by the total number of pieces of training data in the training data set to obtain the storage space occupied by all training data in the set, which is the first storage space size.
In this embodiment of the present disclosure, in order to determine accurately the first storage space size occupied by the training data set, another possible implementation of step 402, determining that size according to the data type and the total number of training data in the training data set, is shown in fig. 6 and may include:
step 601, under the condition that the data type is the second data type, randomly extracting N pieces of training data from the training data set.
Wherein the size of the storage space occupied by the single training data of the second data type is not fixed.
As an exemplary embodiment, in order to calculate more accurately the storage space occupied by a single piece of training data, N pieces of training data may be randomly extracted from the training data set.
N is an integer greater than 1 and less than the total number of pieces of training data in the training data set. In some exemplary embodiments a training data set usually contains thousands of pieces of training data, and N may be 100 or 200; in practical applications the value of N may be set according to actual requirements, which this embodiment does not limit.
As another exemplary embodiment, N specified pieces of training data may instead be extracted from the training data set, namely the N/2 pieces occupying the largest storage space and the N/2 pieces occupying the smallest.
Step 602, determining an average size of a storage space occupied by the N training data according to the respective sizes of the storage space occupied by the N training data.
In an embodiment of the present disclosure, the storage space sizes occupied by the N extracted pieces of training data are summed, and the sum is divided by N to obtain the average storage space size occupied by a piece of training data.
Step 603, determining the first storage space size occupied by the training data set according to the average storage space size and the total number of the training data in the training data set.
In an embodiment of the present disclosure, the obtained average storage space size is taken as the storage space size of a single piece of training data; multiplying it by the total number of pieces of training data in the training data set gives the storage space occupied by all training data in the set, which is the first storage space size.
In this embodiment, without loading the whole training data set, the first storage space size it occupies is estimated reasonably from the storage space occupied by part of its training data, so that the size can be determined conveniently and quickly.
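Both branches of fig. 5 and fig. 6 can be sketched together as follows. The use of sys.getsizeof to measure a record is an assumption for illustration; in practice the per-record size would come from the data's actual storage format.

    import random
    import sys

    def first_storage_size_bytes(dataset, fixed_record_bytes=None, n=100):
        """First data type: every record occupies the same known space.
        Second data type: estimate the per-record size from a random sample of N."""
        total = len(dataset)
        if fixed_record_bytes is not None:              # fig. 5: size is fixed
            return fixed_record_bytes * total
        sample = random.sample(dataset, min(n, total))  # fig. 6: sample N records
        avg_bytes = sum(sys.getsizeof(r) for r in sample) / len(sample)
        return avg_bytes * total

    print(first_storage_size_bytes([0.0] * 10000, fixed_record_bytes=8))  # 80000
    print(first_storage_size_bytes(["short", "a much longer string value"] * 5000))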
In this embodiment of the disclosure, in order to determine accurately the second storage space size to be configured for the container, an implementation of step 202, determining that size according to the machine learning algorithm and the first storage space size, is shown in fig. 7 and may include:
step 701, obtaining storage space related parameters of a machine learning algorithm.
In an embodiment of the present disclosure, one implementation of obtaining the storage space related parameters corresponding to the machine learning algorithm is as follows: the minimum storage space sizes the machine learning algorithm requires when run on test data sets of different storage space sizes are obtained, and the storage space related parameters are determined from them.
Specifically, the storage space size p of each test data set and the corresponding minimum memory value q are taken as (p, q) pairs, the function q = kp + b is fitted by the least squares method, and the values of the slope k, the intercept b and the standard deviation θ are determined; k, b and θ are the storage space related parameters of the machine learning algorithm.
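A sketch of this fit using ordinary least squares via numpy; θ is taken here as the standard deviation of the fit residuals, which is one reading of the description's "standard deviation".

    import numpy as np

    def fit_storage_params(pairs):
        """Fit q = k*p + b over (p, q) pairs of test data set size and
        measured minimum memory value, and return k, b and theta."""
        p = np.array([x[0] for x in pairs], dtype=float)
        q = np.array([x[1] for x in pairs], dtype=float)
        A = np.stack([p, np.ones_like(p)], axis=1)
        (k, b), *_ = np.linalg.lstsq(A, q, rcond=None)
        theta = float(np.std(q - (k * p + b)))  # standard deviation of the residuals
        return float(k), float(b), theta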
Step 702, determining the size of a storage space required by the training data set to complete the model training task under the machine learning algorithm according to the storage space related parameter, the first storage space size and the first calculation function.
Wherein the first calculation function may be a function set in advance for calculating the size of the storage space.
In one embodiment of the present disclosure, the first calculation function may be Y1 = ceil1GB(k × x1 + b + 3 × θ), where ceil1GB rounds up to the next multiple of 1 GB to ensure sufficient memory resources, k × x1 + b estimates the average memory occupation, and 3 × θ provides reasonable redundant memory, leaving room for memory fluctuation during the machine learning task.
Here k, b and θ are the storage space related parameters of the machine learning algorithm.
In an embodiment of the present disclosure, the first storage space size x2 occupied by the training data set is used as the input of the first calculation function, and the storage space size required to complete the model training task is Y2 = ceil1GB(k × x2 + b + 3 × θ).
Step 703, using the obtained size of the storage space as the size of the second storage space to be configured for the container.
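Applying the first calculation function with fitted parameters, assuming all sizes are expressed in GB so that ceil1GB reduces to an ordinary ceiling; the (p, q) pairs are illustrative and fit_storage_params is the sketch above.

    import math

    def second_storage_size_gb(x_gb, k, b, theta):
        # ceil1GB: round k*x + b + 3*theta up to the next whole multiple of 1 GB.
        return math.ceil(k * x_gb + b + 3 * theta)

    k, b, theta = fit_storage_params([(1, 2.1), (2, 3.9), (4, 7.8)])
    print(second_storage_size_gb(2.5, k, b, theta))  # second storage space size in GB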
In this embodiment of the present disclosure, in order to determine accurately the number of processor cores to be configured for the container, an implementation of step 302, determining that number according to the machine learning algorithm, the first storage space size and the preset training duration, is shown in fig. 8 and may include:
step 801, obtaining processor related parameters of a machine learning algorithm.
In one embodiment of the present disclosure, one achievable way to obtain the processor related parameters of the machine learning algorithm is as follows: the running times of the machine learning algorithm when model training is performed on test data sets of different storage space sizes with different processor core counts are obtained, and the processor related parameters are determined from them.
Specifically, the test data set size p1, the processor core count n and the running time t are taken as data triples (p1, n, t), and a binary linear relation with p1/n and p1 as the two independent variables is fitted by least squares to obtain the values of the three undetermined coefficients c, d and e; these are the processor related parameters of the machine learning algorithm.
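A sketch of this fit. The functional form t = c × (p1/n) + d × p1 + e is inferred from the stated independent variables p1/n and p1 together with the second calculation function below; it is an assumption consistent with the description, not quoted from it.

    import numpy as np

    def fit_processor_params(triples):
        """Fit t = c*(p1/n) + d*p1 + e over (p1, n, t) triples by least squares."""
        p1 = np.array([x[0] for x in triples], dtype=float)
        n = np.array([x[1] for x in triples], dtype=float)
        t = np.array([x[2] for x in triples], dtype=float)
        A = np.stack([p1 / n, p1, np.ones_like(p1)], axis=1)
        (c, d, e), *_ = np.linalg.lstsq(A, t, rcond=None)
        return float(c), float(d), float(e)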
Step 802, determining, according to the preset training duration, the first storage space size, the processor related parameters and the second calculation function, the number of processor cores required for the training data set to be trained within the preset training duration under the machine learning algorithm.
The second calculation function may be a function that is set in advance and used for calculating the number of the processor cores.
In one embodiment of the present disclosure, in a case where the machine learning algorithm cannot support multi-core accelerated computation, a single-core processor needs to be configured for the container.
In another embodiment of the present disclosure, where the machine learning algorithm supports multi-core accelerated computation, the specified second calculation function for computing the processor core count may be Y3 = ceil(c × x3/(t - d × x3 - e)), where ceil rounds up so that the obtained core count is an integer.
Here c, d and e are the processor related parameters of the machine learning algorithm, t denotes the preset training duration, and x3 denotes the first storage space size occupied by the training data set.
The preset training duration t may be specified to be greater than d × x3 + e, so that the denominator of the second calculation function is positive.
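The second calculation function transcribed directly, with a guard that enforces the positivity of the denominator noted above; the parameter names follow the description.

    import math

    def core_count(x3_gb, t_s, c, d, e):
        """Y3 = ceil(c*x3 / (t - d*x3 - e)); t must exceed d*x3 + e."""
        denominator = t_s - d * x3_gb - e
        if denominator <= 0:
            raise ValueError("preset training duration too short for this data set")
        return math.ceil(c * x3_gb / denominator)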
And step 803, taking the obtained processor core number as the processor core number to be configured of the container.
Corresponding to the resource allocation methods provided in the foregoing embodiments, an embodiment of the present disclosure further provides a resource allocation apparatus. Since the apparatus corresponds to the methods above, the implementations of the methods also apply to the apparatus, and they are not described in detail in the following embodiments.
Fig. 9 is a schematic diagram according to a ninth embodiment of the present disclosure. As shown in fig. 9, the resource allocation apparatus 90 includes: an obtaining module 91, a determining module 92 and a configuration module 93. Wherein:
the obtaining module 91 is configured to obtain task parameters of a model training task to be allocated to a container.
A determining module 92, configured to determine the resource configuration information of the container according to the task parameters of the model training task.
And a configuration module 93, configured to configure corresponding resources for the container according to the resource configuration information.
Specifically, the configuration module 93 is configured to configure a storage space for the container according to the second storage space size; and configuring the processor for the container according to the number of the processor cores.
When the resource allocation apparatus allocates resources to a container, it obtains the task parameters of the model training task to be allocated to the container, determines the resource configuration information of the container according to those task parameters, and configures the corresponding resources for the container according to that information. By taking the model training task into account, the resource configuration information the container needs to execute the task is determined accurately, and resources are allocated to the container on that basis, so that container resources are allocated automatically and with improved accuracy.
In an embodiment of the present disclosure, fig. 10 is a schematic diagram according to a tenth embodiment of the present disclosure. As shown in fig. 10, the resource allocation apparatus 100 may include: an obtaining module 101, a determining module 102 and a configuration module 103, wherein the determining module 102 may include a first determining unit 1021, a second determining unit 1022 and a third determining unit 1023, and the first determining unit 1021 may include a first determining subunit 10211 and a second determining subunit 10212.
It should be noted that, for detailed descriptions of the obtaining module 101 and the configuration module 103, reference may be made to the descriptions of the obtaining module 91 and the configuration module 93 in fig. 9, which are not repeated here.
In one embodiment of the present disclosure, the task parameters include a training data set and a machine learning algorithm, and the determining module 102 may include:
a first determining unit 1021, configured to determine a first storage space size occupied by the training data set.
The second determining unit 1022 is specifically configured to determine a second storage space size to be configured for the container according to the machine learning algorithm and the first storage space size, where the resource configuration information includes the second storage space size.
The configuration module 103 is specifically configured to: and configuring the storage space for the container according to the second storage space size.
In an embodiment of the disclosure, the task parameters include a training data set, a machine learning algorithm, and a preset training duration, and the determining module 102 may include:
a first determining unit 1021, configured to determine a first storage space size occupied by the training data set.
The third determining unit 1023 is configured to determine the number of processor cores to be configured in the container according to the machine learning algorithm, the size of the first storage space, and the preset training duration, where the resource configuration information includes the number of processor cores.
The configuration module 103 is specifically configured to: and configuring the processor for the container according to the number of the processor cores.
In one embodiment of the present disclosure, the first determining unit 1021 may include:
a first determining subunit 10211, configured to determine a data type of the training data in the training data set.
The second determining subunit 10212 is configured to determine, according to the data type and the total number of training data in the training data set, the size of the first storage space occupied by the training data set.
In an embodiment of the disclosure, the second determining subunit 10212 is specifically configured to, when the data type is the first data type, obtain a size of a storage space occupied by a single training data of the first data type, where the size of the storage space occupied by the training data of the single first data type is fixed, and determine the size of the first storage space occupied by the training data set according to the size of the storage space and a total number of training data in the training data set.
In another embodiment of the present disclosure, the second determining subunit 10212 is specifically configured to: and under the condition that the data type is a second data type, randomly extracting N training data from a training data set, wherein the size of a storage space occupied by the training data of a single second data type is not fixed, determining the average size of the storage space occupied by the N training data according to the size of the storage space occupied by the N training data, and determining the size of a first storage space occupied by the training data set according to the average size of the storage space and the total number of the training data in the training data set, wherein N is an integer greater than 1 and is smaller than the total number of the training data in the training data set.
In an embodiment of the present disclosure, the second determining unit 1022 is specifically configured to obtain a storage space related parameter of a machine learning algorithm, determine, according to the storage space related parameter, the first storage space size, and a first calculation function, a storage space size required by a training data set to complete a model training task under the machine learning algorithm, and determine, as the size of the second storage space to be configured for the container, the obtained storage space size.
In an embodiment of the present disclosure, the third determining unit 1023 is specifically configured to obtain processor related parameters of a machine learning algorithm, determine, according to a preset training time, a size of the first storage space, the processor related parameters, and a second calculation function, a number of processor cores required by a training data set to train the preset training time under the machine learning algorithm, and use the obtained number of processor cores as a number of processor cores to be configured for the container.
It should be noted that the foregoing description of the resource allocation method embodiments also applies to the resource allocation apparatus and is not repeated here.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 11 is a block diagram of an electronic device 1100 for implementing a resource allocation method of an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the device 1100 includes a computing unit 1101, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1102 or loaded from a storage unit 1108 into a random access memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the device 1100 can also be stored. The computing unit 1101, the ROM 1102 and the RAM 1103 are connected to one another by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
A number of components in the device 1100 are connected to the I/O interface 1105, including: an input unit 1106 such as a keyboard or a mouse; an output unit 1107 such as various types of displays and speakers; a storage unit 1108 such as a magnetic disk or an optical disk; and a communication unit 1109 such as a network card, a modem or a wireless communication transceiver. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 can be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1101 performs the various methods and processes described above, such as the resource allocation method. For example, in some embodiments, the resource allocation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1100 via ROM1102 and/or communication unit 1109. When a computer program is loaded into RAM 1103 and executed by the computing unit 1101, one or more steps of the resource allocation method described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the resource allocation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (16)

1. A method of resource allocation, comprising:
acquiring task parameters of a model training task to be allocated to a container;
determining resource configuration information of the container according to task parameters of the model training task;
configuring corresponding resources for the container according to the resource configuration information;
the method for determining the resource configuration information of the container according to the task parameters of the model training task comprises the following steps:
determining a first storage space size occupied by the training data set;
determining the number of processor cores to be configured of the container according to the machine learning algorithm, the size of the first storage space and the preset training time, wherein a second calculation function corresponding to the machine learning algorithm is obtained, the size of the first storage space and the preset training time are input into the second calculation function, and the number of processor cores to be configured of the container is obtained, wherein the resource configuration information comprises the number of processor cores;
the configuring the corresponding resource for the container according to the resource configuration information includes: and configuring a processor for the container according to the processor core number.
2. The method of claim 1, wherein the task parameters comprise a set of training data and a machine learning algorithm, and the determining resource configuration information for the container from the task parameters of the model training task comprises:
determining a first storage space size occupied by the training data set;
determining a second storage space size to be configured for the container according to the machine learning algorithm and the first storage space size, wherein the resource configuration information comprises the second storage space size;
the configuring the corresponding resource for the container according to the resource configuration information includes:
and configuring storage space for the container according to the size of the second storage space.
3. The method of any of claims 1-2, wherein the determining a first storage size occupied by the set of training data comprises:
determining a data type of training data in the training data set;
and determining the size of a first storage space occupied by the training data set according to the data type and the total number of the training data in the training data set.
4. The method according to claim 3, wherein the determining, according to the data type and the total number of training data in the training data set, a first storage space size occupied by the training data set comprises:
under the condition that the data type is a first data type, acquiring the storage space size occupied by a single piece of training data of the first data type, wherein the storage space size occupied by a single piece of training data of the first data type is fixed;
and determining the first storage space size occupied by the training data set according to the acquired storage space size and the total number of training data in the training data set.
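For the fixed-size branch of claim 4, the first storage space size reduces to a single multiplication. A minimal sketch, assuming a hypothetical lookup table FIXED_ITEM_SIZES of per-item sizes in bytes:

```python
# Hypothetical per-item storage sizes (bytes) for fixed-size "first" data types.
FIXED_ITEM_SIZES = {"int64": 8, "float32": 4, "float64": 8}

def first_storage_size_fixed(data_type: str, total_count: int) -> int:
    # One item of a first data type has a fixed size, so the whole training
    # data set occupies (per-item size) x (total number of training data).
    return FIXED_ITEM_SIZES[data_type] * total_count

print(first_storage_size_fixed("float32", 1_000_000))  # 4000000 bytes
```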
5. The method of claim 3, wherein the determining, according to the data type and the total number of training data in the training data set, the first storage space size occupied by the training data set comprises:
under the condition that the data type is a second data type, randomly extracting N pieces of training data from the training data set, wherein the storage space size occupied by a single piece of training data of the second data type is not fixed;
determining the average storage space size occupied by the N pieces of training data according to the storage space sizes occupied by the N pieces of training data;
and determining the first storage space size occupied by the training data set according to the average storage space size and the total number of training data in the training data set, wherein N is an integer greater than 1 and smaller than the total number of training data in the training data set.
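Claim 5 estimates variable-length data by sampling. A sketch of that estimate follows; measuring an item's storage with len() is an assumption, since the patent does not fix how per-item storage is measured:

```python
import random

def first_storage_size_sampled(data_set: list, n: int) -> float:
    total = len(data_set)
    # N must be an integer greater than 1 and smaller than the total count.
    if not 1 < n < total:
        raise ValueError("N must satisfy 1 < N < total number of training data")
    # Randomly extract N pieces of training data and average their sizes.
    sample = random.sample(data_set, n)
    average = sum(len(item) for item in sample) / n
    # Scale the average back up to the whole training data set.
    return average * total
```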
6. The method of claim 2, wherein the determining a second storage space size to be configured for the container according to the machine learning algorithm and the first storage space size comprises:
obtaining storage space related parameters of the machine learning algorithm;
determining the size of a storage space required by the training data set to complete the model training task under the machine learning algorithm according to the storage space related parameters, the first storage space size and a first calculation function;
and taking the obtained size of the storage space as the size of a second storage space to be configured for the container.
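The patent names a "first calculation function" without disclosing its form. The sketch below assumes, purely for illustration, an affine function whose two coefficients play the role of the algorithm's storage space related parameters:

```python
def second_storage_size(first_size_bytes: int, expansion_factor: float,
                        overhead_bytes: int) -> int:
    # Assumed first calculation function: the storage needed to complete the
    # model training task is the data set size scaled by an algorithm-specific
    # expansion factor, plus a fixed working overhead. Both parameters stand
    # in for the "storage space related parameters".
    return int(first_size_bytes * expansion_factor) + overhead_bytes

# e.g. an algorithm assumed to need 2.5x the data set plus 512 MiB of
# working memory for a ~10 GB training data set:
print(second_storage_size(10_240_000_000, 2.5, 512 * 1024**2))
```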
7. The method of claim 1, wherein the determining the number of processor cores to be configured for the container according to the machine learning algorithm, the first storage space size and the preset training duration comprises:
acquiring processor related parameters of the machine learning algorithm;
determining, according to the preset training duration, the first storage space size, the processor related parameters and a second calculation function, the number of processor cores required for the training data set to be trained for the preset training duration under the machine learning algorithm;
and taking the obtained number of processor cores as the number of processor cores to be configured for the container.
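Likewise, the "second calculation function" is not disclosed. A sketch under the assumption that per-core throughput in bytes per second serves as the processor related parameter:

```python
import math

def processor_core_count(first_size_bytes: int, preset_duration_s: float,
                         throughput_bytes_per_core_s: float) -> int:
    # Assumed second calculation function: total work divided by what a single
    # core can process within the preset training duration, rounded up.
    cores = first_size_bytes / (throughput_bytes_per_core_s * preset_duration_s)
    return max(1, math.ceil(cores))

# e.g. a 10 GB data set, a 1-hour budget, and an assumed 1 MB/s per core:
print(processor_core_count(10_000_000_000, 3600.0, 1_000_000))  # 3
```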
8. A resource allocation apparatus, comprising:
the acquisition module is used for acquiring task parameters of a model training task to be allocated to a container;
the determining module is used for determining the resource configuration information of the container according to the task parameters of the model training task;
the configuration module is used for configuring corresponding resources for the container according to the resource configuration information;
wherein the task parameters comprise a training data set, a machine learning algorithm and a preset training duration, and the determining module comprises:
a first determining unit, configured to determine a size of a first storage space occupied by the training data set;
a third determining unit, configured to determine the number of processor cores to be configured for the container according to the machine learning algorithm, the first storage space size and the preset training duration, wherein a second calculation function corresponding to the machine learning algorithm is obtained, and the first storage space size and the preset training duration are input into the second calculation function to obtain the number of processor cores to be configured for the container, the resource configuration information comprising the number of processor cores;
the configuration module is specifically configured to: configure a processor for the container according to the number of processor cores.
9. The apparatus of claim 8, wherein the task parameters comprise a training data set and a machine learning algorithm, and the determining module comprises:
a first determining unit, configured to determine a size of a first storage space occupied by the training data set;
a second determining unit, configured to determine, according to the machine learning algorithm and the size of the first storage space, a size of a second storage space to be configured for the container, where the resource configuration information includes the size of the second storage space;
the configuration module is specifically configured to: configure storage space for the container according to the second storage space size.
10. The apparatus according to claim 8 or 9, wherein the first determining unit comprises:
a first determining subunit, configured to determine a data type of training data in the training data set;
and the second determining subunit is configured to determine, according to the data type and the total number of the training data in the training data set, the size of the first storage space occupied by the training data set.
11. The apparatus according to claim 10, wherein the second determining subunit is specifically configured to:
under the condition that the data type is a first data type, acquire the storage space size occupied by a single piece of training data of the first data type, wherein the storage space size occupied by a single piece of training data of the first data type is fixed;
and determine the first storage space size occupied by the training data set according to the acquired storage space size and the total number of training data in the training data set.
12. The apparatus according to claim 10, wherein the second determining subunit is specifically configured to:
under the condition that the data type is a second data type, randomly extract N pieces of training data from the training data set, wherein the storage space size occupied by a single piece of training data of the second data type is not fixed;
determine the average storage space size occupied by the N pieces of training data according to the storage space sizes occupied by the N pieces of training data;
and determine the first storage space size occupied by the training data set according to the average storage space size and the total number of training data in the training data set, wherein N is an integer greater than 1 and smaller than the total number of training data in the training data set.
13. The apparatus according to claim 9, wherein the second determining unit is specifically configured to:
obtain storage space related parameters of the machine learning algorithm;
determine, according to the storage space related parameters, the first storage space size and a first calculation function, the storage space size required by the training data set to complete the model training task under the machine learning algorithm;
and determine the obtained storage space size as the second storage space size to be configured for the container.
14. The apparatus according to claim 8, wherein the third determining unit is specifically configured to:
acquire processor related parameters of the machine learning algorithm;
determine, according to the preset training duration, the first storage space size, the processor related parameters and a second calculation function, the number of processor cores required for the training data set to be trained for the preset training duration under the machine learning algorithm;
and take the obtained number of processor cores as the number of processor cores to be configured for the container.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210511762.2A 2022-05-10 2022-05-10 Resource allocation method and device

Publications (2)

Publication Number Publication Date
CN114816758A (en) 2022-07-29
CN114816758B (en) 2023-01-06

Family

ID=82514238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210511762.2A Active CN114816758B (en) 2022-05-10 2022-05-10 Resource allocation method and device

Country Status (1)

Country Link
CN (1) CN114816758B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363303A (en) * 2019-06-14 2019-10-22 平安科技(深圳)有限公司 Smart allocation model training memory method, apparatus and computer readable storage medium
CN113467922A (en) * 2020-03-30 2021-10-01 阿里巴巴集团控股有限公司 Resource management method, device, equipment and storage medium
CN113592209A (en) * 2021-02-04 2021-11-02 腾讯科技(深圳)有限公司 Model training task management method, device, terminal and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109976903B (en) * 2019-02-22 2021-06-29 华中科技大学 Deep learning heterogeneous computing method and system based on layer width memory allocation
US11514317B2 (en) * 2020-03-25 2022-11-29 EMC IP Holding Company LLC Machine learning based resource availability prediction
US11797340B2 (en) * 2020-05-14 2023-10-24 Hewlett Packard Enterprise Development Lp Systems and methods of resource configuration optimization for machine learning workloads
US11829799B2 (en) * 2020-10-13 2023-11-28 International Business Machines Corporation Distributed resource-aware training of machine learning pipelines

Also Published As

Publication number Publication date
CN114816758A (en) 2022-07-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant