CN112148469B - Method and device for managing resources and computer storage medium

Method and device for managing resources and computer storage medium

Info

Publication number
CN112148469B
Authority
CN
China
Prior art keywords
gpu
user
usage
determining
resource
Prior art date
Legal status
Active
Application number
CN201910580361.0A
Other languages
Chinese (zh)
Other versions
CN112148469A (en)
Inventor
林丹峰
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910580361.0A priority Critical patent/CN112148469B/en
Publication of CN112148469A publication Critical patent/CN112148469A/en
Application granted granted Critical
Publication of CN112148469B publication Critical patent/CN112148469B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities

Abstract

The application discloses a method and a device for managing resources and a computer storage medium, belonging to the technical field of deep learning. The method comprises the following steps: determining the usable total amount of a reference resource deployed in a deep learning training platform, and determining a usage threshold for each of a plurality of users according to the usable total amount. The usage threshold indicates the amount of the reference resource that tasks submitted by a user may use, so the resource management method provided by the embodiments of the application can allocate the reference resource on a per-user basis to improve the utilization rate of the reference resource.

Description

Method and device for managing resources and computer storage medium
Technical Field
The present invention relates to the field of deep learning technologies, and in particular, to a method and apparatus for managing resources, and a computer storage medium.
Background
In the deep learning field, a computer typically needs to invoke resources such as a GPU (graphics processing unit) and memory to complete training tasks in the deep learning process. To avoid wasting these resources, the resources in the computer need to be managed so as to increase their utilization.
Disclosure of Invention
The embodiment of the application provides a method, a device and a computer storage medium for managing resources, which can improve the utilization rate of the resources. The technical scheme is as follows:
in one aspect, a method of managing resources is provided, the method comprising:
determining a usable total amount of a reference resource deployed in the deep learning training platform;
and determining a usage threshold of a plurality of users according to the total usable amount, wherein the usage threshold is used for indicating the amount of the reference resource which can be used by the task submitted by the user.
Optionally, the plurality of users are divided into a plurality of user groups, and each user group in the plurality of user groups is configured with an allocation proportion;
the usage threshold of a plurality of users is determined according to the total amount of the available users. Comprising the following steps:
determining the usable amount of each user group in the plurality of user groups according to the usable total amount and the allocation proportion of each user group in the plurality of user groups;
for a first user group in the plurality of user groups, determining the available amount of each user in the first user group according to the available amount of the first user group and the users included in the first user group, wherein the first user group is any one of the plurality of user groups;
A usage threshold for each user is determined based on the available usage for each user.
Optionally, the determining the usage threshold of the plurality of users according to the total usable amount includes:
displaying a resource allocation prompt message, wherein the resource allocation prompt message carries the total usable amount and/or task resource requirements of each user and is used for indicating an administrator to allocate the reference resource according to the total usable amount and/or task resource requirements of each user so as to obtain usage thresholds of the plurality of users;
and detecting a first reporting instruction triggered by the administrator, wherein the first reporting instruction carries the usage amount threshold values of the plurality of users.
Optionally, the memory deployed on the deep learning training platform includes a public storage resource and a private storage resource, the reference resource is the private storage resource in the memory, and the usable total amount of the reference resource is used to indicate the usable storage capacity of the private storage resource.
Optionally, the determining the usable total amount of the reference resource deployed in the deep learning training platform includes:
determining the maximum storage capacity of the memory when the deep learning training platform is initialized;
Displaying a capacity prompt message, wherein the capacity prompt message carries the maximum storage capacity and is used for indicating an administrator to divide the storage into the public storage resource and the private storage resource according to the maximum storage capacity so as to obtain the theoretical maximum storage capacity of the public storage resource and the theoretical maximum storage capacity of the private storage resource;
detecting a second reporting instruction triggered by the administrator, wherein the second reporting instruction carries the theoretical maximum storage capacity of the public storage resource and the theoretical maximum storage capacity of the private storage resource, and the theoretical maximum storage capacity of the private storage resource is used as the usable total amount of the private storage resource.
Optionally, the reference resource is a graphics processing unit (GPU) deployed on the deep learning training platform, and the usable total amount of the reference resource is used to indicate the usable duration of the GPU.
Optionally, the determining the usable total amount of the reference resource deployed in the deep learning training platform includes:
if the current time is an update time point, determining a historical total use time length of the GPU in a first reference time period which is before the current time and is nearest to the current time, wherein the first reference time period is a time length between two adjacent update time points, and taking the historical total use time length as the usable total amount of the GPU.
Optionally, the determining the usable total amount of the reference resource deployed in the deep learning training platform includes:
if the current time is the time when the deep learning training platform is initialized, acquiring the type of each GPU and the theoretical use duration of each GPU in the GPUs;
determining the weight of each GPU according to the type of each GPU;
and determining the theoretical total use duration of the GPUs according to the weight of each GPU in the GPUs and the theoretical use duration of each GPU, and taking the theoretical total use duration as the usable total of the GPUs.
Optionally, the determining the weight of each GPU according to the type of each GPU includes:
displaying a weight configuration message, wherein the weight configuration message carries the type of each GPU in the GPUs and is used for prompting an administrator to configure the weight of each GPU based on the type of each GPU;
detecting a third reporting instruction triggered by the administrator, wherein the third reporting instruction carries the weight of each GPU in the GPUs.
Optionally, the determining the usable total amount of the reference resource deployed in the deep learning training platform includes:
if the current time is the time when the deep learning training platform is initialized, displaying a GPU allocation prompt message for prompting an administrator to acquire the theoretical total use duration of the GPU;
Detecting a fourth reporting instruction triggered by the administrator, wherein the fourth reporting instruction carries the theoretical total using time length of the GPU, and the theoretical total using time length is used as the usable total amount of the GPU.
Optionally, the method further comprises:
acquiring the change condition of the utilization rate of the GPU along with time in a second reference time period which is before the current time and is nearest to the current time;
determining a recommended time period according to the change condition of the utilization rate of the GPU along with time, wherein the utilization rate of the GPU in the recommended time period is lower than the utilization rate in other time periods in the second reference time period;
and sending recommendation information to the plurality of users, wherein the recommendation information carries the recommendation time period and is used for indicating the plurality of users to submit tasks needing to call the GPU in the recommendation time period.
Optionally, the method further comprises:
for a first user of the plurality of users, when a first task submitted by the first user is received, acquiring the historical usage amount of the task submitted by the first user on the reference resource, wherein the first task is a task needing to call the reference resource, and the first user is any user of the plurality of users;
And if the historical usage amount is greater than or equal to the usage amount threshold of the first user, generating and displaying prompt information, wherein the prompt information is used for indicating that the first task cannot be executed currently.
Optionally, after generating and displaying the prompt information, the method further includes:
receiving a usage threshold up-regulation request sent by the first user;
if the reference resource has residual usage at the current time, adjusting the usage threshold of the first user according to the usage threshold up-regulation request so that the historical usage is smaller than the adjusted usage threshold;
and executing the first task.
Optionally, after generating and displaying the prompt information, the method further includes:
detecting an administrator-triggered usage threshold up-adjustment instruction, wherein the usage threshold up-adjustment instruction carries a usage threshold redistributed by the administrator to the first user;
and executing the first task.
In another aspect, an apparatus for managing resources is provided, the apparatus comprising:
a first determining module for determining a usable total amount of reference resources deployed in the deep learning training platform;
and the second determining module is used for determining a using amount threshold value of a plurality of users according to the usable total amount, wherein the using amount threshold value is used for indicating the amount of the reference resource which can be used by the task submitted by the user.
Optionally, the plurality of users are divided into a plurality of user groups, and each user group in the plurality of user groups is configured with an allocation proportion;
the second determining module is specifically configured to:
determining the usable amount of each user group in the plurality of user groups according to the usable total amount and the allocation proportion of each user group in the plurality of user groups;
for a first user group in the plurality of user groups, determining the available amount of each user in the first user group according to the available amount of the first user group and the users included in the first user group, wherein the first user group is any one of the plurality of user groups;
a usage threshold for each user is determined based on the available usage for each user.
Optionally, the second determining module is specifically configured to:
displaying a resource allocation prompt message, wherein the resource allocation prompt message carries the total usable amount and/or task resource requirements of each user and is used for indicating an administrator to allocate the reference resource according to the total usable amount and/or task resource requirements of each user so as to obtain usage thresholds of the plurality of users;
and detecting a first reporting instruction triggered by the administrator, wherein the first reporting instruction carries the usage amount threshold values of the plurality of users.
Optionally, the memory deployed on the deep learning training platform includes a public storage resource and a private storage resource, the reference resource is the private storage resource in the memory, and the usable total amount of the reference resource is used to indicate the usable storage capacity of the private storage resource.
Optionally, the first determining module is specifically configured to:
determining the maximum storage capacity of the memory when the deep learning training platform is initialized;
displaying a capacity prompt message, wherein the capacity prompt message carries the maximum storage capacity and is used for indicating an administrator to divide the storage into the public storage resource and the private storage resource according to the maximum storage capacity so as to obtain the theoretical maximum storage capacity of the public storage resource and the theoretical maximum storage capacity of the private storage resource;
detecting a second reporting instruction triggered by the administrator, wherein the second reporting instruction carries the theoretical maximum storage capacity of the public storage resource and the theoretical maximum storage capacity of the private storage resource, and the theoretical maximum storage capacity of the private storage resource is used as the usable total amount of the private storage resource.
Optionally, the reference resource is a graphics processing unit (GPU) deployed on the deep learning training platform, and the usable total amount of the reference resource is used to indicate the usable duration of the GPU.
Optionally, the first determining module is specifically configured to:
if the current time is an update time point, determining a historical total use time length of the GPU in a first reference time period which is before the current time and is nearest to the current time, wherein the first reference time period is a time length between two adjacent update time points, and taking the historical total use time length as the usable total amount of the GPU.
Optionally, the first determining module is specifically configured to:
if the current time is the time when the deep learning training platform is initialized, acquiring the type of each GPU and the theoretical use duration of each GPU in the GPUs;
determining the weight of each GPU according to the type of each GPU;
and determining the theoretical total use duration of the GPUs according to the weight of each GPU in the GPUs and the theoretical use duration of each GPU, and taking the theoretical total use duration as the usable total of the GPUs.
Optionally, the first determining module is further specifically configured to:
Displaying a weight configuration message, wherein the weight configuration message carries the type of each GPU in the GPUs and is used for prompting an administrator to configure the weight of each GPU based on the type of each GPU;
detecting a third reporting instruction triggered by the administrator, wherein the third reporting instruction carries the weight of each GPU in the GPUs.
Optionally, the first determining module is specifically configured to:
if the current time is the time when the deep learning training platform is initialized, displaying a GPU allocation prompt message for prompting an administrator to acquire the theoretical total use duration of the GPU;
detecting a fourth reporting instruction triggered by the administrator, wherein the fourth reporting instruction carries the theoretical total using time length of the GPU, and the theoretical total using time length is used as the usable total amount of the GPU.
Optionally, the apparatus further comprises:
the first acquisition module is used for acquiring the change condition of the utilization rate of the GPU along with time in a second reference time period which is before the current time and is nearest to the current time;
the third determining module is used for determining a recommended time period according to the change condition of the utilization rate of the GPU along with time, wherein the utilization rate of the GPU in the recommended time period is lower than that in other time periods in the second reference time period;
The sending module is used for sending recommendation information to the plurality of users, wherein the recommendation information carries the recommendation time period and is used for indicating the plurality of users to submit tasks needing to call the GPU in the recommendation time period.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring the historical usage amount of the task submitted by the first user on the reference resource when receiving the first task submitted by the first user, wherein the first task is a task needing to call the reference resource, and the first user is any user of the plurality of users;
and the generation module is used for generating and displaying prompt information if the historical usage amount is greater than or equal to the usage amount threshold of the first user, wherein the prompt information is used for indicating that the first task cannot be executed currently.
Optionally, the apparatus further comprises:
the receiving module is used for receiving a usage threshold up-regulation request sent by the first user;
the adjustment module is used for adjusting the usage threshold of the first user according to the usage threshold up-adjustment request if the reference resource has residual usage at the current time, so that the historical usage is smaller than the adjusted usage threshold;
And the execution module is used for executing the first task.
Optionally, the apparatus further comprises:
the detection module is used for detecting an administrator-triggered usage threshold up-regulation instruction, wherein the usage threshold up-regulation instruction carries a usage threshold redistributed by the administrator to the first user;
and the execution module is used for executing the first task.
In another aspect, an apparatus for managing resources is provided, the apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of any of the methods of managing resources described above.
In another aspect, a computer readable storage medium having instructions stored thereon which when executed by a processor implement the steps of any of the methods of managing resources described above is provided.
The beneficial effects brought by the technical solutions provided in the embodiments of the application are as follows:
in the embodiment of the application, the usable total amount of the reference resource deployed in the deep learning training platform is determined, and the usage threshold of a plurality of users is determined according to the usable total amount. The usage threshold indicates the amount of the reference resource that tasks submitted by a user may use, so the resource management method provided by the embodiments of the application can allocate the reference resource on a per-user basis to improve the utilization rate of the reference resource.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for managing resources according to an embodiment of the present application.
Fig. 2 is a block diagram of an apparatus for managing resources according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for managing resources according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
step 101: a total amount of available reference resources deployed in the deep learning training platform is determined.
The deep learning training platform is the platform whose resources are to be managed. For example, it may be a deep learning training platform deployed inside an enterprise and used for deep learning. Multiple servers may be deployed in the deep learning training platform, and multiple GPUs and multiple memories may be deployed in each server.
After the user submits the deep learning training task, the deep learning training platform needs to invoke resources to perform the deep learning training task. The resources to be invoked may include a GPU for processing data and may also include a memory for storing the processed data. Thus, the reference resource in step 101 may be a GPU or a memory. That is, the method for managing resources provided in the embodiment of the present application may be used for managing resources such as a memory in a deep learning training platform, or may be used for managing resources such as a GPU in the deep learning training platform, which, of course, may also be used for managing resources of other types in the deep learning training platform. For convenience of description later, a terminal that performs resource management will be referred to as a resource management terminal.
Since the reference resource of the type of memory and the reference resource of the type of GPU have different properties, the explanation will be made below for the two cases, respectively.
Example one: for resources such as memory, since memory is used to store data, managing such reference resources refers to managing the storage capacity of the reference resources. That is, when the reference resource in step 101 is a memory, the usable total amount of the reference resource is used to indicate the usable storage capacity of the memory. In addition, the storage capacity of the memory is a fixed variable and does not increase or decrease with time. Thus, the management of the reference resources may be performed at the time of initialization of the deep learning training platform.
To further rationalize the management of the memory, the memory may be divided into a public storage resource and a private storage resource. The public storage resource is used to store public data sets, such as data sets published on the Internet. The private storage resource is allocated to individual users so that each user can store personal data. Thus, the reference resource in step 101 refers specifically to the private storage resource in the memory.
At this time, the implementation manner of step 101 may be: and when the deep learning training platform is initialized, the resource management terminal determines the theoretical maximum storage capacity of the private storage resource, and takes the theoretical maximum storage capacity as the usable total amount of the private storage resource.
For example, the memory in the deep learning training platform has a total storage capacity of 200 TB (terabytes). Of this, 40 TB may be divided in advance into the public storage resource and 160 TB into the private storage resource. The method for managing resources is mainly applied to managing the 160 TB private storage resource. That is, the usable total amount of the private storage resource at this time is 160 TB.
In addition, step 101 may also be implemented through manual management, that is, the administrator manually allocates the resources in the memory. At this time, the implementation of step 101 may be: upon initialization of the deep learning training platform, the maximum storage capacity of the memory is determined. A capacity prompt message is displayed, wherein the capacity prompt message carries the maximum storage capacity and is used for instructing the administrator to divide the memory into the public storage resource and the private storage resource according to the maximum storage capacity, so as to obtain the theoretical maximum storage capacity of the public storage resource and the theoretical maximum storage capacity of the private storage resource. A second reporting instruction triggered by the administrator is then detected, wherein the second reporting instruction carries the theoretical maximum storage capacity of the public storage resource and the theoretical maximum storage capacity of the private storage resource, and the theoretical maximum storage capacity of the private storage resource is used as the usable total amount of the private storage resource.
For example, the memory in the deep learning training platform has a total storage capacity of 200 TB (terabytes). In this case, a capacity prompt message may be displayed in the display interface, and the capacity prompt message may indicate that the memory has a total storage capacity of 200 TB. The administrator may then divide 40 TB into the public storage resource and 160 TB into the private storage resource according to the capacity prompt message.
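The storage division described in the two examples above can be summarized by the following minimal sketch (the function name, field names, and the use of Python are illustrative assumptions and are not part of the original disclosure):

```python
# A minimal sketch, assuming capacities are expressed in TB: the memory is divided
# into a public storage resource and a private storage resource, and the private
# portion becomes the usable total amount of the reference resource in step 101.

def split_storage(total_capacity_tb: float, public_capacity_tb: float) -> dict:
    if public_capacity_tb > total_capacity_tb:
        raise ValueError("public portion cannot exceed the total storage capacity")
    private_capacity_tb = total_capacity_tb - public_capacity_tb
    return {
        "public_tb": public_capacity_tb,    # stores public data sets
        "private_tb": private_capacity_tb,  # allocated to individual users
    }

# Example from the description: 200 TB in total, 40 TB public, 160 TB private.
print(split_storage(200, 40))  # {'public_tb': 40, 'private_tb': 160}
```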
Example two: for the reference resource of the GPU type, since the GPU is used to process data, managing such a reference resource means managing its usage duration. That is, the usable total amount of the reference resource is used to indicate the usable duration of the GPU. The usage duration of the GPU is a variable that changes and accumulates with time. Thus, when the reference resource in step 101 is a GPU, the management of the reference resource may be performed periodically.
The usage duration may be measured in card-hours: each time a user uses one GPU for 1 hour, the user's usage duration on the GPU is counted as 1 card-hour. Of course, the usage duration may also be measured in other units, which is not specifically limited in the embodiments of the present application.
Further, since the usage time of the GPU is periodically updated, in order to achieve reasonable management of the GPU, the GPU may be managed in combination with the historical usage situation of the GPU each time the GPU is managed, so as to improve the subsequent utilization rate of the GPU. Thus, in one possible implementation, step 101 may specifically be: and determining the historical total use time length of the GPU in a first reference time period which is before the current time and is nearest to the current time, wherein the first reference time period is the time length between two adjacent updating time points, and the historical total use time length is used as the usable total amount of the GPU.
The first reference time period is a time period configured by a background manager, and is used for indicating a period for managing the reference resources such as the GPU. For example, the first reference time period is 1 month, which indicates that the reference resources are managed by the resource management method provided by the application every 1 month. For another example, the first reference time period is half a year, and the reference resources are managed every half a year through the resource management method provided by the application.
That is, in the embodiment of the present application, the reference resources such as GPU may be periodically reallocated to achieve dynamic management of the reference resources. When the reference resources are reassigned each time, the reference resources in the next period can be reassigned based on the use condition of the reference resources in the previous period, so that the reassigned reference resources more meet the requirements of users, and the utilization rate of the reference resources is improved.
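As an illustration of this periodic update (the record format below is a hypothetical assumption, not part of the original disclosure), the usable total amount for the next period can be obtained by summing the usage recorded in the first reference time period that has just ended:

```python
# A minimal sketch: usage_records is assumed to be an iterable of
# (timestamp, card_hours) pairs logged for the GPUs; the sum over the previous
# first reference time period is taken as the usable total amount of the GPU.

def usable_total_from_history(usage_records, period_start, period_end):
    return sum(card_hours
               for timestamp, card_hours in usage_records
               if period_start <= timestamp < period_end)

# Example: two of the three usage records fall inside the period [0, 720) hours.
print(usable_total_from_history([(10, 5.0), (300, 12.5), (800, 3.0)], 0, 720))  # 17.5
```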
In the scenario of example two, if the deep learning training platform is currently being initialized, there is no relevant data from a previous cycle. In this case, the usable total amount of the reference resource needs to be determined in another way.
In one possible implementation, if the current time is the time when the deep learning training platform is initialized, acquiring the type of each GPU and the theoretical use duration of each GPU in the GPUs; determining the weight of each GPU according to the type of each GPU; and determining the theoretical total use time length of the GPUs according to the weight of each GPU in the GPUs and the theoretical use time length of each GPU, and taking the theoretical total use time length as the usable total amount of the GPUs.
Because the different types of GPUs have different computing capabilities, the amount of data actually processed when the different types of GPUs use the same time period is also different. To further rationally manage the class of reference resources of GPUs, the weight of each GPU may be configured according to the type of GPU. Wherein the higher the computing power, the greater the weight of the GPU configuration.
For example, 3 GPUs are deployed in the deep learning training platform, labeled as a first GPU, a second GPU, and a third GPU, respectively. The three GPUs are of different types; according to its type, the first GPU is configured with a weight of 1, the second GPU with a weight of 0.8, and the third GPU with a weight of 1.2. If the theoretical usage duration of each of the first GPU, the second GPU and the third GPU is 24×30 card-hours (the theoretical usage duration of one month), the theoretical total usage duration of the GPUs as the reference resource is (24×30×1 + 24×30×0.8 + 24×30×1.2) card-hours.
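The weighted total in this example can be computed as in the following minimal sketch (the variable names are illustrative assumptions):

```python
# A minimal sketch of the weighted theoretical total usage duration described above.
HOURS_PER_MONTH = 24 * 30  # theoretical usage duration of one month, in card-hours

gpu_weights = {"first_gpu": 1.0, "second_gpu": 0.8, "third_gpu": 1.2}

theoretical_total = sum(weight * HOURS_PER_MONTH for weight in gpu_weights.values())
print(theoretical_total)  # 2160.0 card-hours, taken as the usable total amount of the GPUs
```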
The above determination of the weight of each GPU according to the type of each GPU may be specified manually. Thus, in one possible implementation, determining the weight of each GPU according to its type may be: and displaying a weight configuration message, wherein the weight configuration message carries the type of each GPU in the GPUs and is used for prompting an administrator to configure the weight of each GPU based on the type of each GPU. Detecting a third reporting instruction triggered by the administrator, wherein the third reporting instruction carries the weight of each GPU in the GPUs. At this time, the administrator can configure different weights for each GPU according to the type of each GPU by himself, so that flexibility of determining the weights of each GPU is improved.
In addition, at the time of initialization of the deep learning training platform, the total amount of available GPUs can be specified by an administrator as well. Thus, in one possible implementation, determining the total amount of available reference resources deployed in the deep learning training platform may be: if the current time is the time when the deep learning training platform is initialized, displaying a GPU allocation prompt message for prompting an administrator to acquire the theoretical total use duration of the GPU; detecting a fourth reporting instruction triggered by an administrator, wherein the fourth reporting instruction carries the theoretical total using time length of the GPU, and the theoretical total using time length is used as the usable total amount of the GPU. The administrator acquires the theoretical total use duration of the GPU in a manual mode.
Step 102: a usage threshold for a plurality of users is determined based on the total amount of available usage, the usage threshold being used to indicate an amount of reference resources that can be used by tasks submitted by the users.
If the deep learning training platform in step 101 is a deep learning training platform deployed in an enterprise, different users in the enterprise typically belong to different departments, and the resource demands of different departments are generally not the same. Therefore, in the embodiment of the present application, the plurality of users may be divided into a plurality of user groups in advance, and each of the plurality of user groups is configured with an allocation proportion. At this time, the implementation of step 102 may be: determining the usable amount of each user group in the plurality of user groups according to the usable total amount and the allocation proportion of each user group in the plurality of user groups; for a first user group in the plurality of user groups, determining the available amount of each user included in the first user group according to the available amount of the first user group and the users included in the first user group, and determining the usage threshold of each user according to the available amount of each user, wherein the first user group is any one of the plurality of user groups.
The implementation manner of determining the available amount of each user in the users included in the first user group according to the available amount of the first user group and the users included in the first user group may be: dividing the usable amount of the first user group by the number of the users included in the first user group, and taking the numerical value after dividing as the usable amount of each user.
For example, the reference resource is a private storage resource, and currently there are three user groups, respectively labeled as user group a, user group b, and user group c. When the total usable amount of the private storage resources is 160TB, according to the allocation ratio of each user group, it is determined that the usable amount of the user group a is 60TB, the usable amount of the user group b is 50TB, and the usable amount of the user group c is 50TB. The group leader of each user group can distribute the usable quantity of the user group to each user in the user group to obtain the usable quantity of each user, and then the threshold value of the usable quantity of each user is determined.
One implementation of determining the usage threshold of each user according to the available amount of each user is to directly use the available amount of each user as the usage threshold of that user.
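The allocation from the usable total amount down to per-user usage thresholds can be sketched as follows (the allocation proportions and user names are illustrative assumptions chosen to reproduce the 60/50/50 TB example above):

```python
# A minimal sketch of step 102: usable total -> per-group usable amount via the
# configured allocation proportion -> per-user usage threshold by an even split
# inside each group (the available amount is used directly as the threshold).

def per_user_thresholds(usable_total, group_proportions, group_members):
    thresholds = {}
    for group, proportion in group_proportions.items():
        group_amount = usable_total * proportion          # usable amount of the user group
        per_user = group_amount / len(group_members[group])
        for user in group_members[group]:
            thresholds[user] = per_user
    return thresholds

print(per_user_thresholds(
    160,                                                  # 160 TB private storage resource
    {"a": 0.375, "b": 0.3125, "c": 0.3125},               # allocation proportions (assumed)
    {"a": ["u1", "u2"], "b": ["u3"], "c": ["u4", "u5"]},
))
```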
Alternatively, in the embodiments of the present application, the usage thresholds of multiple users may be configured by an administrator. Thus, in one possible implementation, the implementation of step 102 may also be: displaying a resource allocation prompt message, wherein the resource allocation prompt message carries the total usable amount and/or task resource requirements of each user and is used for indicating an administrator to allocate reference resources according to the total usable amount and/or task resource requirements of each user so as to obtain usage amount thresholds of a plurality of users; and detecting a first reporting instruction triggered by an administrator, wherein the first reporting instruction carries the usage amount threshold values of a plurality of users.
That is, when the resource allocation prompt information is displayed in the display interface of the resource management terminal, the administrator may allocate the reference resource according to the resource allocation prompt information. For example, when the resource allocation prompt information indicates that the usable total amount of the private storage resource is 160 TB, the administrator may allocate 60 TB of the private storage resource to the group leader of user group a, 50 TB to the group leader of user group b, and 50 TB to the group leader of user group c. Each group leader then further allocates the private storage resource to each user in the group according to the situation of each user in the group. After a group leader has allocated the private storage resource to each user in the group, the amount allocated to each user can be reported to the administrator, so that the administrator obtains the usage thresholds of the plurality of users.
The task resource requirements of each user described above may be used to indicate the amount of reference resources that each user needs. For example, the storage capacity of the private storage resource required for each user, or the usage period of the GPU required for each user, or the like.
In addition, for the GPU, since the reference resource is periodically reallocated, before the available amount of each user is used as the usage threshold of that user, the theoretical usable total amount of the reference resource can be determined, and the difference between the historical usable total amount and the theoretical usable total amount can be determined. If the difference is greater than a difference threshold, it indicates that the reference resource is currently in a resource-rich state, and the available amount of each user can be directly used as the usage threshold of each user.
Correspondingly, if the difference is smaller than the difference threshold, it indicates that the reference resource is currently in a resource-shortage state. In this case, to alleviate the shortage of the reference resource, for a first user in the first user group, if the total usage of the reference resource by the first user in the first reference time period before and closest to the current time exceeds the usage threshold configured for the first user before the current time, the usage threshold of the first user is determined according to the exceeded usage and the available amount of the first user, where the first user is any user included in the first user group.
One implementation of determining the usage threshold of the first user according to the exceeded usage and the available amount of the first user is to subtract the exceeded usage from the available amount of the first user and use the result as the usage threshold of the first user.
That is, when the reference resource is currently in short supply, if a user's usage of the reference resource exceeded the threshold in the previous period, the exceeded usage can be deducted when the reference resource is reallocated. In this way, the first user's usage of the reference resource can be reduced, and the resource shortage of the reference resource is alleviated.
In addition, after determining the available amount of each user in the users included in the first user group, the available amount of each user can be directly used as a usage amount threshold of each user, the available amount of each user can be multiplied by a reference proportion, and the multiplied value is used as the usage amount threshold of each user, so that the flexibility of allocating the reference resources is further improved. The reference ratio is a configured ratio, and may be 90%, 95%, or the like.
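Taken together, the over-usage deduction and the reference proportion can be sketched as follows (the function and parameter names are illustrative assumptions, not part of the original disclosure):

```python
# A minimal sketch: when the reference resource is short, the usage exceeded in the
# previous period is deducted from the user's available amount; the result may then
# be scaled by a reference proportion such as 90% to obtain the usage threshold.

def adjusted_threshold(available_amount, last_period_usage, previous_threshold,
                       resource_short, reference_ratio=1.0):
    over_usage = max(0.0, last_period_usage - previous_threshold)
    if resource_short and over_usage > 0:
        available_amount = max(0.0, available_amount - over_usage)
    return available_amount * reference_ratio

# A user who exceeded last period's threshold by 5 card-hours, with a 90% reference ratio.
print(adjusted_threshold(100, 105, 100, resource_short=True, reference_ratio=0.9))  # 85.5
```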
Through the steps 101 to 102, personalized allocation of any reference resource in the deep learning training platform can be realized, so that reasonable management of the resource of the deep learning training platform is realized.
In addition, for the GPU, after the reference resources are allocated in step 101 and step 102, the time-dependent change of the GPU utilization rate in the second reference time period before the current time and closest to the current time can be obtained; determining a recommended time period according to the change condition of the utilization rate of the GPU along with time, wherein the utilization rate of the GPU in the recommended time period is lower than the utilization rate in other time periods in the second reference time period; and sending recommendation information to a plurality of users, wherein the recommendation information carries a recommendation time period and is used for indicating the plurality of users to submit tasks needing to call the GPU in the recommendation time period. Through the process, the time distribution condition according to the utilization rate of the reference resources can be realized, and the utilization rate of the reference resources is further improved.
For example, for all GPUs deployed in the deep learning training platform, the usage rate of all GPUs is counted on an hourly basis, and a usage rate curve for each month is obtained, where the usage rate curve indicates how the usage rate of the GPUs changes over time. Based on the usage rate curve, the time period with the lowest usage rate is determined. Users are then recommended to submit tasks in this lowest-usage time period, so as to improve the overall utilization rate of the GPUs.
The second reference period may or may not be the same period as the first reference period. For example, the first reference period may be 1 month, and the second reference period may be 1 month or half year.
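One way to derive the recommended time period from the hourly usage-rate curve is sketched below (the sample data and the choice of hour-of-day granularity are illustrative assumptions):

```python
# A minimal sketch: average the GPU usage rate per hour of day over the second
# reference time period and recommend the hour with the lowest average usage rate.
from collections import defaultdict

def recommended_hour(utilization_samples):
    """utilization_samples: iterable of (hour_of_day, usage_rate_percent) pairs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for hour, usage_rate in utilization_samples:
        totals[hour] += usage_rate
        counts[hour] += 1
    return min(totals, key=lambda hour: totals[hour] / counts[hour])

samples = [(2, 10.0), (2, 12.0), (14, 85.0), (14, 90.0), (20, 60.0)]
print(recommended_hour(samples))  # 2, i.e. recommend submitting GPU tasks around 02:00
```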
In an alternative embodiment of the present application, after a user submits a task, the neural network model in the task is executed, and the resources required by the submitted task are the resources required to run the neural network model. Assuming that the neural network model is to be trained to recognize ten thousand face sample pictures, the resources required by the submitted task are the resources required to perform face recognition training with the neural network model, and may include GPU resources, cache resources, and the like.
In addition, for the reference resource of the memory type, after the private storage resource is allocated through steps 101 and 102, the data in the private storage resource can be cleaned automatically. For example, the cleaning period may be set to 6 months; each time the cleaning time point is reached, the usage frequency of each piece of data in the private storage resource is obtained, and data whose usage frequency is below a frequency threshold is screened out. Such data can be cleaned directly, or the user corresponding to the data can be determined and a cleaning prompt message sent to that user to remind the user to clean the data manually.
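The frequency-based screening described above can be sketched as follows (the record layout is a hypothetical assumption):

```python
# A minimal sketch: at each cleaning time point, data whose usage frequency falls
# below the frequency threshold is either deleted directly or reported to its owner
# through a cleaning prompt message.

def select_stale_data(records, frequency_threshold):
    """records: iterable of dicts like {'path': ..., 'owner': ..., 'use_frequency': ...}."""
    stale = [r for r in records if r["use_frequency"] < frequency_threshold]
    paths_to_clean = [r["path"] for r in stale]
    owners_to_notify = {r["owner"] for r in stale}  # users to remind for manual cleanup
    return paths_to_clean, owners_to_notify
```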
In addition, if the remaining storage capacity in the memory is 0, that is, the memory is currently full of data, the memory can be expanded. The expanded storage capacity can likewise be allocated through steps 101 to 102.
Steps 101 to 102 are described above for explaining how to allocate reference resources in the deep learning training platform. After the reference resources are allocated, tasks submitted by users can be managed according to the allocated reference resources, so that the management of the reference resources is realized. The following steps 103 to 104 are used to explain the process.
Step 103: for a first user of the plurality of users, when a first task submitted by the first user is received, the historical usage amount of the task submitted by the first user on the reference resource is obtained, the first task is a task needing to call the reference resource, and the first user is any user of the plurality of users.
The usage threshold mentioned below is the usage threshold determined in step 102.
In addition, for this type of reference resource, memory, the historical usage is used to indicate the storage capacity already occupied in the private storage space allocated for the first user. For a reference resource of the GPU type, the historical usage is used to indicate a historical usage period on the reference resource from a last update of the usage threshold of the first user prior to the current time to a task that the first user has submitted.
Step 104: if the historical usage is greater than or equal to the usage threshold of the first user, generating and displaying prompt information, wherein the prompt information is used for indicating that the first task cannot be executed currently.
For example, the reference resource is the private storage resource, and the usage threshold configured for the first user in step 102 is 10 TB. If the historical usage has already reached 10 TB, it indicates that there is currently no storage space left to execute the task, and therefore the user can be prompted through step 104.
For another example, the reference resource is the GPU, and the usage threshold configured for the first user in step 102 is 20 card-hours. If the historical usage has already reached 20 card-hours, it indicates that there is currently no usage duration left for executing the first task, and therefore the user can be prompted through step 104.
Accordingly, if the historical usage is less than the usage threshold of the first user, the first task may be executed. In addition, if the historical usage is greater than or equal to the usage threshold of the first user, after the prompt information is generated and displayed, the usage threshold may be adjusted upward in order to ensure smooth execution of the first task. In one possible implementation: a usage threshold up-adjustment request sent by the first user is received; if the reference resource has remaining usage at the current time, the usage threshold of the first user is adjusted according to the usage threshold up-adjustment request so that the historical usage is smaller than the adjusted usage threshold; and the first task is executed. Accordingly, if the reference resource has no remaining usage at the current time, it indicates that the usage threshold of the first user cannot currently be adjusted upward, and an up-adjustment failure message may be returned to the first user.
The adjustment of the usage threshold of the first user according to the usage threshold up-adjustment request may be performed, through a preset operation, by the group leader of the user group to which the first user belongs. In another possible implementation, after the prompt information is generated and displayed, an administrator-triggered usage threshold up-adjustment instruction may also be detected, where the usage threshold up-adjustment instruction carries a usage threshold reassigned by the administrator to the first user; the first task is then executed. That is, the usage threshold of the first user may be adjusted manually by the administrator, which is not described in detail here.
In addition, for the first user, during the process in which the first user submits tasks and executes them through the deep learning training platform, the usage of the reference resource by the tasks submitted by the first user can be counted in real time. If the usage is close to the usage threshold, the first user can send a usage threshold up-adjustment request in advance so as to adjust the usage threshold of the first user upward.
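The submission check of steps 103 to 104 and the up-adjustment path described above can be sketched as follows (the return values and parameter names are illustrative assumptions):

```python
# A minimal sketch: compare the first user's historical usage with the usage
# threshold, prompt when the first task cannot be executed, and adjust the threshold
# upward only if the reference resource still has remaining usage at the current time.

def try_submit_task(historical_usage, usage_threshold, remaining_total, requested_increase):
    if historical_usage < usage_threshold:
        return "execute the first task"
    # Historical usage >= threshold: generate and display the prompt information.
    if remaining_total >= requested_increase:
        usage_threshold += requested_increase  # so that historical usage < adjusted threshold
        return "usage threshold adjusted, execute the first task"
    return "up-adjustment failed, the first task cannot be executed currently"

print(try_submit_task(20, 20, remaining_total=5, requested_increase=3))
```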
In the embodiment of the application, the usable total amount of the reference resource deployed in the deep learning training platform is determined, and the usage threshold of each of the plurality of users is determined according to the usable total amount. The usage threshold indicates the amount of the reference resource that tasks submitted by the corresponding user may use, so the resource management method can allocate the reference resource on a per-user basis, allowing users to subsequently execute their submitted tasks with the allocated resources and improving the utilization rate of the reference resource.
Fig. 2 is a block diagram of an apparatus for managing resources according to an embodiment of the present application. As shown in fig. 2, the apparatus includes:
a first determining module 201, configured to determine a total amount of available reference resources deployed in the deep learning training platform;
a second determining module 202 is configured to determine usage thresholds of the plurality of users according to the total available usage, where the usage thresholds are used to indicate the amount of reference resources that can be used by the task submitted by the user.
Optionally, the plurality of users are divided into a plurality of user groups, and each user group in the plurality of user groups is configured with an allocation proportion;
the second determining module is specifically configured to:
determining the usable amount of each user group in the plurality of user groups according to the usable total amount and the allocation proportion of each user group in the plurality of user groups;
for a first user group in the plurality of user groups, determining the available amount of each user in the first user group according to the available amount of the first user group and the users included in the first user group, wherein the first user group is any one of the plurality of user groups;
a usage threshold for each user is determined based on the available usage for each user.
Optionally, the second determining module is specifically configured to:
displaying a resource allocation prompt message, wherein the resource allocation prompt message carries the total usable amount and/or task resource requirements of each user and is used for indicating an administrator to allocate reference resources according to the total usable amount and/or task resource requirements of each user so as to obtain usage amount thresholds of a plurality of users;
And detecting a first reporting instruction triggered by an administrator, wherein the first reporting instruction carries the usage amount threshold values of a plurality of users.
Optionally, the memory deployed on the deep learning training platform includes a public storage resource and a private storage resource, the reference resource is the private storage resource in the memory, and the usable total amount of the reference resource is used to indicate the usable storage capacity of the private storage resource.
Optionally, the first determining module is specifically configured to:
determining the maximum storage capacity of a memory when the deep learning training platform is initialized;
displaying a capacity prompt message, wherein the capacity prompt message carries a maximum storage capacity and is used for indicating an administrator to divide a memory into public storage resources and private storage resources according to the maximum storage capacity so as to obtain a theoretical maximum storage capacity of the public storage resources and a theoretical maximum storage capacity of the private storage resources;
detecting a second reporting instruction triggered by an administrator, wherein the second reporting instruction carries the theoretical maximum storage capacity of the public storage resource and the theoretical maximum storage capacity of the private storage resource, and the theoretical maximum storage capacity of the private storage resource is used as the usable total amount of the private storage resource.
Optionally, the reference resource is a graphics processing unit (GPU) deployed on the deep learning training platform, and the usable total amount of the reference resource is used to indicate the usable duration of the GPU.
Optionally, the first determining module is specifically configured to:
if the current time is an update time point, determining the historical total use time length of the GPU in a first reference time period which is before the current time and is nearest to the current time, wherein the first reference time period is the time length between two adjacent update time points, and taking the historical total use time length as the available total amount of the GPU.
Optionally, the first determining module is specifically configured to:
if the current time is the time when the deep learning training platform is initialized, acquiring the type of each GPU and the theoretical use duration of each GPU in the GPUs;
determining the weight of each GPU according to the type of each GPU;
and determining the theoretical total use time length of the GPUs according to the weight of each GPU in the GPUs and the theoretical use time length of each GPU, and taking the theoretical total use time length as the usable total amount of the GPUs.
Optionally, the first determining module is further specifically configured to:
displaying a weight configuration message, wherein the weight configuration message carries the type of each GPU in the GPUs and is used for prompting an administrator to configure the weight of each GPU based on the type of each GPU;
detecting a third reporting instruction triggered by the administrator, wherein the third reporting instruction carries the weight of each GPU in the GPUs.
Optionally, the first determining module is specifically configured to:
if the current time is the time when the deep learning training platform is initialized, displaying a GPU allocation prompt message for prompting an administrator to acquire the theoretical total use duration of the GPU;
detecting a fourth reporting instruction triggered by an administrator, wherein the fourth reporting instruction carries the theoretical total using time length of the GPU, and the theoretical total using time length is used as the usable total amount of the GPU.
Optionally, the apparatus further comprises:
the first acquisition module is used for acquiring the change condition of the utilization rate of the GPU along with time in a second reference time period which is before the current time and is nearest to the current time;
the third determining module is used for determining a recommended time period according to the change condition of the utilization rate of the GPU along with time, wherein the utilization rate of the GPU in the recommended time period is lower than that in other time periods in the second reference time period;
the sending module is used for sending recommendation information to a plurality of users, wherein the recommendation information carries a recommendation time period and is used for indicating the plurality of users to submit tasks needing to call the GPU in the recommendation time period.
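A non-limiting sketch of how the recommended time period might be derived is given below; the hourly sampling, the window length, and the function name are assumptions introduced only for illustration, not the claimed procedure. The idea is to select the contiguous window with the lowest average GPU utilization within the second reference time period:

```python
def recommend_window(hourly_utilization: list[float], window_hours: int = 4) -> tuple[int, int]:
    """Return (start_hour, end_hour) of the window with the lowest average GPU
    utilization over the sampled period; tasks submitted in this window are
    least likely to contend for the GPU."""
    if window_hours > len(hourly_utilization):
        raise ValueError("window longer than the sampled period")
    best_start, best_avg = 0, float("inf")
    for start in range(len(hourly_utilization) - window_hours + 1):
        avg = sum(hourly_utilization[start:start + window_hours]) / window_hours
        if avg < best_avg:
            best_start, best_avg = start, avg
    return best_start, best_start + window_hours

# Example: 24 hourly utilization samples (percent) from the last reference period.
samples = [80, 85, 90, 88, 70, 40, 20, 15, 10, 12, 30, 55,
           60, 75, 85, 90, 92, 88, 70, 50, 35, 25, 20, 30]
print(recommend_window(samples))  # (6, 10): roughly 06:00-10:00 is the quietest window
```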
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring the historical usage amount of the task submitted by the first user on the reference resource when receiving the first task submitted by the first user, wherein the first task is a task needing to call the reference resource, and the first user is any user of the plurality of users;
the generation module is used for generating and displaying prompt information if the historical usage amount is greater than or equal to the usage amount threshold of the first user, wherein the prompt information is used for indicating that the first task cannot be executed currently.
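The following non-limiting sketch illustrates the submission-time check performed by the second acquisition module and the generation module; the in-memory dictionaries and the function name are assumptions introduced only for illustration:

```python
def check_submission(user_id: str,
                     historical_usage: dict[str, float],
                     usage_threshold: dict[str, float]) -> str:
    """Return an acceptance or rejection prompt for a task that needs the reference resource."""
    used = historical_usage.get(user_id, 0.0)
    limit = usage_threshold[user_id]
    if used >= limit:
        # Corresponds to generating and displaying the prompt information.
        return f"task rejected: user {user_id} has used {used} of {limit} allotted units"
    return f"task accepted: user {user_id} has {limit - used} units remaining"

print(check_submission("alice", {"alice": 120.0}, {"alice": 100.0}))
print(check_submission("bob", {"bob": 10.0}, {"bob": 100.0}))
```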
Optionally, the apparatus further comprises:
the receiving module is used for receiving a usage threshold up-regulation request sent by a first user;
the adjusting module is used for adjusting the usage threshold of the first user according to the usage threshold up-regulation request if the reference resource has residual usage at the current time, so that the historical usage is smaller than the adjusted usage threshold;
and the execution module is used for executing the first task.
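A non-limiting sketch of this up-regulation path is given below; the function name, the unit of usage, and the exact granting rule are assumptions introduced only for illustration. The threshold is raised only when the reference resource still has residual usage at the current time, and only to a value above the user's historical usage:

```python
def adjust_threshold(historical_usage: float,
                     current_threshold: float,
                     remaining_total: float,
                     requested_threshold: float) -> float:
    """Grant a usage-threshold up-regulation request when remaining capacity allows it.

    Returns the new threshold; raises if the request cannot be satisfied."""
    if remaining_total <= 0:
        raise RuntimeError("no residual usage of the reference resource at the current time")
    extra_needed = requested_threshold - current_threshold
    if extra_needed <= 0 or requested_threshold <= historical_usage:
        raise ValueError("requested threshold must exceed both the current threshold and the historical usage")
    if extra_needed > remaining_total:
        raise RuntimeError("request exceeds the residual usage of the reference resource")
    return requested_threshold

# Example: the user has used 100 GPU-hours against a 100 GPU-hour threshold;
# 50 GPU-hours of the total remain unallocated, so a raise to 120 is granted.
print(adjust_threshold(historical_usage=100, current_threshold=100,
                       remaining_total=50, requested_threshold=120))  # 120
```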
Optionally, the apparatus further comprises:
the detection module is used for detecting a usage threshold up-regulation instruction triggered by an administrator, wherein the usage threshold up-regulation instruction carries a usage threshold reallocated by the administrator to the first user;
and the execution module is used for executing the first task.
In the embodiment of the application, the usable total amount of the reference resources deployed in the deep learning training platform is determined, and the usage thresholds of a plurality of users are determined according to the usable total amount. Because the usage threshold indicates the amount of the reference resource that the tasks submitted by a user can use, the method for managing resources allocates the reference resource on a per-user basis, so that each user can subsequently execute the submitted tasks with the allocated resources, thereby improving the utilization rate of the reference resource.
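For concreteness, the following non-limiting sketch shows one way the usage thresholds could be derived from the usable total amount when the users are organized into user groups with allocation ratios (as described in the claims below); the even split within each group and all names are assumptions introduced only for illustration:

```python
def usage_thresholds(total: float,
                     group_ratios: dict[str, float],
                     group_members: dict[str, list[str]]) -> dict[str, float]:
    """Split the usable total across user groups by allocation ratio,
    then divide each group's share evenly among its users."""
    thresholds: dict[str, float] = {}
    for group, ratio in group_ratios.items():
        group_share = total * ratio
        members = group_members[group]
        for user in members:
            thresholds[user] = group_share / len(members)
    return thresholds

# Example: 1000 GPU-hours, 60% to the "vision" group, 40% to the "speech" group.
print(usage_thresholds(
    1000,
    {"vision": 0.6, "speech": 0.4},
    {"vision": ["alice", "bob"], "speech": ["carol"]},
))  # {'alice': 300.0, 'bob': 300.0, 'carol': 400.0}
```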
When the device for managing resources provided in the above embodiment manages resources, the division into the above functional modules is merely illustrative. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device for managing resources provided in the above embodiment and the method embodiments for managing resources belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.
In the embodiment of the present application, the deep learning task may include a training task of a neural network model, or may include a task of performing various functions such as target recognition (face recognition, human body recognition, vehicle recognition, license plate recognition, etc.), behavior recognition, target tracking, and speech recognition by using a neural network, or may include other tasks related to deep learning, which is not limited herein.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
The computer device 300 includes a Central Processing Unit (CPU) 301, a system memory 304 including a Random Access Memory (RAM) 302 and a Read Only Memory (ROM) 303, and a system bus 305 connecting the system memory 304 and the central processing unit 301. Computer device 300 also includes a basic input/output system (I/O system) 306, which facilitates the transfer of information between various devices within the computer, and a mass storage device 307 for storing an operating system 313, application programs 314, and other program modules 315.
The basic input/output system 306 includes a display 308 for displaying information and an input device 309, such as a mouse or a keyboard, for the user to input information. Both the display 308 and the input device 309 are connected to the central processing unit 301 through an input/output controller 310 that is connected to the system bus 305. The basic input/output system 306 may also include an input/output controller 310 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 310 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 307 is connected to the central processing unit 301 through a mass storage controller (not shown) connected to the system bus 305. The mass storage device 307 and its associated computer-readable media provide non-volatile storage for the computer device 300. That is, the mass storage device 307 may include a computer readable medium (not shown) such as a hard disk or CD-ROM drive.
Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 304 and mass storage device 307 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 300 may also operate through a remote computer connected over a network, such as the Internet. That is, the computer device 300 may be connected to the network 312 through a network interface unit 311 connected to the system bus 305, or the network interface unit 311 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes one or more programs, which are stored in the memory and configured to be executed by the CPU. The one or more programs include instructions for performing the method of managing resources provided by the embodiments of the present application.
The present application also provides a non-transitory computer-readable storage medium storing instructions which, when executed by a processor of a computer device, enable the computer device to perform the method of managing resources provided by the above embodiments.
The present embodiments also provide a computer program product containing instructions which, when run on a computer device, cause the computer device to perform the method of managing resources provided by the above embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing descriptions are merely preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (14)

1. A method of managing resources, the method comprising:
determining the usable total amount of reference resources deployed in a deep learning training platform, wherein the reference resources are graphics processing units (GPUs) deployed on the deep learning training platform, and the usable total amount of the reference resources is used for indicating the usable duration of the GPU;
determining a usage threshold of a plurality of users according to the total usable amount, wherein the usage threshold is used for indicating the amount of the reference resource which can be used by the task submitted by the user;
acquiring the variation of the utilization rate of the GPU over time in a second reference time period which is before the current time and is nearest to the current time;
determining a recommended time period according to the variation of the utilization rate of the GPU over time, wherein the utilization rate of the GPU in the recommended time period is lower than the utilization rate in other time periods in the second reference time period;
and sending recommendation information to the plurality of users, wherein the recommendation information carries the recommendation time period and is used for indicating the plurality of users to submit tasks needing to call the GPU in the recommendation time period.
2. The method of claim 1, wherein the plurality of users are divided into a plurality of user groups, each of the plurality of user groups being configured with an allocation ratio;
the determining the usage threshold of the plurality of users according to the total usable amount comprises the following steps:
determining the usable amount of each user group in the plurality of user groups according to the usable total amount and the allocation proportion of each user group in the plurality of user groups;
for a first user group in the plurality of user groups, determining the usable amount of each user in the first user group according to the usable amount of the first user group and the users included in the first user group, wherein the first user group is any one of the plurality of user groups;
and determining the usage threshold of each user according to the usable amount of each user.
3. The method of claim 1, wherein said determining usage thresholds for a plurality of users based on said total amount of available usage comprises:
displaying a resource allocation prompt message, wherein the resource allocation prompt message carries the total usable amount and/or task resource requirements of each user and is used for indicating an administrator to allocate the reference resource according to the total usable amount and/or task resource requirements of each user so as to obtain usage thresholds of the plurality of users;
and detecting a first reporting instruction triggered by the administrator, wherein the first reporting instruction carries the usage amount threshold values of the plurality of users.
4. The method of claim 1, wherein the memory deployed on the deep learning training platform comprises a public storage resource and a private storage resource, the reference resource is the private storage resource in the memory, and the usable total amount of the reference resource is used to indicate the usable storage capacity of the private storage resource.
5. The method of claim 4, wherein determining the total amount of available reference resources deployed in the deep learning training platform comprises:
determining the maximum storage capacity of the memory when the deep learning training platform is initialized;
displaying a capacity prompt message, wherein the capacity prompt message carries the maximum storage capacity and is used for indicating an administrator to divide the storage into the public storage resource and the private storage resource according to the maximum storage capacity so as to obtain the theoretical maximum storage capacity of the public storage resource and the theoretical maximum storage capacity of the private storage resource;
detecting a second reporting instruction triggered by the administrator, wherein the second reporting instruction carries the theoretical maximum storage capacity of the public storage resource and the theoretical maximum storage capacity of the private storage resource, and the theoretical maximum storage capacity of the private storage resource is used as the usable total amount of the private storage resource.
6. The method of claim 1, wherein the determining the total amount of available reference resources deployed in the deep learning training platform comprises:
if the current time is an update time point, determining a historical total use time length of the GPU in a first reference time period which is before the current time and is nearest to the current time, wherein the first reference time period is a time length between two adjacent update time points, and taking the historical total use time length as the usable total amount of the GPU.
7. The method of claim 1, wherein the determining the total amount of available reference resources deployed in the deep learning training platform comprises:
if the current time is the time when the deep learning training platform is initialized, acquiring the type of each GPU and the theoretical use duration of each GPU in the GPUs;
determining the weight of each GPU according to the type of each GPU;
and determining the theoretical total use duration of the GPUs according to the weight of each GPU in the GPUs and the theoretical use duration of each GPU, and taking the theoretical total use duration as the usable total amount of the GPUs.
8. The method of claim 7, wherein the determining the weight of each GPU according to the type of each GPU comprises:
displaying a weight configuration message, wherein the weight configuration message carries the type of each GPU in the GPUs and is used for prompting an administrator to configure the weight of each GPU based on the type of each GPU;
detecting a third reporting instruction triggered by the administrator, wherein the third reporting instruction carries the weight of each GPU in the GPUs.
9. The method of claim 1, wherein the determining the total amount of available reference resources deployed in the deep learning training platform comprises:
if the current time is the time when the deep learning training platform is initialized, displaying a GPU allocation prompt message for prompting an administrator to acquire the theoretical total use duration of the GPU;
detecting a fourth reporting instruction triggered by the administrator, wherein the fourth reporting instruction carries the theoretical total using time length of the GPU, and the theoretical total using time length is used as the usable total amount of the GPU.
10. The method of claim 1, wherein the method further comprises:
for a first user of the plurality of users, when a first task submitted by the first user is received, acquiring the historical usage amount of the task submitted by the first user on the reference resource, wherein the first task is a task needing to call the reference resource, and the first user is any user of the plurality of users;
and if the historical usage amount is greater than or equal to the usage amount threshold of the first user, generating and displaying prompt information, wherein the prompt information is used for indicating that the first task cannot be executed currently.
11. The method of claim 10, wherein after the prompt information is generated and displayed, the method further comprises:
receiving a usage threshold up-regulation request sent by the first user;
if the reference resource has residual usage at the current time, adjusting the usage threshold of the first user according to the usage threshold up-regulation request so that the historical usage is smaller than the adjusted usage threshold;
and executing the first task.
12. The method of claim 10, wherein after the prompt information is generated and displayed, the method further comprises:
detecting an administrator-triggered usage threshold up-regulation instruction, wherein the usage threshold up-regulation instruction carries a usage threshold reallocated by the administrator to the first user;
and executing the first task.
13. An apparatus for managing resources, the apparatus comprising:
the system comprises a first determining module, a second determining module and a third determining module, wherein the first determining module is used for determining the usable total amount of reference resources deployed in a deep learning training platform, the reference resources are graphics processing units (GPUs) deployed on the deep learning training platform, and the usable total amount of the reference resources is used for indicating the usable duration of the GPU;
a second determining module, configured to determine usage thresholds of a plurality of users according to the total usable amount, where the usage thresholds are used to indicate an amount of the reference resource that can be used by a task submitted by a user;
The apparatus further comprises:
the first acquisition module is used for acquiring the variation of the utilization rate of the GPU over time in a second reference time period which is before the current time and is nearest to the current time;
the third determining module is used for determining a recommended time period according to the variation of the utilization rate of the GPU over time, wherein the utilization rate of the GPU in the recommended time period is lower than that in other time periods in the second reference time period;
the sending module is used for sending recommendation information to the plurality of users, wherein the recommendation information carries the recommendation time period and is used for indicating the plurality of users to submit tasks needing to call the GPU in the recommendation time period.
14. A computer readable storage medium having stored thereon instructions which, when executed by a processor, implement the steps of the method of any of the preceding claims 1 to 12.
CN201910580361.0A 2019-06-28 2019-06-28 Method and device for managing resources and computer storage medium Active CN112148469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910580361.0A CN112148469B (en) 2019-06-28 2019-06-28 Method and device for managing resources and computer storage medium

Publications (2)

Publication Number Publication Date
CN112148469A CN112148469A (en) 2020-12-29
CN112148469B true CN112148469B (en) 2024-02-20

Family

ID=73892120

Country Status (1)

Country Link
CN (1) CN112148469B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant