CN112698922A - Resource scheduling method, system, electronic device and computer storage medium - Google Patents


Info

Publication number
CN112698922A
Authority
CN
China
Prior art keywords
resource
processor
scheduling
container
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110053688.XA
Other languages
Chinese (zh)
Inventor
赵铭
易文峰
杨正刚
李小芬
杨育
徐文娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southern Power Grid Digital Grid Research Institute Co Ltd
Shenzhen Digital Power Grid Research Institute of China Southern Power Grid Co Ltd
Original Assignee
Shenzhen Digital Power Grid Research Institute of China Southern Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Digital Power Grid Research Institute of China Southern Power Grid Co Ltd filed Critical Shenzhen Digital Power Grid Research Institute of China Southern Power Grid Co Ltd
Priority to CN202110053688.XA priority Critical patent/CN112698922A/en
Publication of CN112698922A publication Critical patent/CN112698922A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/4557 Distribution of virtual machine instances; Migration and load balancing

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The application discloses a resource scheduling method, a resource scheduling system, an electronic device, and a computer storage medium, relating to the field of containerization technology. The method acquires the resource application states of a plurality of resource devices, determines a resource device for allocation according to those states, acquires training data through the determined resource device, deploys a container image at fine granularity for each processor of the resource device, and schedules the processors of the resource device through the container images to train on the training data. In this way, individual GPU card resources on a GPU machine can be finely allocated and scheduled, improving GPU resource utilization.

Description

Resource scheduling method, system, electronic device and computer storage medium
Technical Field
The present application relates to the field of containerization technologies, and in particular, to a resource scheduling method, system, electronic device, and computer storage medium.
Background
With the increasingly wide application of emerging technologies such as artificial intelligence, image recognition, and neural networks, the demand for GPU (Graphics Processing Unit) card resources keeps growing. In the process of performing deep learning with GPU resources, manual intervention is required for allocation and scheduling so that the GPU resources can be used effectively. However, a current GPU machine often integrates multiple GPU cards, and a single GPU card on one GPU machine cannot be finely allocated and scheduled for performing deep learning and training on training data.
Disclosure of Invention
The present application is directed to solving at least one of the problems in the prior art. To this end, a resource scheduling method is provided that can finely allocate and schedule individual GPU card resources on a GPU machine and improve GPU resource utilization.
The application further provides a resource scheduling system applying the resource scheduling method.
The application further provides an electronic device applying the resource scheduling method.
The application further provides a computer-readable storage medium applying the resource scheduling method.
The resource scheduling method according to the embodiment of the first aspect of the application comprises the following steps: acquiring resource application states of a plurality of resource devices;
determining a resource device for allocation according to the resource application states of the plurality of resource devices;
acquiring training data through the resource device, and deploying a container image for a processor of the resource device;
and scheduling, through the container image, the processor of the resource device to train on the training data.
The resource scheduling method according to the embodiment of the application has at least the following beneficial effects: the resource application states of a plurality of resource devices are acquired, a resource device for allocation is determined according to those states, training data is acquired through the determined resource device, a container image is deployed at fine granularity for each processor of the resource device, and the processors of the resource device are scheduled through the container images to train on the training data. In this way, individual GPU card resources on a GPU machine can be finely allocated and scheduled, improving GPU resource utilization.
According to some embodiments of the application, the container image is generated from a container object, the container object being obtained by:
building a dependency environment for the container object;
and deploying application components and application software on the container object based on the dependency environment, wherein the application software is deployed together with the runtime environment it depends on.
According to some embodiments of the present application, the deploying a container image for a processor of the resource device comprises:
generating the container image according to the processor type of the processor and the container object;
and deploying the container image for the corresponding processor.
According to some embodiments of the present application, the acquiring of training data through the resource device comprises:
issuing an acquisition instruction to the resource device so that the resource device acquires the corresponding training data from a file database according to the acquisition instruction.
According to some embodiments of the present application, the scheduling, through the container image, of the processor of the resource device to train on the training data comprises:
acquiring a plurality of processors of the resource device through the container image;
configuring corresponding identity identifiers for the plurality of processors;
and scheduling the corresponding processor according to its identity identifier to train on the training data.
According to some embodiments of the present application, the issuing of the acquisition instruction to the resource device comprises:
issuing the acquisition instruction to the container image deployed on a processor of the resource device, the container image scheduling the processor to acquire the corresponding training data according to the acquisition instruction.
According to some embodiments of the application, the method further comprises:
if no resource device can be determined according to the resource application states of the plurality of resource devices, performing queue processing based on those resource application states and waiting to acquire a resource device for resource scheduling.
The resource scheduling system according to the embodiment of the second aspect of the present application includes:
an obtaining module, configured to obtain resource application states of a plurality of resource devices;
a determining module, configured to determine a resource device for allocation according to the resource application states of the plurality of resource devices;
a processing module, configured to acquire training data through the resource device and deploy a container image for a processor of the resource device;
and a scheduling module, configured to schedule, through the container image, the processor of the resource device to train on the training data.
The resource scheduling system according to the embodiment of the application has at least the following beneficial effects: the obtaining module obtains the resource application states of a plurality of resource devices, the determining module determines a resource device for allocation according to those states, the processing module acquires training data through the determined resource device and deploys a container image at fine granularity for each processor of the resource device, and the scheduling module schedules the processors of the resource device through the container images to train on the training data. In this way, individual GPU card resources on a GPU machine can be finely allocated and scheduled, improving GPU resource utilization.
An electronic device according to an embodiment of a third aspect of the present application includes: at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions for execution by the at least one processor to cause the at least one processor, when executing the instructions, to implement the method of resource scheduling according to the first aspect.
The electronic device according to the present application has at least the following beneficial effects: by executing the resource scheduling method of the first aspect, individual GPU card resources on a GPU machine can be finely allocated and scheduled, improving GPU resource utilization.
A computer-readable storage medium according to an embodiment of a fourth aspect of the present application, the computer-readable storage medium storing computer-executable instructions for causing a computer to perform the resource scheduling method according to the first aspect.
The computer-readable storage medium according to the present application has at least the following beneficial effects: by executing the resource scheduling method of the first aspect, individual GPU card resources on a GPU machine can be finely allocated and scheduled, improving GPU resource utilization.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
FIG. 1 is a flowchart illustrating a resource scheduling method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating step S300 of the resource scheduling method according to an embodiment of the present application;
FIG. 3 is a diagram of an exemplary application of the resource scheduling method in an embodiment of the present application;
FIG. 4 is another flowchart illustrating step S300 of the resource scheduling method according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating step S400 of the resource scheduling method according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of a resource scheduling system according to an embodiment of the present application.
Reference numerals:
an obtaining module 100, a determining module 200, a processing module 300, and a scheduling module 400.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
It should be noted that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that in the flowcharts. If the term "several" is used, it means one or more; if the term "a plurality of" is used, it means two or more. The use of any and all examples, or exemplary language ("e.g.", "such as", etc.), provided herein is intended merely to better illuminate embodiments of the application and does not limit the scope of the application unless otherwise claimed. Terms such as "greater than" and "less than" are understood to exclude the stated number, while "above" and "below" are understood to include it. If "first" and "second" are described, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or the precedence of the indicated technical features.
It is noted that, as used in the examples, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any combination of one or more of the associated listed items.
With the increasingly wide application of emerging technologies such as artificial intelligence, image recognition, and neural networks, the demand for GPU (Graphics Processing Unit) card resources keeps growing. In the process of performing deep learning with GPU resources, manual intervention is required for allocation and scheduling so that the GPU resources can be used effectively. However, a current GPU machine often integrates multiple GPU cards, and a single GPU card on one GPU machine cannot be finely allocated and scheduled for performing deep learning and training on training data.
Based on this, the embodiments of the present application provide a resource scheduling method, system, electronic device, and computer storage medium, which can perform fine allocation and scheduling on a single GPU card resource on a GPU machine, thereby improving the GPU resource utilization.
It should be noted that the resource scheduling method and the resource scheduling system mentioned in the embodiments of the present application are applicable to the deployment of GPU resources, and are not limited to the application in the deep learning field mentioned in the embodiments of the present application, and the embodiments of the present application are explained only by taking deep learning as an example.
In a first aspect, an embodiment of the present application provides a resource scheduling method.
In some embodiments, referring to fig. 1, a flowchart illustrating a resource scheduling method in an embodiment of the present application is shown. The method specifically comprises the following steps:
s100, acquiring resource application states of a plurality of resource devices;
s200, determining resource equipment for allocation according to the resource application states of the plurality of resource equipment;
s300, acquiring training data through the resource equipment, and deploying container mirror images for a processor of the resource equipment;
s400, training the training data through a processor of the container mirror image scheduling resource device.
In step S100, the resource application states of a plurality of resource devices in a server cluster are acquired. A resource device is a cluster device that can be used for training; in practice, the resource devices in the server cluster can be queried through the AI platform to obtain their current application states. The AI platform has the capability of uniformly managing the server cluster, which comprises a plurality of resource devices working simultaneously. The resource application state is the current working state of a resource device, from which it can be judged whether the device is working or idle.
In some embodiments, if querying the resource application states of the resource devices in the server cluster through the AI platform shows that no idle resource device currently exists, queue processing is performed based on the resource application states, i.e., the specific working states, of the plurality of resource devices. Queue processing here means queuing the request and waiting for the first resource device that becomes idle to perform resource allocation.
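The queue-processing fallback described above can be sketched as a simple FIFO of pending training requests. The class and method names below are hypothetical stand-ins for whatever the AI platform actually uses.

```python
from collections import deque

class ResourceQueue:
    """Hypothetical FIFO that holds training requests until a device idles.

    Illustrates only the "queue and wait for the first idle device"
    behaviour described above; not the patent's implementation."""

    def __init__(self):
        self.pending = deque()

    def submit(self, request, idle_devices):
        # If an idle device exists, allocate it immediately; otherwise queue.
        if idle_devices:
            return idle_devices[0]
        self.pending.append(request)
        return None

    def on_device_idle(self, device):
        # Called when a device becomes idle: serve the oldest waiting request.
        if self.pending:
            return self.pending.popleft(), device
        return None
```

FIFO order guarantees that requests queued earlier are served by the first device to become idle, matching the waiting behaviour described in the text.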
In step S200, after the resource application states of the multiple resource devices are acquired, the resource device to be allocated can be determined according to those states. Specifically, a resource device that is idle in the current server cluster is identified from the resource application states and determined as the device for allocation.
In step S300, the training data to be trained on is acquired through the allocated resource device, and a container image is deployed for the processor of that resource device. A container image can be regarded as a special file system: in addition to providing the programs, libraries, resources, configuration files, and so on required by the container at runtime, it also contains some configuration parameters prepared for the runtime (such as anonymous volumes, environment variables, and users). The image does not contain any dynamic data, and its content is not changed after it is built.
In some embodiments, the training data is acquired through the resource device as follows: the AI platform issues an acquisition instruction to the resource device so that the resource device acquires the corresponding training data from a file database according to the instruction. Specifically, the AI platform starts a training task and issues an acquisition instruction to the resource device; the instruction informs the resource device which training data needs to be trained on, and the resource device downloads the corresponding training data from the file server according to the instruction.
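The acquire-instruction flow might look like the following sketch, where the instruction names a dataset and the device resolves it against a file database. Every identifier here — the instruction fields, the dataset name, the database layout — is a hypothetical illustration.

```python
# Hypothetical file database: dataset name -> stored training data.
FILE_DB = {"faces-v1": ["img_001.png", "img_002.png"]}

def issue_acquisition_instruction(dataset_name):
    # AI platform side: start a training task and tell the resource device
    # which training data it needs to train on.
    return {"op": "acquire", "dataset": dataset_name}

def handle_instruction(instruction, file_db=FILE_DB):
    # Resource device side: download the corresponding training data from
    # the file server according to the acquisition instruction.
    if instruction["op"] != "acquire":
        raise ValueError("unsupported instruction")
    return file_db[instruction["dataset"]]
```

The point of the indirection is that the platform only names the dataset; the device performs the actual download, keeping data movement local to the machine that will train on it.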
In some embodiments, the container image is generated from a container object, where the container object is obtained by the method shown in fig. 2, which specifically includes the steps of:
S310, constructing a dependency environment for the container object;
S320, deploying the application components and the application software on the container object based on the dependency environment.
In step S310 and step S320, the application containerization technique writes the dependency environment of the container object and then deploys the required application components and application software on the container object with the built dependency environment. Application components refer to application platform systems and the like; application software refers to the dependent software that the application components need. When installed, the application software must be deployed together with its own runtime environment. In practice, referring to fig. 3, the application component is an artificial intelligence application platform, and the application software it requires includes TensorFlow, OpenCV, the jieba library, Flask, and so on; this software must be deployed as a whole with its corresponding runtime environment, so the application components and application software are installed on a container to generate the container object required in the embodiment of the present application. TensorFlow is a symbolic mathematics system based on dataflow programming, widely used to implement various machine learning algorithms. OpenCV is a BSD-licensed (open source), cross-platform computer vision and machine learning software library. The jieba library is a well-known Python third-party Chinese word segmentation library. Flask is a lightweight Web application framework written in Python: its WSGI toolkit is Werkzeug, its template engine is Jinja2, and Flask itself is BSD-licensed. Flask is also called a "microframework" because it uses a simple core, adds other functionality through extensions, and by default ships without a database layer or form-validation tools.
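As a hedged illustration of steps S310–S320, a container object of this kind could be described by a Dockerfile-style recipe assembled from a dependency environment plus the application components and software. The base image, package names, and entry-point command below are assumptions for illustration, not the patent's actual build recipe.

```python
def build_dockerfile(base_image, system_deps, python_pkgs, component_cmd):
    """Assemble a Dockerfile-style recipe: dependency environment first
    (S310), then application software and components on top (S320).
    All concrete names passed in are illustrative assumptions."""
    lines = [f"FROM {base_image}"]  # S310: the dependency environment
    if system_deps:
        lines.append("RUN apt-get update && apt-get install -y "
                     + " ".join(system_deps))
    if python_pkgs:
        # S320: application software, deployed with its runtime dependencies
        lines.append("RUN pip install " + " ".join(python_pkgs))
    # S320: the application component's entry point
    lines.append(f'CMD ["{component_cmd}"]')
    return "\n".join(lines)
```

A call such as `build_dockerfile("nvidia/cuda:11.0-base", ["libgl1"], ["tensorflow", "opencv-python", "jieba", "flask"], "python app.py")` would bundle the software stack named in the text with its runtime environment into one buildable object.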
It should be noted that deploying the container image on the processor makes it possible, later on, to quickly remove any impact on the runtime environment and roll back to the original runtime environment.
In some embodiments, the container image is generated from the constructed container object, specifically by the method shown in fig. 4, which includes the steps of:
S330, generating the container image according to the processor type of the processor and the container object;
S340, deploying the container image for the corresponding device.
In step S330 and step S340, the number and types of the processors on the resource device are obtained. It should be noted that the resource device includes a plurality of processors, and the processors are GPU cards; in practice GPU cards differ in type, that is, GPU cards from different manufacturers leave the factory in different configurations. Because of these manufacturer differences, when actually deploying a container image, the corresponding image must be generated for each manufacturer so that the container image can correctly invoke the GPU card.
In the embodiment of the application, deploying a corresponding container image for each of the multiple processors of the resource device enables one-to-one access to and deployment of the GPU card resources. The container image corresponding to each GPU card resource is resource-isolated and does not interfere with other container images, thereby avoiding resource preemption during artificial intelligence jobs. Moreover, artificial intelligence applications that need to run in different environments are deployed in different containers through the containerization technique, with the required software and components installed inside each container; since the environment outside the containers stays consistent, the problem of running different runtime environments on the same physical machine is solved.
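The per-manufacturer image selection of steps S330–S340 might be sketched as a lookup from processor type to image, deployed one-to-one across the device's processors. The vendor names and image tags are illustrative assumptions only.

```python
# Hypothetical mapping from GPU manufacturer/type to a container image tag.
VENDOR_IMAGES = {
    "nvidia": "ai-platform:gpu-nvidia",
    "amd": "ai-platform:gpu-amd",
}

def image_for_processor(processor_type):
    """S330: generate (here: select) the container image from the processor
    type, so the image can correctly invoke that manufacturer's card."""
    try:
        return VENDOR_IMAGES[processor_type.lower()]
    except KeyError:
        raise ValueError(f"no image recipe for processor type {processor_type!r}")

def deploy_images(processors):
    # S340: deploy one image per processor, one-to-one, giving each GPU card
    # its own isolated container image.
    return {p["id"]: image_for_processor(p["type"]) for p in processors}
```

The one-image-per-card mapping is what provides the resource isolation described above: each card's workload lives in its own container, so cards never preempt one another.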
In some embodiments, issuing the acquisition instruction to the resource device through the AI platform specifically means issuing the instruction to the container image deployed on the corresponding processor of the resource device; the container image then schedules the processor to acquire the corresponding training data according to the instruction. In practice, the acquisition instruction issued by the AI platform is relayed to the artificial intelligence application platform inside the container image deployed on the processor of the resource device, and that platform invokes the processor according to the received instruction to acquire the training data, that is, to download it from the file server.
In step S400, the processor of the resource device is scheduled through the container image to train on the training data. In practice, the artificial intelligence application platform inside the container image establishes a communication connection with the AI platform, and the AI platform can directly schedule the corresponding processor through that platform.
In some embodiments, referring to fig. 5, step S400 further includes the following steps:
s410, acquiring a plurality of processors of the resource equipment through container mirror images;
s420, configuring corresponding identity identifications for a plurality of processors;
and S430, scheduling the corresponding processor according to the identity to train the training data.
In step S410, after the container mirror is deployed on the corresponding processor, the container mirror may access the processor to determine data of the current processor, such as resource occupation status, processor type, and the like, the AI platform is connected to the container mirror through the API interface, and when the container mirror is in normal operation, the API interface queries the number of processors and processors owned by the resource device.
In step S420, corresponding identifiers are configured for different processors according to the obtained information corresponding to the processors and the processors, where the identifiers refer to numbers of the processors and are used for identifying the processors. In practical application, the GPU card is numbered, so that the GPU card has a corresponding number, and the GPU card is convenient to identify and identify.
In step S430, when the container mirror image is running, the AI platform may designate a corresponding GPU card to train the training data through the serial number of the GPU card by acquiring the application condition of each GPU card, and may directly designate a plurality of GPU card resources to train the training data respectively in practical application, thereby ensuring that the multithreading performs artificial intelligence training and improving the efficiency of data training.
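Steps S410–S430 can be sketched as follows: enumerate the GPUs visible inside the container image, number them, and pin a training task to a card by its identifier. For NVIDIA GPUs, setting `CUDA_VISIBLE_DEVICES` is one common real-world pinning mechanism; its use here, and the enumeration function itself, are illustrative stand-ins rather than the patent's API.

```python
import os

def assign_ids(processors):
    # S410/S420: enumerate the processors reported through the container
    # image and configure identity identifiers (sequential numbers).
    return {i: p for i, p in enumerate(processors)}

def schedule_on(gpu_id, train_fn, sample):
    # S430: pin the training task to the numbered card. CUDA_VISIBLE_DEVICES
    # is a real NVIDIA mechanism, but its use here is an assumption; train_fn
    # stands in for the actual training call.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return train_fn(sample)

ids = assign_ids(["card-A", "card-B"])  # hypothetical card descriptors
result = schedule_on(1, lambda s: s.upper(), "batch")
```

Designating several numbers at once (one `schedule_on` per card, each in its own worker) gives the parallel multi-card training described above.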
It should be noted that, in the embodiment of the present application, management and control are performed through the AI platform, and the artificial intelligence application platform inside the container image receives the instructions of the AI platform, thereby allocating the GPU card resources.
In the embodiment of the application, the resource application states of a plurality of resource devices are acquired, a resource device for allocation is determined according to those states, training data is acquired through the determined resource device, a container image is deployed at fine granularity for each processor of the resource device, and the processors of the resource device are scheduled through the container images to train on the training data. In this way, individual GPU card resources on a GPU machine can be finely allocated and scheduled, improving GPU resource utilization.
In a second aspect, an embodiment of the present application further provides a resource scheduling system for executing the resource scheduling method of the first aspect.
In some embodiments, referring to fig. 6, a schematic block diagram of the resource scheduling system in an embodiment of the present application is shown. The system specifically comprises: an obtaining module 100, a determining module 200, a processing module 300, and a scheduling module 400;
the obtaining module 100 is configured to obtain resource application states of a plurality of resource devices;
the determining module 200 is configured to determine a resource device for allocation according to the resource application states of the plurality of resource devices;
the processing module 300 is configured to acquire training data through the resource device and deploy a container image for a processor of the resource device;
the scheduling module 400 is configured to schedule, through the container image, the processor of the resource device to train on the training data.
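The four modules described above might be composed as in this minimal sketch; the class name, method names, and data shapes are hypothetical and only mirror the module responsibilities, not the patent's implementation.

```python
class ResourceScheduler:
    """Hypothetical composition of the four modules described above."""

    def acquire_states(self, devices):        # obtaining module 100
        return {d["name"]: d["state"] for d in devices}

    def determine_device(self, devices):      # determining module 200
        return next((d for d in devices if d["state"] == "idle"), None)

    def process(self, device):                # processing module 300
        data = f"training-data-for-{device['name']}"       # fetch training data
        images = [f"image-{g}" for g in device["gpus"]]    # deploy per-GPU images
        return data, images

    def schedule(self, data, images):         # scheduling module 400
        # Train via each container image: pair every image with the data.
        return [(img, data) for img in images]
```

Keeping each stage behind its own method mirrors the module split in fig. 6: state acquisition, device selection, data/image preparation, and per-image scheduling stay independently replaceable.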
It should be noted that, specific applications and functional descriptions of the functional modules mentioned in the embodiments of the present application have been described in detail in the embodiments of the first aspect, and therefore are not described herein again.
In the embodiment of the present application, the obtaining module 100 obtains the resource application states of a plurality of resource devices, the determining module 200 determines a resource device for allocation according to those states, the processing module 300 acquires training data through the determined resource device and deploys a container image at fine granularity for each processor of the resource device, and the scheduling module 400 schedules the processors of the resource device through the container images to train on the training data. In this way, individual GPU card resources on a GPU machine can be finely allocated and scheduled, improving GPU resource utilization.
In a third aspect, an embodiment of the present application further provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor;
wherein the processor is configured to execute the resource scheduling method in the embodiment of the first aspect by calling a computer program stored in the memory.
The memory, as a non-transitory computer-readable storage medium, may be used to store a non-transitory software program and a non-transitory computer-executable program, such as the resource scheduling method in the embodiment of the first aspect of the present application. The processor implements the resource scheduling method in the embodiment of the first aspect by running the non-transitory software program and instructions stored in the memory.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required by at least one function, and the storage data area may store data involved in the resource scheduling method in the embodiment of the first aspect. Further, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the resource scheduling method in the first aspect of the present application are stored in the memory and, when executed by one or more processors, perform the resource scheduling method in the first aspect.
In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium storing computer-executable instructions for performing the resource scheduling method in the first aspect.
In some embodiments, the computer-readable storage medium stores computer-executable instructions which, when executed by one or more control processors, for example by one of the processors in the electronic device of the third aspect, may cause the one or more processors to perform the resource scheduling method of the first aspect.
The system embodiments described above are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, as is known to those skilled in the art, communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media.
In the description herein, references to the description of the terms "some embodiments," "examples," "specific examples," or "some examples," etc., mean that a particular feature or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example.
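The container-image preparation recited in claims 2 and 3 (building a dependency environment for a container object, deploying application software on it, and generating an image according to the processor type) can be sketched as follows. The dictionary-based container-object representation, the function names, and the image-tag format are illustrative assumptions of this sketch, not the disclosed implementation.

```python
# Sketch of the container-object flow in claims 2 and 3: build a dependency
# environment for a container object, deploy application software based on
# the runtime it depends on, and generate a container image according to the
# processor type. All names and formats here are assumptions.

def build_container_object(dependencies):
    # Build the dependency environment for the container object.
    return {"deps": list(dependencies), "software": []}

def deploy_software(container_object, software, runtime):
    # Deploy application software based on the runtime environment it needs.
    container_object["software"].append({"name": software, "runtime": runtime})
    return container_object

def generate_image(container_object, processor_type):
    # Generate a container image tagged for the given processor type.
    return f"train-image:{processor_type}-{len(container_object['software'])}sw"

obj = build_container_object(["cuda", "cudnn"])
obj = deploy_software(obj, "pytorch", runtime="python3")
image = generate_image(obj, processor_type="gpu")
print(image)  # -> train-image:gpu-1sw
```

In practice the image would be built with a container tool such as Docker; the sketch only models the claimed sequence of building the dependency environment before deploying software and generating a processor-type-specific image.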

Claims (10)

1. A resource scheduling method, characterized by comprising the following steps:
acquiring resource application states of a plurality of resource devices;
determining a resource device for allocation according to the resource application states of the plurality of resource devices;
acquiring training data through the resource device, and deploying a container image for a processor of the resource device;
and scheduling the processor of the resource device to train the training data through the container image.
2. The resource scheduling method according to claim 1, wherein the container image is generated from a container object, and the container object is obtained by:
building a dependency environment for the container object;
and deploying application components and application software on the container object based on the dependency environment, wherein the application software is deployed based on the running environment depended by the application software.
3. The resource scheduling method according to claim 2, wherein the deploying a container image for a processor of the resource device comprises:
generating the container image according to a processor type of the processor and the container object;
and deploying the container image for the corresponding processor.
4. The resource scheduling method according to claim 3, wherein the acquiring training data through the resource device comprises:
and issuing an acquisition instruction to the resource equipment so that the resource equipment acquires corresponding training data from a file database according to the acquisition instruction.
5. The resource scheduling method according to claim 4, wherein the scheduling the processor of the resource device to train the training data through the container image comprises:
acquiring a plurality of processors of the resource device through the container image;
configuring corresponding identity identifiers for the plurality of processors;
and scheduling a corresponding processor according to the identity identifier to train the training data.
6. The resource scheduling method according to claim 5, wherein the issuing an acquisition instruction to the resource device comprises:
issuing the acquisition instruction to a container image deployed for a processor of the resource device, and scheduling, by the container image, the processor to acquire the corresponding training data according to the acquisition instruction.
7. The resource scheduling method according to claim 6, further comprising:
if a resource device cannot be determined according to the resource application states of the plurality of resource devices, performing queue processing based on the resource application states of the plurality of resource devices, and waiting to acquire a resource device for resource scheduling.
8. A resource scheduling system, comprising:
an obtaining module, configured to obtain resource application states of a plurality of resource devices;
a determining module, configured to determine a resource device for allocation according to the resource application states of the plurality of resource devices;
a processing module, configured to acquire training data through the resource device and deploy a container image for a processor of the resource device;
and a scheduling module, configured to schedule the processor of the resource device to train the training data through the container image.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions for execution by the at least one processor to cause the at least one processor, when executing the instructions, to implement the resource scheduling method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores computer-executable instructions for causing a computer to perform the resource scheduling method according to any one of claims 1 to 7.
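As an illustration of the identifier-based scheduling in claim 5 and the queueing fallback in claim 7, the following sketch assigns each processor an identity identifier, schedules training by identifier, and queues requests when no resource device is available. The queue, the `GPU-n` identifier format, and all function names are assumptions made for this sketch, not the disclosed implementation.

```python
from collections import deque

# Sketch of claims 5 and 7: configure identity identifiers for the processors
# visible through a container image, schedule training by identifier, and
# perform queue processing when no resource device can be determined.

def configure_identifiers(num_gpus):
    # Assign each processor an identity identifier (assumed GPU-0..GPU-n-1).
    return [f"GPU-{i}" for i in range(num_gpus)]

def schedule_by_identifier(identifiers, wanted, training_data):
    # Schedule the processor matching the given identity identifier.
    if wanted not in identifiers:
        raise ValueError(f"unknown processor {wanted}")
    return f"train {len(training_data)} samples on {wanted}"

pending = deque()  # requests waiting for a resource device (claim 7)

def submit(request, free_gpus):
    # Queue the request when no resource device can be determined; otherwise
    # configure identifiers and schedule the first available processor.
    if free_gpus == 0:
        pending.append(request)
        return "queued"
    ids = configure_identifiers(free_gpus)
    return schedule_by_identifier(ids, ids[0], request["data"])

print(submit({"data": [1, 2]}, free_gpus=0))  # -> queued
print(submit({"data": [1, 2]}, free_gpus=2))  # -> train 2 samples on GPU-0
```

In a real deployment the identity identifiers would typically map to physical GPU indices exposed inside the container (for example via an environment variable such as `CUDA_VISIBLE_DEVICES`); the sketch only models the claimed identifier-then-schedule sequence.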
CN202110053688.XA 2021-01-15 2021-01-15 Resource scheduling method, system, electronic device and computer storage medium Pending CN112698922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110053688.XA CN112698922A (en) 2021-01-15 2021-01-15 Resource scheduling method, system, electronic device and computer storage medium


Publications (1)

Publication Number Publication Date
CN112698922A 2021-04-23

Family

ID=75515226

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110053688.XA Pending CN112698922A (en) 2021-01-15 2021-01-15 Resource scheduling method, system, electronic device and computer storage medium

Country Status (1)

Country Link
CN (1) CN112698922A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109634748A (en) * 2018-12-12 2019-04-16 深圳前海微众银行股份有限公司 Cluster resource dispatching method, device, equipment and computer readable storage medium
CN110888743A (en) * 2019-11-27 2020-03-17 中科曙光国际信息产业有限公司 GPU resource using method, device and storage medium
CN111913794A (en) * 2020-08-04 2020-11-10 北京百度网讯科技有限公司 Method and device for sharing GPU, electronic equipment and readable storage medium
CN112131007A (en) * 2020-09-28 2020-12-25 济南浪潮高新科技投资发展有限公司 GPU resource scheduling method, device and medium based on AI platform


Similar Documents

Publication Publication Date Title
CN111176852A (en) Resource allocation method, device, chip and computer readable storage medium
CN110413412B (en) GPU (graphics processing Unit) cluster resource allocation method and device
CN111045795A (en) Resource scheduling method and device
CN108717379A (en) Electronic device, distributed task dispatching method and storage medium
CN111880936A (en) Resource scheduling method and device, container cluster, computer equipment and storage medium
CN113553190B (en) Computing cluster system, scheduling method, device and storage medium
CN113867959A (en) Training task resource scheduling method, device, equipment and medium
CN112486642B (en) Resource scheduling method, device, electronic equipment and computer readable storage medium
CN115048216A (en) Resource management scheduling method, device and equipment for artificial intelligence cluster
CN111953503B (en) NFV resource deployment arrangement method and network function virtualization orchestrator
CN116382880A (en) Task execution method, device, processor, electronic equipment and storage medium
CN114943885A (en) Synchronous cache acceleration method and system based on training task
CN113301087B (en) Resource scheduling method, device, computing equipment and medium
CN112698922A (en) Resource scheduling method, system, electronic device and computer storage medium
CN111143063B (en) Task resource reservation method and device
CN111752716A (en) Model using method, data processing method and device
CN110389817B (en) Scheduling method, device and computer readable medium of multi-cloud system
CN112114958A (en) Resource isolation method, distributed platform, computer device, and storage medium
CN116166421A (en) Resource scheduling method and equipment for distributed training task
CN115129449A (en) Resource scheduling method and device for shared GPU
CN114356516A (en) Resource scheduling method, related device, equipment and storage medium
CN114675954A (en) Task scheduling method and device
CN114489970A (en) Method and system for realizing queue sequencing by using scheduling plug-in Kubernetes
CN112988383A (en) Resource allocation method, device, equipment and storage medium
CN115543765A (en) Test case scheduling method and device, computer equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210423