CN111324457A - Method, device, equipment and medium for issuing inference service in GPU cluster - Google Patents

Method, device, equipment and medium for issuing inference service in GPU cluster

Info

Publication number
CN111324457A
CN111324457A
Authority
CN
China
Prior art keywords
service
inference
issuing
cluster
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010094136.9A
Other languages
Chinese (zh)
Inventor
袁绍
辛永欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010094136.9A priority Critical patent/CN111324457A/en
Publication of CN111324457A publication Critical patent/CN111324457A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5072 Grid computing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a method for issuing inference services in a GPU cluster, which comprises the following steps: acquiring the resource information and image of the resources required by the inference service; acquiring the service release script required to start the inference service, and packaging the service release script; sending the resource information to a scheduling system and receiving the cluster resources allocated to the inference service by the scheduling system according to the resource information; building an inference service environment according to the image and the allocated cluster resources; and issuing the inference service through an API service according to the provided service release script. The invention also discloses a corresponding apparatus, device and medium. The invention can rapidly schedule resources to create an AI environment in GPU cluster scenarios.

Description

Method, device, equipment and medium for issuing inference service in GPU cluster
Technical Field
The present invention relates to the field of image processing, and in particular, to a method, an apparatus, a device, and a medium for issuing inference service in a GPU cluster.
Background
With the explosive development of the artificial intelligence industry, the computing and data resources required by AI applications have become increasingly large. Most AI companies choose to build GPU clusters; but as cluster scale grows, how to rapidly allocate GPU computing resources and build inference environments on GPU clusters has become a problem faced by all large companies.
At present, the first approach commonly adopted for issuing inference services in a cluster is as follows: the customer manually specifies cluster nodes, manually installs CUDA, NVIDIA drivers and the deep learning framework required by the model on the host, and issues the inference service with the dependencies the model requires.
The second approach commonly adopted for issuing inference services in a cluster is as follows: GPU nodes are manually designated in a virtual machine scheme, and a virtual machine image is installed to start the inference service. GPUs support virtual machines poorly; all GPU cards of the host must be passed through directly into the virtual machine, after which CUDA and NVIDIA drivers, a deep learning framework and various dependencies are installed manually before the inference service can be issued.
When the inference service is deployed manually on a host, GPU resources cannot be allocated automatically and various service characteristics cannot be detected; moreover, deploying the inference service on a physical machine introduces problems such as dependency conflicts and port conflicts.
In addition, virtual machine images are huge and poorly portable, and a virtual machine can only map all GPU cards into itself in pass-through mode, so GPU cards cannot be allocated flexibly. Inference files generated by different frameworks are issued as different inference services and cannot be issued in a uniform way.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a method, an apparatus, a device, and a medium for issuing inference services in a GPU cluster, so as to solve the above technical problems.
Based on the above object, one aspect of the present invention provides a method for issuing inference services in a GPU cluster, the method comprising: acquiring the resource information and image of the resources required by the inference service; acquiring the service release script required to start the inference service, and packaging the service release script; sending the resource information to a scheduling system and receiving the cluster resources allocated to the inference service by the scheduling system according to the resource information; building an inference service environment according to the image and the allocated cluster resources; and issuing the inference service through an API service according to the provided service release script.
In some embodiments of the method for issuing inference services in a GPU cluster of the present invention, the resource information of the resources comprises: hardware resource information, the number of CPU cores, the video memory size, and the number of GPUs.
In some embodiments of the method for issuing inference services in a GPU cluster of the present invention, acquiring the resource information and image of the resources required by the inference service further comprises: selecting whether GPU resources are needed according to the inference service and, in response to GPU resources being needed, configuring the minimum granularity of the GPU resource requirement as a single card.
In some embodiments of the method for issuing inference services in a GPU cluster of the present invention, the method further comprises: establishing an image management repository configured to manage and monitor the inference service environment.
In some embodiments of the method for issuing inference services in a GPU cluster of the present invention, issuing the inference service through an API service according to the provided service release script further comprises: uniformly issuing the inference service as an API interface through a Flask service according to the provided service release script.
In another aspect of the embodiments of the present invention, an apparatus for issuing inference services in a GPU cluster is further provided, where the apparatus includes: an information acquisition module, configured to acquire the resource information and image of the resources required by the inference service; a script starting module, configured to acquire the service release script required to start the inference service and package the service release script; a resource allocation module, configured to send the resource information to the scheduling system and receive the cluster resources allocated to the inference service by the scheduling system according to the resource information; an environment building module, configured to build an inference service environment according to the image and the allocated cluster resources; and an inference issuing module, configured to issue the inference service through an API service according to the provided service release script.
In some embodiments of the apparatus for issuing inference services in a GPU cluster of the present invention, the apparatus further comprises: a management repository module, configured to establish an image management repository that manages and monitors the inference service environment.
In some embodiments of the apparatus for issuing inference services in a GPU cluster of the present disclosure, the inference issuing module is further configured to: uniformly issue the inference service as an API interface through a Flask service according to the provided service release script.
In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing a computer program operable on the processor, the processor executing the program to perform the aforementioned method of issuing inference services in a cluster of GPUs.
In another aspect of the embodiments of the present invention, a computer-readable storage medium is further provided, storing a computer program which, when executed by a processor, performs the foregoing method for issuing inference services in a GPU cluster.
The invention has at least the following beneficial technical effects: it provides a method for rapidly issuing inference services in a GPU cluster; resources can be scheduled rapidly to create an AI environment in GPU cluster scenarios; cluster administrators are freed from complex resource allocation and environment management work; and a unified way of issuing inference services is provided, so that inference services trained with different frameworks can be deployed and issued uniformly.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings without creative effort.
FIG. 1 shows a schematic block diagram of an embodiment of a method of publishing inference services in a cluster of GPUs, in accordance with the present invention;
FIG. 2 shows a flowchart of an embodiment of a method of publishing inference services in a cluster of GPUs according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that have the same name but are not identical. "First" and "second" are used only for convenience of description and should not be construed as limiting the embodiments of the present invention, and the following embodiments no longer explain this case by case.
Based on the above object, a first aspect of the embodiments of the present invention provides an embodiment of a method for issuing inference service in a GPU cluster. Fig. 1 shows a schematic block diagram of an embodiment of a method for publishing inference services in a cluster of GPUs according to the invention. In the embodiment shown in fig. 1, the method comprises at least the following steps:
S100, acquiring the resource information and image of the resources required by the inference service;
S200, acquiring the service release script required to start the inference service, and packaging the service release script;
S300, sending the resource information to a scheduling system and receiving the cluster resources allocated to the inference service by the scheduling system according to the resource information;
S400, building an inference service environment according to the image and the allocated cluster resources;
and S500, issuing the inference service through an API service according to the provided service release script.
FIG. 2 is a flowchart illustrating an embodiment of a method for publishing inference services in a cluster of GPUs according to the present invention. In some embodiments of the invention as shown in fig. 2, the method is implemented as follows: the customer selects the resource information required by the service and the image required by the inference service, which are passed uniformly through the scheduling system interface. The customer also selects the service start script required to start the model service; the service release script is packaged and the model inputs and outputs are set. The customer sends the required resource information (CPU/GPU) uniformly to a scheduling system developed on the basis of PBS, and the PBS scheduling system dynamically allocates cluster resources. After the resources are allocated, the inference service environment is built according to the image name and the allocated resources; this scheme uniformly uses Docker to issue inference services, and the inference service environments are managed uniformly. After the image is pulled, the deep learning driver can be collected; this process uses native Docker driver mapping rather than nvidia-docker, avoiding the limitations of nvidia-docker. During image startup, the required resources and development code are mapped into the environment, and the development environment is started.
The exemplary core processing code is provided in the original publication as figures:
[Core processing code reproduced as images in the original publication; not recoverable from the text.]
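Since that code survives only as images, the following is a minimal sketch of the environment-building step described above: starting a Docker container with native device mapping for the allocated GPU cards instead of nvidia-docker, and mounting the driver libraries and development code into the environment. The image name, device paths, directory layout and function name are illustrative assumptions, not the patent's actual code.

```python
import subprocess

def start_inference_container(image, gpu_ids, code_dir, driver_dir,
                              name="inference-svc"):
    """Start an inference container with native Docker device mapping
    (rather than nvidia-docker), as the described scheme does.

    image      -- Docker image name from the image repository
    gpu_ids    -- GPU card indices allocated by the scheduler
    code_dir   -- host directory with the model and release script
    driver_dir -- host directory with the NVIDIA driver libraries
    """
    cmd = ["docker", "run", "-d", "--name", name]
    # Map only the allocated GPU cards into the container.
    for gpu_id in gpu_ids:
        cmd += ["--device", f"/dev/nvidia{gpu_id}"]
    cmd += ["--device", "/dev/nvidiactl", "--device", "/dev/nvidia-uvm"]
    # Map the driver libraries and development code into the environment.
    cmd += ["-v", f"{driver_dir}:/usr/local/nvidia:ro",
            "-v", f"{code_dir}:/workspace"]
    cmd.append(image)
    subprocess.run(cmd, check=True)

# Hypothetical call: two cards allocated by the scheduler.
# start_inference_container("registry.local/infer:latest", [0, 1],
#                           "/data/model_svc", "/usr/lib/nvidia")
```

Mapping only the allocated /dev/nvidiaN devices into the container is one way to realize the single-card minimum granularity described elsewhere in this scheme.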
After the inference environment is built, the inference service is issued as an API interface according to the provided inference service script, and the customer can publish the model through the API service.
According to some embodiments of the method for issuing inference services in a GPU cluster of the present invention, the resource information of the resources comprises: hardware resource information, the number of CPU cores, the video memory size, and the number of GPUs.
In some embodiments of the invention, the customer selects the resource information required by the service, including hardware resources, the number of CPU cores, and the video memory size; whether GPU resources are needed is selected according to the model, and if GPU resources are needed, the number of GPUs is selected as required.
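A minimal sketch of such a resource request follows; the field names and registry address are assumptions for illustration, not the patent's schema.

```python
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    """Resource information the customer submits; field names are
    illustrative, not the patent's actual schema."""
    cpu_cores: int       # number of CPU cores
    mem_gb: int          # video memory size, in GB
    use_gpu: bool        # whether the model needs GPU resources
    gpu_count: int = 0   # number of GPU cards, single-card granularity
    image: str = ""      # Docker image required by the inference service

# Hypothetical request for a one-card inference service.
req = ResourceRequest(cpu_cores=8, mem_gb=16, use_gpu=True, gpu_count=1,
                      image="registry.local/infer:latest")
```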
According to some embodiments of the method for issuing inference services in a GPU cluster of the present invention, acquiring the resource information and image of the resources required by the inference service further comprises: selecting whether GPU resources are needed according to the inference service and, in response to GPU resources being needed, configuring the minimum granularity of the GPU resource requirement as a single card.
In some embodiments of the present invention, the customer sends the required resource information (CPU/GPU) uniformly to a scheduling system developed on the basis of PBS, and the PBS scheduling system dynamically allocates cluster resources; if GPU resources are needed, they are allocated at a minimum granularity of a single card.
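A sketch of handing such a request to a PBS-style scheduler is shown below. The qsub select= syntax is standard PBS Pro, but the ngpus resource is a commonly configured site resource rather than a default, and the job script path is an assumption.

```python
import subprocess

def submit_to_scheduler(cpu_cores, mem_gb, gpu_count=0,
                        job_script="/opt/infer/start_env.sh"):
    """Submit a resource request to a PBS-style scheduler via qsub."""
    select = f"select=1:ncpus={cpu_cores}:mem={mem_gb}gb"
    if gpu_count > 0:
        select += f":ngpus={gpu_count}"  # single-card minimum granularity
    out = subprocess.run(["qsub", "-l", select, job_script],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()  # the PBS job id, e.g. "1234.headnode"

# job_id = submit_to_scheduler(cpu_cores=8, mem_gb=16, gpu_count=1)
```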
According to some embodiments of the method for issuing inference services in a GPU cluster of the present invention, the method further comprises: establishing an image management repository configured to manage and monitor the inference service environment.
In some embodiments of the invention, the training environment of this scheme adopts a Docker container solution, and a unified Docker image management repository is built so that the inference service environments are managed uniformly.
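As a sketch of fetching a service image from such a unified repository before the environment is built; the registry address is a placeholder assumption.

```python
import subprocess

def pull_image(name, registry="registry.local:5000"):
    """Pull an inference image from the unified Docker image repository.
    The registry address is a placeholder, not the patent's deployment."""
    image = f"{registry}/{name}"
    subprocess.run(["docker", "pull", image], check=True)
    return image
```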
According to some embodiments of the method for issuing inference services in a GPU cluster of the present invention, issuing the inference service through an API service according to the provided service release script further comprises: uniformly issuing the inference service as an API interface through a Flask service according to the provided service release script.
In some embodiments of the invention, after the inference environment is built, the inference service is uniformly issued as an API interface through a Flask service according to the provided inference service script, and the customer can publish the model through the API service.
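A minimal sketch of such Flask-based publication follows, with a stand-in model and an assumed JSON request format; the route name and input/output schema are illustrative, not the patent's interface.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

class DummyModel:
    """Stand-in for the model that the packaged release script loads;
    the real scheme loads an inference file trained by any framework."""
    def predict(self, rows):
        return [sum(row) for row in rows]  # placeholder computation

model = DummyModel()

@app.route("/v1/infer", methods=["POST"])
def infer():
    # The input/output format is whatever the release script configured;
    # JSON-in/JSON-out is assumed for this sketch.
    inputs = request.get_json()
    return jsonify({"result": model.predict(inputs["data"])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```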
In another aspect, an embodiment of the present invention provides an apparatus for issuing inference services in a GPU cluster. The apparatus includes:
an information acquisition module, configured to acquire the resource information and image of the resources required by the inference service;
a script starting module, configured to acquire the service release script required to start the inference service and package the service release script;
a resource allocation module, configured to send the resource information to the scheduling system and receive the cluster resources allocated to the inference service by the scheduling system according to the resource information;
an environment building module, configured to build an inference service environment according to the image and the allocated cluster resources;
and an inference issuing module, configured to issue the inference service through an API service according to the provided service release script.
According to some embodiments of the apparatus for issuing inference services in a GPU cluster of the present invention, the apparatus further comprises: a management repository module, configured to establish an image management repository that manages and monitors the inference service environment.
According to some embodiments of the apparatus for issuing inference services in a GPU cluster of the present disclosure, the inference issuing module is further configured to: uniformly issue the inference service as an API interface through a Flask service according to the provided service release script.
In view of the above object, another aspect of the embodiments of the present invention further provides a computer device, including: at least one processor; and a memory storing a computer program operable on the processor, the processor executing the program to perform the aforementioned method of issuing inference services in a cluster of GPUs.
In another aspect of the embodiments of the present invention, a computer-readable storage medium is further provided, storing a computer program which, when executed by a processor, performs the foregoing method for issuing inference services in a GPU cluster.
Likewise, those skilled in the art will appreciate that all of the embodiments, features and advantages set forth above with respect to the method of publishing inference services in a cluster of GPUs according to the present invention apply equally well to the apparatus, computer devices and media according to the present invention. For the sake of brevity of the present disclosure, no repeated explanation is provided herein.
It should be particularly noted that the steps in the embodiments of the method, apparatus, device and medium for issuing inference services in a GPU cluster described above may be interchanged, replaced, added or deleted; methods, apparatuses, devices and media transformed by such reasonable permutations and combinations therefore also belong to the scope of the present invention, and the scope of the invention should not be limited to the described embodiments.
Finally, it should be noted that, as one of ordinary skill in the art will appreciate, all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware, and the program of the method for issuing inference services in the GPU cluster can be stored in a computer-readable storage medium; when executed, the program can include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. Which when executed by a processor performs the above-described functions defined in the methods disclosed in embodiments of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples; within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A method for issuing inference services in a GPU cluster, the method comprising:
acquiring the resource information and image of the resources required by the inference service;
acquiring a service release script required for starting the inference service, and packaging the service release script;
sending the resource information to a scheduling system and receiving the cluster resources allocated to the inference service by the scheduling system according to the resource information;
building an inference service environment according to the image and the allocated cluster resources;
and issuing the inference service through an API service according to the provided service release script.
2. The method for issuing inference services in a GPU cluster according to claim 1, wherein the resource information of the resources comprises: hardware resource information, the number of CPU cores, the video memory size, and the number of GPUs.
3. The method of claim 1, wherein acquiring the resource information and image of the resources required by the inference service further comprises:
selecting whether GPU resources are needed according to the inference service and, in response to GPU resources being needed, configuring the minimum granularity of the GPU resource requirement as a single card.
4. The method for publishing inference services in a GPU cluster as recited in claim 1, wherein the method further comprises:
establishing an image management repository configured to manage and monitor the inference service environment.
5. The method of claim 1, wherein issuing the inference service through an API service according to the provided service release script further comprises:
and uniformly issuing inference services as API interfaces through flash services according to the provided service issuing scripts.
6. An apparatus for publishing inference services in a cluster of GPUs, the apparatus comprising:
the information acquisition module is configured to acquire the resource information and image of the resources required by the inference service;
the script starting module is configured to acquire a service release script required for starting the inference service and package the service release script;
the resource allocation module is configured to send the resource information to a scheduling system and receive cluster resources allocated to the inference service by the scheduling system according to the resource information;
the environment building module is configured to build an inference service environment according to the image and the allocated cluster resources;
and the inference issuing module is configured to issue the inference service through an API service according to the provided service release script.
7. The apparatus for publishing inference service in a GPU cluster as recited in claim 6, wherein the apparatus further comprises:
a management repository module configured to establish an image management repository, the image management repository being configured to manage and monitor the inference service environment.
8. The apparatus for publishing inference service in a GPU cluster of claim 6, wherein the inference issuing module is further configured to:
and uniformly issuing inference services as API interfaces through flash services according to the provided service issuing scripts.
9. A computer device, comprising:
at least one processor; and
memory storing a computer program operable on the processor, wherein the processor, when executing the program, performs the method of any of claims 1-5.
10. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the method of any one of claims 1 to 5.
CN202010094136.9A 2020-02-15 2020-02-15 Method, device, equipment and medium for issuing inference service in GPU cluster Withdrawn CN111324457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010094136.9A CN111324457A (en) 2020-02-15 2020-02-15 Method, device, equipment and medium for issuing inference service in GPU cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010094136.9A CN111324457A (en) 2020-02-15 2020-02-15 Method, device, equipment and medium for issuing inference service in GPU cluster

Publications (1)

Publication Number Publication Date
CN111324457A true CN111324457A (en) 2020-06-23

Family

ID=71172713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010094136.9A Withdrawn CN111324457A (en) 2020-02-15 2020-02-15 Method, device, equipment and medium for issuing inference service in GPU cluster

Country Status (1)

Country Link
CN (1) CN111324457A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015521A (en) * 2020-09-30 2020-12-01 北京百度网讯科技有限公司 Configuration method and device of inference service, electronic equipment and storage medium
CN112015521B (en) * 2020-09-30 2024-06-07 北京百度网讯科技有限公司 Configuration method and device of reasoning service, electronic equipment and storage medium
CN113112023A (en) * 2021-06-15 2021-07-13 苏州浪潮智能科技有限公司 Inference service management method, device, system and medium of inference platform
CN113112023B * 2021-06-15 2021-08-31 Suzhou Inspur Intelligent Technology Co Ltd Inference service management method and device of AIStation inference platform
WO2022262148A1 (en) * 2021-06-15 2022-12-22 苏州浪潮智能科技有限公司 Inference service management method, apparatus and system for inference platform, and medium
US11994958B2 (en) 2021-06-15 2024-05-28 Inspur Suzhou Intelligent Technology Co., Ltd. Inference service management method, apparatus and system for inference platform, and medium
CN113742064A (en) * 2021-08-06 2021-12-03 苏州浪潮智能科技有限公司 Resource arrangement method, system, equipment and medium for server cluster
CN114661465A (en) * 2022-03-17 2022-06-24 维塔科技(北京)有限公司 Resource management method, device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN111324457A (en) Method, device, equipment and medium for issuing inference service in GPU cluster
WO2017107414A1 (en) File operation method and device
US10983816B2 (en) Self-adaptive building container images
CN111143446A (en) Data structure conversion processing method and device of data object and electronic equipment
CN109660372A (en) A kind of method and device of the business configuration of SDN
CN111625320A (en) Mirror image management method, system, device and medium
US9971751B1 (en) Version-specific request processing
CN114844788B (en) Network data analysis method, system, equipment and storage medium
CN116303309A (en) File mounting method and device and electronic equipment
CN111767345B (en) Modeling data synchronization method, modeling data synchronization device, computer equipment and readable storage medium
AU2021268828B2 (en) Secure data replication in distributed data storage environments
US20230031636A1 (en) Artificial intelligence (ai) model deployment
CN115469807A (en) Disk function configuration method, device, equipment and storage medium
CN114489674A (en) Data verification method and device of dynamic data model
JP2023549661A (en) Media Capture Device with Power Saving and Encryption Features for Partitioned Neural Networks
CN114157658A (en) Mirror image warehouse deployment method and device, electronic equipment and computer readable medium
CN114090127A (en) Electronic device, loading method and medium of configuration file of electronic device
CN113806076A (en) Method, device and equipment for allocating memory in four-control environment and readable medium
CN116501449B (en) Method and system for managing container files in cloud primary environment
CN113127430A (en) Mirror image information processing method and device, computer readable medium and electronic equipment
CN113342837B (en) Data transmission method, device, electronic equipment and computer readable medium
CN116820354B (en) Data storage method, data storage device and data storage system
CN113704187B (en) Method, apparatus, server and computer readable medium for generating file
CN114911421B (en) Data storage method, system, device and storage medium based on CSI plug-in
US20240160427A1 (en) System and method of offloading and migrating management controller functionalities using containerized services and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200623