CN118034712A - Model application deployment system and method - Google Patents

Model application deployment system and method

Info

Publication number
CN118034712A
Authority
CN
China
Prior art keywords
model
inference
module
engine module
application deployment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410186209.5A
Other languages
Chinese (zh)
Inventor
汤泽云
张子菡
唐波
蔺泽浩
熊飞宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute Of Algorithm Innovation
Original Assignee
Shanghai Institute Of Algorithm Innovation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute Of Algorithm Innovation
Priority to CN202410186209.5A
Publication of CN118034712A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/60: Software deployment
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083: Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a model application deployment system, comprising: one or more inference engine modules for executing model inference tasks and generating model inference results; and a controller module for managing and scheduling the inference engine modules. The invention provides a unified management and scheduling platform that manages and schedules different types of models uniformly, improving management efficiency and resource utilization. Unified management of multiple model types is supported, reducing management complexity and maintenance cost. Multi-instance deployment and personalized load balancing are supported, improving resource utilization and inference performance. The invention not only solves prior-art problems such as fragmented inference frameworks, difficult multi-type model management, and insufficient load balancing, but also provides an efficient, unified, and personalized large language model application deployment system and method with strong practicability and innovation.

Description

Model application deployment system and method
Technical Field
The invention relates to the field of computers, in particular to a model application deployment system and method.
Background
In the current big-data era, large language models (LLMs) exhibit excellent performance in fields such as natural language processing and text generation. A large language model is a deep-learning-based model designed specifically for understanding and generating natural language text. These models typically use a transformer architecture and are trained on large-scale data sets to learn the complex patterns and structures of language. The inference task of a large language model is to generate text with the trained model and thereby solve a given problem. However, the deployment and management of these large models face a series of challenges: large language model inference is generally an autoregressive task that consumes both GPU memory and computing resources; different inference frameworks are incompatible with one another; managing multiple models is complex; load balancing has personalized requirements; and large-scale models additionally require distributed inference. As a result, the throughput and latency achieved in practice often cannot meet the requirements of application scenarios.
A variety of advanced inference frameworks are currently on the market. To improve model inference efficiency, researchers have proposed techniques such as the KV cache, FlashAttention, PagedAttention, and continuous batching, and engineers have developed inference frameworks such as vLLM, TGI, and TensorRT-LLM for deploying large-model inference. However, these frameworks have different interfaces and deployment logic, which makes large-model application deployment complex and difficult to unify. Large language models often also need to interoperate with other small models, yet existing frameworks usually support only a single type of model and lack an integrated solution. Load-balancing strategies targeted at large models are limited and cannot satisfy the personalized load-balancing needs of real scenarios, which reduces the inference efficiency of the whole system. Although each deployment framework on the market has its advantages, each is usually the back end for a single model type, lacking a unified and flexible solution; key functions such as load balancing and multi-instance deployment are also missing, so the practical requirements of production environments cannot be met.
Therefore, those skilled in the art are devoted to developing a model application deployment system and method that solve the problems of framework fragmentation, the inability to unify multiple large-model frameworks, the inefficiency of deploying large-model application systems, and the overall inefficiency caused by the inability to implement personalized load-balancing strategies.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the present invention aims to solve the prior art's problems in model application deployment: the lack of a unified management and scheduling platform, the inability to manage multiple types of models uniformly, and the inability to implement multi-instance deployment and personalized load balancing.
To achieve the above object, the present invention provides a model application deployment system, comprising:
one or more inference engine modules for executing model inference tasks and generating model inference results;
and a controller module for managing and scheduling the inference engine modules.
Benefiting from the proposed model application deployment system, a variety of inference frameworks are unified and different types of models can be managed and scheduled uniformly, so that the system incorporates the inference optimization techniques currently available on the market and improves management efficiency and resource utilization.
Further, the inference engine modules include:
one or more large language model inference engine modules (LLM-Worker) for executing large language model inference tasks;
one or more general model inference engine modules (General-Worker) for executing general model inference tasks.
Benefiting from the inference engine modules' support for different types of models, unified management of multiple model types is realized, reducing management complexity and maintenance cost.
Further, the controller module includes:
a task receiving unit for receiving model inference requests from users;
a task scheduling unit for selecting a suitable inference engine module to execute a model inference task according to the load conditions of the inference engine modules;
and a task registration unit for receiving registration information of the inference engine modules and maintaining a list storing inference engine module information.
Benefiting from the cooperation of the controller module's functional units, multi-instance deployment and personalized load balancing are achieved, improving resource utilization and inference performance.
Further, the model application deployment system comprises:
a model storage unit for storing models to be deployed;
and a model management unit for managing model deployment, update, and unloading.
Benefiting from the proposed model storage and management units, the storage and management efficiency of system models is improved and storage and management costs are reduced.
Further, the controller module may employ a distributed architecture to improve the scalability and reliability of the system.
Benefiting from the distributed architecture, the scalability and reliability of the system are improved, meeting the requirements of large-scale model applications.
The invention also provides a model application deployment method, comprising the following steps:
deploying a model to an inference engine module;
registering the inference engine module with the controller module;
the user sends a model inference request to the controller module;
the task scheduling unit of the controller module selects a suitable inference engine module according to the load conditions of the inference engine modules;
the task receiving unit of the controller module sends the model inference request to the selected inference engine module;
the inference engine module executes the model inference task and generates a model inference result;
the inference engine module returns the model inference result to the controller module;
the task return unit of the controller module returns the model inference result to the user.
Further, the models include large language models and general models.
Further, the inference engine modules include a large language model inference engine module (LLM-Worker) and a general model inference engine module (General-Worker).
Further, the task scheduling unit of the controller module may select a suitable inference engine module using a load balancing algorithm.
Further, the model management unit of the controller module may update or unload an inference engine module according to evaluation results.
The invention also provides a front-end page of the model application deployment system and a visualized model application deployment method, comprising the following steps:
providing a model list interface at the front-end module of the model application deployment system, wherein the model list interface includes information on deployable models;
in response to the user's settings on the model list interface, obtaining the information of the model to be deployed, matching corresponding hardware resources, and setting the deployment method based on the load-balancing strategy of the task scheduling unit (i.e., the schedule object unit), thereby completing the configuration of the model deployment information;
in response to the user's settings on the deployment model configuration interface, transmitting the model deployment information to the controller module, which schedules a corresponding worker to complete the model application deployment according to the received information of the model to be deployed.
By means of this technical scheme, a user can complete the deployment of a model application through an intuitive graphical interface without deep knowledge of the underlying technical details. The visual design greatly reduces the user's learning cost and operational difficulty and improves working efficiency.
Optionally, in addition to matching the corresponding hardware resources, other parameters required for the model configuration may be selected.
By means of this technical scheme, a user can flexibly select matching hardware resources during model deployment and configure other required parameters according to actual needs. This flexibility and customizability better satisfy the needs of different users and improve the applicability and flexibility of the system.
Optionally, after the model deployment is completed, the API interface corresponding to the deployed model can be called directly to access the model.
By means of this technical scheme, a user can access the deployed model directly through a simple API call, without complex configuration and operation flows. This convenient access accelerates application development and deployment and improves the usability and efficiency of the system.
Optionally, after the model deployment is completed, test parameters can be selected and test content entered on a model test interface to perform a visual test of the deployed model.
By means of this technical scheme, a user can select suitable test parameters through the model test interface, input test content, and visually test the deployed model. This visual testing intuitively displays the performance and effect of the model and helps the user better understand its behavior.
The invention provides a model application deployment system and method and a front-end page, which manage and schedule different types of models uniformly and improve management efficiency and resource utilization. Unified management of multiple model types is supported, reducing management complexity and maintenance cost. Multi-instance deployment and personalized load balancing are supported, improving resource utilization and inference performance. The invention not only solves prior-art problems such as fragmented inference frameworks, difficult multi-type model management, and insufficient load balancing, but also provides an efficient, unified, and personalized large language model application deployment system and method with strong practicability and innovation.
The conception, specific structure, and technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, features, and effects of the present invention.
Drawings
FIG. 1 illustrates a model application deployment method according to a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of a model application deployment system in accordance with a preferred embodiment of the present invention;
FIG. 3 is a diagram of a model application deployment system front-end module model list interface in accordance with a preferred embodiment of the present invention;
FIG. 4 is a schematic diagram of a model application deployment system front-end module deployment model configuration interface in accordance with a preferred embodiment of the present invention;
FIG. 5 is a schematic diagram of a model application deployment system front-end module model test interface in accordance with a preferred embodiment of the present invention.
Detailed Description
The following description of the preferred embodiments of the present invention refers to the accompanying drawings so that the technical content becomes clearer and easier to understand. The present invention may be embodied in many different forms, and its scope of protection is not limited to the embodiments described herein. Example embodiments are described more fully with reference to the accompanying drawings; however, they may be embodied in many forms and should not be construed as limited to the examples set forth here. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
The flow diagrams depicted in the figures are exemplary only and need not include all steps. For example, some steps may be decomposed while others may be combined or partially combined, so the actual order of execution may change according to the actual situation.
In the drawings, identical structural elements are indicated by the same reference numerals, and components with similar structures or functions are indicated by similar reference numerals. The dimensions and thicknesses of the components shown in the drawings are depicted arbitrarily; the present invention does not limit them. For clarity of illustration, the thickness of components is exaggerated in some places.
The invention provides a model application deployment system, comprising:
one or more inference engine modules for executing model inference tasks and generating model inference results;
and a controller module for managing and scheduling the inference engine modules.
In this embodiment, the invention provides a controller module plus inference engine module architecture, that is, a controller-worker pattern: inference engine modules (workers) of corresponding types are implemented for the different large-model inference frameworks, the different workers are then managed uniformly by the controller module (controller), and the controller exposes a unified external interface.
Benefiting from the proposed model application deployment system, different types of models can be managed and scheduled uniformly, improving management efficiency and resource utilization.
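For illustration only, the controller-worker pattern described above can be sketched in Python roughly as follows; every identifier here (Controller, register, generate, do_work, mapping) is an assumption chosen for exposition, not a name taken from the patent.

    class Controller:
        """Manages and schedules registered workers behind one unified interface."""

        def __init__(self, load_balancer):
            self.workers = []                    # list maintained by the register interface
            self.load_balancer = load_balancer   # pluggable scheduling strategy

        def register(self, worker):
            # Register interface: each worker instance announces itself here,
            # so multiple instances can be maintained simultaneously.
            self.workers.append(worker)

        def generate(self, request: dict):
            # Unified external interface: choose a worker by load condition,
            # delegate the task, and return the generated content.
            worker = self.load_balancer.mapping(self.workers, request)
            return worker.do_work(**request)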
Optionally, the controller module includes:
a task receiving unit for receiving model inference requests from users;
a task scheduling unit for selecting a suitable inference engine module to execute a model inference task according to the load conditions of the inference engine modules;
and a task return unit for returning model inference results to users.
In this embodiment, the controller module includes:
the task receiving unit, i.e., the generate interface unit, which is the unified externally exposed interface that receives calls from users or the front end. The generate interface receives a model inference request from a user outside the system, selects a suitable inference engine module (worker) according to the content of the request, and then calls the generate interface of the corresponding worker to produce and return the content.
the task registration unit, i.e., the register interface unit, which receives registration information from the inference engine modules (workers). This interface also maintains a list storing worker-related information, so that instances of multiple workers can be maintained simultaneously.
the task scheduling unit, i.e., the schedule object unit, which performs load balancing: it selects a suitable inference engine module to execute the model inference task according to the load conditions of the inference engine modules, calls the scheduling method of the corresponding worker object, selects a suitable worker, and sends the content-generation request. The task scheduling unit is a pluggable load-balancing module that contains common load-balancing algorithms (such as round robin, random selection, and weighted round robin) as well as personalized scheduling strategies specific to large models, supports the dynamic addition of user-designed algorithms, and implements an interface supporting multi-instance deployment.
In this embodiment, the task scheduling unit is configured as a pluggable load-balancing module, specifically a LoadBalance abstract class containing a mapping method; the logic that calls the mapping method to select an inference engine module (worker) is implemented in the task scheduling unit of the controller module. In the system's load-balancing strategy, instances of different inference engine modules are selected according to the token count of the large model: the number of tokens the model is expected to output is declared together with the input, and the instance with the smallest token count is selected. The LoadBalance abstract class can be custom-implemented by users according to their specific needs.
Benefiting from the cooperation of the controller module's functional units, multi-instance deployment and personalized load balancing are achieved, improving resource utilization and inference performance.
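A minimal sketch of the pluggable load-balancing unit might look as follows. The class name LoadBalance and its mapping method are taken from the embodiment above; the concrete strategies (random selection and a token-count policy) are hedged reconstructions of the algorithms the text mentions, not code from the patent.

    import random
    from abc import ABC, abstractmethod

    class LoadBalance(ABC):
        @abstractmethod
        def mapping(self, workers: list, request: dict):
            """Select the worker that should serve this request."""

    class RandomLoadBalance(LoadBalance):
        # One of the common algorithms mentioned above (random selection).
        def mapping(self, workers, request):
            return random.choice(workers)

    class TokenCountLoadBalance(LoadBalance):
        # Personalized large-model strategy: the request declares the number of
        # tokens it expects to output, and the instance with the fewest pending
        # tokens is selected.
        def __init__(self):
            self.pending = {}  # id(worker) -> tokens currently queued on it

        def mapping(self, workers, request):
            worker = min(workers, key=lambda w: self.pending.get(id(w), 0))
            self.pending[id(worker)] = (
                self.pending.get(id(worker), 0) + request.get("max_tokens", 0)
            )
            return worker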
Optionally, the load-balancing algorithm of the task scheduling unit can be adjusted dynamically according to the real-time system load, ensuring balanced load across the inference engine modules and optimal resource utilization. In addition, an adaptive regulation strategy is introduced that automatically optimizes according to the system's running state and performance indices, further improving the performance and stability of the system.
Optionally, the inference engine module (worker) includes:
a model loading unit that implements the model-loading init method, which receives the path of the inference-framework model as a parameter and calls the loading function of the corresponding inference framework to finish loading the model; parameters for the different inference-framework models are passed from the worker to the framework transparently (pass-through), as sketched below;
a content generating unit that implements the content-generating do_work method, which is called by the controller module; parameters for the different inference-framework models are likewise passed through transparently, and the content-generation function of the corresponding inference framework is called to generate the content.
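The two worker-side units just described can be sketched as a shared base class. The init/do_work naming follows this embodiment; the HTTP route and payload shape below are assumptions made for illustration.

    import requests  # assumed transport; the patent does not fix one

    class BaseWorker:
        """Illustrative worker base: init loads the model and registers with
        the controller; do_work passes framework parameters through."""

        def init(self, model_path, controller_url=None, **framework_kwargs):
            # Model loading unit: delegate to the concrete framework's loader,
            # passing framework-specific parameters through untouched.
            self.model = self.load_model(model_path, **framework_kwargs)
            if controller_url:
                # Task registration: announce this instance to the controller's
                # register interface ("/register" is an assumed route).
                requests.post(f"{controller_url}/register",
                              json={"worker_type": type(self).__name__})

        def load_model(self, model_path, **framework_kwargs):
            raise NotImplementedError  # each concrete worker calls its framework's loader

        def do_work(self, **generation_kwargs):
            raise NotImplementedError  # content generating unit, parameters passed through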
In this embodiment, a vLLM inference engine module worker (vllm-worker) is implemented, taking the vLLM inference framework as an example:
first, the init function of the vllm-worker's model loading unit is called; this function initializes vLLM's engine, AsyncLLMEngine, from the model address;
then, the init function of the model loading unit calls the register interface of the controller module's task registration unit to register the vllm-worker, using the controller module's address;
finally, the controller module implements and externally exposes the generate method of the vllm-worker's content generating unit; this method mainly calls the generate method of AsyncLLMEngine to complete content generation.
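As a concrete sketch of this vllm-worker, assuming a recent vLLM release in which AsyncLLMEngine, AsyncEngineArgs, and SamplingParams are importable from the top-level package (exact signatures vary between vLLM versions, so this is a sketch, not a definitive implementation):

    import uuid
    from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

    class VllmWorker(BaseWorker):  # BaseWorker from the sketch above
        def load_model(self, model_path, **framework_kwargs):
            # init step 1: build vLLM's asynchronous engine from the model address.
            return AsyncLLMEngine.from_engine_args(
                AsyncEngineArgs(model=model_path, **framework_kwargs))

        async def do_work(self, prompt, **sampling_kwargs):
            # generate step: delegate to AsyncLLMEngine.generate and return
            # the final text once the autoregressive loop finishes.
            params = SamplingParams(**sampling_kwargs)
            final = None
            async for output in self.model.generate(prompt, params, str(uuid.uuid4())):
                final = output
            return final.outputs[0].text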
Optionally, the inference engine modules include:
one or more large language model inference engine modules (LLM-Worker) for executing large language model inference tasks;
one or more general model inference engine modules (General-Worker) for executing general model inference tasks. In this embodiment, the general model inference engine module (General-Worker) serves as the worker for models that are not large models, such as embedding and RLHF models, facilitating unified management and deployment by the controller.
In this embodiment, taking an embedding inference framework as an example, a general-worker for embeddings, i.e., an embedding-worker, is implemented:
first, the paraphrase-multilingual-mpnet-base-v2 model is selected, and the sentence_transformers framework is chosen to load it;
then, the init function of the model loading unit likewise calls the register interface of the controller module's task registration unit to register the embedding-worker, using the controller module's address;
finally, the do_work method of the content generating unit realizes the embedding function by calling the encode method of sentence_transformers and returns the result.
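A corresponding sketch of the embedding-worker, using the sentence_transformers API (SentenceTransformer and its encode method) around the illustrative BaseWorker from above:

    from sentence_transformers import SentenceTransformer

    class EmbeddingWorker(BaseWorker):
        def load_model(self, model_path=None, **framework_kwargs):
            # The embodiment selects paraphrase-multilingual-mpnet-base-v2.
            return SentenceTransformer(
                model_path or "paraphrase-multilingual-mpnet-base-v2")

        def do_work(self, sentences, **encode_kwargs):
            # Embedding is realized by calling sentence_transformers' encode method.
            return self.model.encode(sentences, **encode_kwargs).tolist()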
Benefiting from the inference engine modules' support for different types of models, unified management of multiple model types is realized, so that a system architecture genuinely usable in production can be built quickly, reducing management complexity and maintenance cost and improving the efficiency of building large-model applications.
Optionally, the model loading unit and the content generating unit of the inference engine module adopt asynchronous loading and parallel computing, improving model loading speed and inference efficiency.
Optionally, the inference engine module adopts a model caching mechanism that caches commonly used models and parameters, reducing the time cost of model loading and initialization.
Optionally, the model application deployment system further comprises:
a model storage unit for storing the inference engines to be deployed. In this embodiment, the inference engines to be deployed include deep-learning-based large language models and non-large models; various storage technologies, such as HDFS, OSS, or S3, may be used to store the models.
a model management unit for managing model deployment, update, and unloading. In this embodiment, the model management unit uses front-end techniques to visualize the entire model application deployment system, providing a visual interface that enables automatic one-click deployment of inference models.
Optionally, the model management unit also provides model version management, supporting version control and rollback of models so that model update and rollback requirements can be handled in time. In addition, it includes model performance monitoring and anomaly detection, monitoring the running state and performance of models in real time and discovering and handling abnormal conditions promptly, guaranteeing the stability and reliability of the system.
Benefiting from the proposed model storage and management units, the storage and management efficiency of system models is improved and storage and management costs are reduced.
Optionally, the controller module may employ a distributed architecture to improve the scalability and reliability of the system. In this embodiment, the distributed architecture of the controller module may be based on Kubernetes, Mesos, YARN, or the like.
Benefiting from the distributed architecture, the scalability and reliability of the system are improved, meeting the requirements of large-scale model applications.
Optionally, the inference engine modules may be deployed on different hardware platforms; depending on the computational requirements of the model, a CPU, GPU, or FPGA may be chosen. For models with high computing-resource requirements, a hardware platform such as a GPU or FPGA can be adopted.
As shown in FIG. 1, the model application deployment method provided by the invention comprises the following steps:
deploying a model to an inference engine module;
registering the inference engine module with the controller module;
the user sends a model inference request to the controller module;
the task scheduling unit of the controller module selects a suitable inference engine module according to the load conditions of the inference engine modules;
the task receiving unit of the controller module sends the model inference request to the selected inference engine module;
the inference engine module executes the model inference task and generates a model inference result;
the inference engine module returns the model inference result to the controller module;
the task return unit of the controller module returns the model inference result to the user.
In this embodiment, the model application deployment method includes the following steps:
deploying the model to an inference engine module (worker);
registering the worker at the register interface of the task registration unit in the controller module;
the user sends a model inference request to the controller module;
the schedule object of the controller module's task scheduling unit selects a suitable inference engine module according to the load conditions of the workers;
the task receiving unit of the controller module sends the model inference request to the selected worker;
the content generating unit of the worker implements the content-generating do_work method, executes the model inference task, and generates a model inference result;
the worker returns the model inference result to the controller module;
the task return unit of the controller module returns the model inference result to the user.
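The sketches above can be wired together in-process to mirror these steps; in the deployed system, registration and request dispatch would happen over the network. All names remain illustrative.

    # Illustrative end-to-end wiring of the earlier sketches.
    controller = Controller(TokenCountLoadBalance())
    worker = EmbeddingWorker()
    worker.model = worker.load_model()      # load step, done locally here
    controller.register(worker)             # register step, in-process here
    result = controller.generate({"sentences": ["hello world"]})  # user request
    print(len(result[0]))                   # dimensionality of the embedding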
Optionally, the models include large language models and general models.
Optionally, the inference engine modules include a large language model inference engine module (LLM-Worker) and a general model inference engine module (General-Worker).
Optionally, the task scheduling unit of the controller module may select a suitable inference engine module using a load balancing algorithm.
Optionally, the model management unit of the controller module may update or unload an inference engine module according to evaluation results.
As shown in FIG. 2, the invention provides a model application deployment system, comprising:
inference engine modules (model workers), including general model inference engine modules (general workers) and large language model inference engine modules (llm workers); there may be one or more of them, and they execute model inference tasks and generate model inference results.
and a controller module for managing and scheduling the inference engine modules.
The general model inference engine module (general worker) adapts workers for non-large models, such as a similarity worker module, an embedding worker module, and an RLHF worker module, realizing unified deployment and management of various small models; the large language model inference engine module (llm worker) adapts workers for large models, such as a vLLM worker, a TGI worker, and an HF worker, realizing unified deployment and management of various inference architectures.
Optionally, the deployment system further comprises a front-end module for visually completing the deployment of the model application.
As shown in FIGS. 3-5, the front-end page of the model application deployment system provided by the invention implements a visualized model application deployment method comprising the following steps:
the front-end module of the model application deployment system provides a model list interface, as shown in FIG. 3, which includes information on deployable models, such as model identifier, deployment method, model name, model path, server, and port. In this embodiment, part of the information is hidden in the figure; for example, the path, server, and port information corresponding to a model may be configured and displayed according to actual needs, and the invention is not particularly limited in this respect.
in response to the user's settings on the model list interface, the information of the model to be deployed is obtained, corresponding hardware resources (such as GPU information) are matched, and the deployment method is set based on the load-balancing strategy of the task scheduling unit (the schedule object unit), completing the configuration of the model deployment information, as shown in the deployment model configuration interface of FIG. 4.
in response to the user's settings on the deployment model configuration interface, the model deployment information is transmitted to the controller module, and the controller module schedules a corresponding worker to complete the model application deployment according to the received information of the model to be deployed.
In this embodiment, the deployment method realized through the front-end page solves the design and implementation problems of the front-end page of a model application deployment system. Conventional model deployment typically requires configuration and operation through command lines or programming interfaces, which may not be friendly enough for non-technical users to understand and operate. A visual front-end page for completing model application deployment therefore improves usability and operational convenience. By means of this technical scheme, a user can complete the deployment of a model application through an intuitive graphical interface without deep knowledge of the underlying technical details, greatly reducing the learning cost and operational difficulty and improving working efficiency.
Optionally, in addition to matching the corresponding hardware resources, other parameters required for the model configuration may be selected.
In this embodiment, this solves the problem of hardware-resource matching and other parameter settings during model configuration. When deploying a model application, suitable resources usually need to be selected according to different hardware environments, and additional parameters may need to be configured to meet specific requirements. By means of this technical scheme, a user can flexibly select matching hardware resources during model deployment and configure other required parameters according to actual needs. This flexibility and customizability better satisfy the needs of different users and improve the applicability and flexibility of the system.
Optionally, after the model deployment is completed, the API interface corresponding to the deployed model can be called directly to access the model.
In this embodiment, this solves the problem of conveniently invoking the model after it has been deployed. In practice, a user may need to access the deployed model through an API interface for inference tasks or other operations. By means of this technical scheme, a user can access the deployed model directly through a simple API call, without complex configuration and operation flows. This convenient access accelerates application development and deployment and improves the usability and efficiency of the system.
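As an illustration of such an API call, a hypothetical HTTP request might look as follows; the host, port, endpoint path, and payload fields are all assumptions, since the patent only states that the deployed model is reachable through an API interface.

    import requests

    resp = requests.post(
        "http://controller-host:8000/generate",   # assumed unified generate interface
        json={"model": "my-llm", "prompt": "Hello", "max_tokens": 64},
        timeout=60,
    )
    print(resp.json())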
Optionally, after the model deployment is completed, as shown in the model test interface of FIG. 5, a visual test of the deployed model can be performed by selecting test parameters and entering test content at the front end.
In this embodiment, this solves the problem of performing effective tests to verify model performance after deployment. In practice, a user may need to test the deployed model and analyze and evaluate the test results. By means of this technical scheme, a user can select suitable test parameters through the model test interface, input test content, and visually test the deployed model. This visual testing intuitively displays the performance and effect of the model and helps the user better understand its behavior.
The above embodiments may be combined to obtain further embodiments. The division into embodiments in the present application is only for convenience of description and should not be construed as a specific limitation; the contents of the embodiments may be combined with one another in any logically consistent manner.
The foregoing describes the preferred embodiments of the present invention in detail. It should be understood that numerous modifications and variations can be made by those of ordinary skill in the art in accordance with the concept of the invention without creative effort. Therefore, all technical solutions that can be obtained by a person skilled in the art through logical analysis, reasoning, or limited experiments based on the prior art and the inventive concept shall fall within the scope of protection defined by the claims.

Claims (10)

1. A model application deployment system, comprising:
one or more inference engine modules for executing model inference tasks and generating model inference results;
and a controller module for managing and scheduling the inference engine modules.
2. The model application deployment system of claim 1, wherein the inference engine modules comprise:
one or more large language model inference engine modules (LLM-Worker) for executing large language model inference tasks;
one or more general model inference engine modules (General-Worker) for executing general model inference tasks.
3. The model application deployment system of claim 1, wherein the controller module comprises:
a task receiving unit for receiving model inference requests from users outside the system;
a task scheduling unit for selecting a suitable inference engine module to execute a model inference task according to the load conditions of the inference engine modules;
and a task registration unit for receiving registration information of the inference engine modules and maintaining the list storing inference engine module information.
4. The model application deployment system of claim 1, further comprising:
a model storage unit for storing inference engines to be deployed;
and a model management unit for configuring and managing the deployment, update, and unloading of inference engines.
5. The model application deployment system of claim 1, wherein the controller module may employ a distributed architecture to improve system scalability and reliability.
6. A model application deployment method, comprising the following steps:
deploying a model to an inference engine module;
registering the inference engine module with the controller module;
the user sends a model inference request to the controller module;
the task scheduling unit of the controller module selects a suitable inference engine module according to the load conditions of the inference engine modules;
the task receiving unit of the controller module sends the model inference request to the selected inference engine module;
the inference engine module executes the model inference task and generates a model inference result;
the inference engine module returns the model inference result to the controller module;
the task return unit of the controller module returns the model inference result to the user.
7. The model application deployment method of claim 6, wherein the models comprise large language models and general models.
8. The model application deployment method of claim 6, wherein the inference engine modules comprise a large language model inference engine module (LLM-Worker) and a general model inference engine module (General-Worker).
9. The model application deployment method of claim 6, wherein the task scheduling unit of the controller module can select a suitable inference engine module using a load balancing algorithm.
10. The model application deployment method of claim 6, wherein the model management unit of the controller module can update or unload an inference engine module according to evaluation results.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410186209.5A CN118034712A (en) 2024-02-19 2024-02-19 Model application deployment system and method


Publications (1)

Publication Number Publication Date
CN118034712A 2024-05-14

Family

ID=90994568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410186209.5A Pending CN118034712A (en) 2024-02-19 2024-02-19 Model application deployment system and method

Country Status (1)

Country Link
CN (1) CN118034712A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402906A (en) * 2020-03-06 2020-07-10 深圳前海微众银行股份有限公司 Speech decoding method, apparatus, engine and storage medium
EP3926531A1 (en) * 2020-06-17 2021-12-22 Tata Consultancy Services Limited Method and system for visio-linguistic understanding using contextual language model reasoners
CN116204282A (en) * 2022-09-16 2023-06-02 上海可深信息科技有限公司 Deep learning algorithm model reasoning scheduling engine architecture and method
CN117370552A (en) * 2023-09-14 2024-01-09 新华三人工智能科技有限公司 Model selection method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination