CN117196000A - Edge side model reasoning acceleration method for containerized deployment - Google Patents
- Publication number
- CN117196000A (application CN202311201460.6A)
- Authority
- CN
- China
- Prior art keywords
- model
- data
- conversion
- reasoning
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application relates to the technical field of computers, and in particular to a method for accelerating reasoning of a containerized-deployment edge side model, characterized by comprising the following steps. S1: the training model is converted according to the format model and output as a conversion model; a model network is established according to the structure of the conversion model; and quantization and pruning operations are applied to the conversion model of the model network to output a model structure. S2: data preprocessing is completed on the model structure through data type conversion and data format conversion. S3: an initialization model is generated by initializing the model structure, and a processing model and processed data are obtained after processing the initialization model to complete the inference engine model. S4: the inference acceleration service is deployed through a containerization tool such that the model structure, the data preprocessing, and the inference engine model execute in containers. The method addresses the problems of precision loss, a monolithic reasoning flow, deployment complexity, and the performance degradation that may arise in multi-modal scenarios.
Description
Technical Field
The application relates to the technical field of computers, in particular to a method for accelerating reasoning of an edge side model of containerized deployment.
Background
With the rapid development of 5G and the Internet of Things, traditional cloud computing architectures face challenges such as high computing demands and the high delay brought by mass data transmission when performing complex model reasoning, which affects service real-time performance and user experience; cloud model reasoning may also raise data security concerns. An edge service architecture offers real-time performance, data privacy protection, and security, relieves cloud pressure, saves network bandwidth, and is suited to application scenarios in the communication industry. The architecture helps to improve system performance and user experience, and provides support for innovation in various fields.
The existing traditional model reasoning acceleration frameworks have the following problems. Loss of precision: some acceleration techniques (such as quantization and pruning) may introduce a loss of accuracy during the inference process, affecting the performance and accuracy of the model. Monolithic reasoning flow: the degree of modularization in the model reasoning process is low and the coupling between tasks is high, so CPU and GPU utilization is insufficient; in addition, the server's memory strategy lags behind, wasting memory resources and slowing down reasoning. Deployment complexity: applying a model reasoning acceleration framework to a real scenario may require complex tasks such as model deployment, integration, and tuning, which demand expertise and time. Dynamic scene adaptation: in real-time dynamic scenarios, model reasoning must adapt to changing input data, yet some acceleration techniques may cause performance degradation in multi-modal scenarios.
Disclosure of Invention
The deployment of the model on the edge service architecture can provide real-time data processing and decision making capabilities, thereby overcoming the above-mentioned computing resource limitations, network delays and data privacy problems in traditional cloud computing.
The application aims to solve the defects in the prior art, and provides a method for accelerating reasoning of an edge side model of containerized deployment, which comprises the following steps:
S1: converting a training model according to a format model, outputting a conversion model, establishing a model network according to the structure of the conversion model, and outputting a model structure by carrying out quantization and pruning operations on the conversion model of the model network;
S2: finishing data preprocessing on the model structure through data type conversion and data format conversion;
S3: generating an initialization model by initializing the model structure, and obtaining a processing model and processed data after processing the initialization model to complete an inference engine model;
S4: deploying an inference acceleration service through a containerization tool such that the model structure, the data preprocessing, and the inference engine model execute in the container;
further, the conversion model further includes:
the training model is converted according to the format of the format model, and the format of the conversion model is the format of the format model.
Further, the step S1 further includes:
the model structure is used to verify whether the output model structure, input information and output information are correct by using a verification tool.
Further, the model network further comprises:
s21: analyzing the structure of the conversion model, checking the input information and the output information of the conversion model, and creating an input node and an output node in the model network;
s22: and establishing a network layer and connection in the model network according to the structure of the conversion model, and ensuring the correct transmission of the input information and the output information.
Further, the establishing of the network layer and the connection further includes:
and setting the network layer parameters at the network layer by analyzing the parameter values in the conversion model, and connecting the network layers according to the connection relation of the conversion model.
Further, the quantization operation further includes:
Quantization pseudo nodes are inserted at the input nodes and output nodes of the model network; the weights and activation values of the model network are quantized; after quantized reasoning is performed on the quantization model, an inverse-quantization pseudo node is inserted; and the quantization model is finally output.
Further, the pruning operation further includes:
The quantized model is pruned, the redundant parameters produced by the pruning operation are removed, and the pruned model is fine-tuned before the model structure is output.
Further, the data preprocessing further includes:
and converting the data of the model structure according to the data format of the format model, and transmitting the converted data to the reasoning engine model as input information.
Further, the processing of the initialization model further includes:
The input requests are traversed and the input text is obtained from each request; the Pipeline's data preprocessing is performed; forward processing of the Pipeline is carried out using a text model, which mainly encodes and decodes the text; and data format conversion is applied to the data generated by the encoding and decoding operations in the Pipeline's data post-processing step, which yields the processed data.
Further, the deployment of the inference acceleration service by the containerization tool further includes:
The model structure, the data preprocessing, and the reasoning engine model are placed into the container; the container is configured, including the image, a monitoring system, and a logging mechanism; the requesting client is established; and communication and data sharing among containers are realized through container configuration.
Compared with the prior art, the application has the beneficial effects that:
(1) Precision assurance: the acceleration technique focuses on precise quantization of model reasoning parameters and reduces pruning of the model structure, ensuring that the model's precision loss remains within an acceptable range.
(2) Pipeline reasoning is introduced: by introducing pipeline reasoning, the reasoning process is divided into several stages, reducing the coupling between the modules of model reasoning. This optimizes cache utilization during reasoning while enabling unified management of multi-modal requests.
(3) Containerized deployment: deploying the reasoning service with a container tool lowers the barrier for model reasoning practitioners and improves the flexibility of deployment and load balancing; it is particularly suitable for heterogeneous server environments in edge scenarios.
(4) Multi-modal support: the patent supports multi-modal models and shows leading performance under multi-modal model reasoning acceleration test conditions; this capability excels at meeting user reasoning requirements and improving data processing capacity.
Drawings
The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate the application and together with the embodiments of the application, serve to explain the application.
In the drawings:
FIG. 1 is a model training and reasoning flow diagram of a containerized deployment edge-side model reasoning acceleration method of the present application;
FIG. 2 is a flow chart of model inference acceleration of a method for edge-side model inference acceleration of containerized deployment of the present application;
FIG. 3 is a schematic service deployment scenario diagram of a method for accelerating reasoning of an edge side model of containerized deployment;
FIG. 4 is a block diagram of the storage of the inference files and configuration files in the repository of the edge-side model inference acceleration method of the containerized deployment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Technical terms related to the embodiment of the application are defined as follows:
1. Encoder: a deep convolutional neural network (CNN) is typically employed as the encoder, mapping an input image or feature map into a latent space.
2. Diffusion Process: this is the core of the model, and a diffusion process is employed to gradually generate images.
3. Decoder: the decoder is also typically a deep CNN, responsible for restoring the representation in the latent space to the generated image. During diffusion, the decoder generates progressively improved images.
4. Loss Function: models are typically trained using a generative adversarial loss (GAN loss) or another suitable loss function, which helps ensure similarity between the generated image and the real image.
5. Conditioning: in some cases, the model may introduce conditional information, such as a textual description or other additional information, to guide the process of image generation.
6. Temporal Scheduling: the time step selection and scheduling process in the model determines the manner in which each step in the diffusion process generates the image, including the introduction and adjustment of noise.
7. Docker: a containerization platform that packages applications, libraries, and dependencies into one standardized unit using container technology. Each Docker container runs in an isolated environment, avoiding conflicts and interference between applications. A Docker container behaves consistently across environments, making applications easier to migrate between development, testing, and production. Kubernetes (commonly abbreviated K8s) is an open-source container orchestration and management platform for automating the deployment, scaling, management, and operation of containers. Kubernetes provides a service discovery mechanism and load balancing; it can automatically detect and replace failed container instances, ensuring high availability of the application; and it automatically scales container instances according to load to meet different traffic demands. Docker and Kubernetes are commonly used together: Docker creates container images, and Kubernetes deploys and manages them.
8. Common multi-modal models can be categorized by structure and fusion strategy. By structure they can be classified into Early Fusion, Late Fusion, Transformer-Based, and so on; by fusion strategy they can be divided into Concatenation, Summation, Weighted Average, Attention-Based, and Cross-Modal Representation, among others. These are only common approaches; in practice, many multi-modal models combine and innovate across different structures and fusion strategies to meet the requirements of a particular task.
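The Docker-plus-Kubernetes workflow described in term 7 can be illustrated with a hypothetical Dockerfile for packaging the reasoning service (base image, file names, and port are assumptions, not drawn from the patent):

```dockerfile
# Package the reasoning service into a standardized, isolated unit.
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Model structure, preprocessing code, and inference engine all live in the
# image, so the container behaves identically in development, testing, and
# production environments.
COPY model/ ./model/
COPY serve.py .

EXPOSE 8000
CMD ["python", "serve.py"]
```

Kubernetes would then deploy replicas of the built image, for example with a command such as `kubectl create deployment inference --image=inference:latest --replicas=3` (illustrative), with a Service providing discovery and load balancing across the replicas.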
Examples
Training of a deep learning model is an iterative process aimed at adjusting the model's parameters with data so that it can make accurate predictions on that data. Reasoning with a deep learning model is the process of applying the trained model to new data to generate prediction results.
Referring to fig. 1, the model training process and the reasoning process differ as follows:
the purposes are different: the training process aims at adjusting model parameters through data to minimize loss, so that the model can be learned from the data; the reasoning process is to apply the trained model to new data to generate a prediction result.
Data use: the training process uses the training data to adjust model parameters; the inference flow uses the new data to generate predictions.
Back propagation: the training flow involves back propagation to calculate gradients and update model parameters; the inference flow involves only forward propagation and does not require back propagation.
Parameter status: in the training process, model parameters are continuously updated to gradually optimize the model performance; in the reasoning process, the model parameters remain unchanged.
Calculation overhead: training processes are typically more time consuming and computationally intensive than reasoning processes because training involves an iterative process of extensive data and parameter updates.
According to the technical scheme of the containerized-deployment edge side model reasoning acceleration method, deploying the model on an edge service architecture can provide real-time data processing and decision-making capability, thereby overcoming the problems of computing resource limitation, network delay, and data privacy in traditional cloud computing described above.
Referring to fig. 2, in order to solve the drawbacks of the prior art, the present application provides a method for accelerating edge side model reasoning of containerized deployment, comprising the following steps:
S1: the training model is converted according to the format model and output as a conversion model; a model network is established according to the structure of the conversion model; and quantization and pruning operations are carried out on the conversion model of the model network to output a model structure. Specifically, in this embodiment, the format model is the original framework, which includes frameworks such as PyTorch, TensorFlow, and MXNet; the following description mainly uses the PyTorch framework;
S2: data preprocessing is finished on the model structure through data type conversion and data format conversion;
S3: an initialization model is generated by initializing the model structure, and a processing model and processed data are obtained after processing the initialization model to complete the inference engine model;
S4: the inference acceleration service is deployed through a containerization tool such that the model structure, the data preprocessing, and the inference engine model execute in containers.
Further, the conversion model further includes:
the training model is converted according to the format of the format model, the format of the conversion model is the format of the format model, specifically, in this embodiment, in the training model import stage, the training model is first exported from the Pytorch model and converted into the ONNX format, the calculation map of the training model is converted into the ONNX static map containing the calculation and operation of the training model, and the conversion process uses the export tool or library provided by the Pytorch model.
Preferably, the PyTorch model is exported using the torch.onnx.export() function, which requires the following to be specified: the input model; an input example, i.e., an example tensor containing input data, used to determine the shape and data type of the training model's input; and the file name under which the exported ONNX model is stored, the file containing the model structure, weights, and operations.
More preferably, the ONNX specification supports various operations, but not all PyTorch operations can be directly converted to ONNX operations. During export, PyTorch attempts to map the model's operations to ONNX operations; if unsupported operations are encountered, operation conversion or custom operations may be required. In addition, PyTorch optimizes the training model at the graph level, for example by fusing neighboring operations and reducing redundant operations. These optimizations aim to reduce the size and complexity of the converted ONNX model.
Further, step S1 further includes:
the model structure is used to verify whether the output model structure and the input information and the output information are correct by using a verification tool, specifically, in this embodiment, after the training model is converted into the conversion model, the derived ONNX model may be verified whether the model is correct by using a verification tool in the ONNX tool library, including checking the model structure, the input information and the output information.
Further, the model network further comprises:
s21: analyzing the structure of the conversion model, checking the input information and the output information of the conversion model, and creating an input node and an output node in a model network;
s22: and establishing a network layer and connection in a model network according to the structure of the conversion model, and ensuring the correct transmission of the input information and the output information.
Further, the establishing of the network layer and the connection further includes:
The network layer parameters are set at the network layer by parsing the parameter values in the conversion model, and the network layers are connected according to the connection relations of the conversion model. Specifically, in this embodiment, the structure of the ONNX model is parsed and the information of the model's input and output nodes is checked; each node contains name, data type, and dimension information. Corresponding input and output nodes are created in the model network according to this information, so that the model network can correctly accept input data and generate output results, and a normative configuration file is generated from this information. Which layers the model contains is determined from the structure of the ONNX model, and the corresponding network layers and connections are created in the model network accordingly.
More preferably, the parameter information in the ONNX model includes the weights, convolution kernels, and biases of the respective layers. The parameters are set into the corresponding network layers by parsing the parameter values in the ONNX model and applying them to the created layers. In addition, the network layers are connected according to the connection information of the ONNX model, in the correct connection relations, to ensure correct information transfer.
Further, the quantization operation further includes:
the quantization pseudo node is inserted into the input node and the output node of the model network, the weight and the activation value of the model network are quantized, the inverse quantization pseudo node is inserted into the model network after the quantization reasoning is performed on the quantization model, and the quantization model is finally output. Introducing a model constructed according to the structure of the ONNX model; inserting a quantization pseudo node between an input node and an output node of the model network using torch. Setting model qconfig to select quantization configuration; performing a normal training process; after training is completed, using torch. Quantification. Conversion to quantify the weights and activation values in the network model; converting the model into a torch script using torch.jit.script to support quantitative reasoning; carrying out quantization reasoning, and carrying out model quantization calculation on input data to obtain quantized output; inserting an inverse quantization pseudo node by using the torch. Quantization. DeQuantStub (), and inversely quantizing the quantized output into the original floating point number output; and (5) carrying out a normal reasoning process to obtain the final quantized model output.
Further, the pruning operation further includes:
The quantized model is pruned, the redundant parameters produced by the pruning operation are removed, and the pruned model is fine-tuned before the model structure is output. Specifically, in this embodiment, model pruning reduces the model's size and computation by reducing redundant parameters and connections in the model. The quantized model is imported; a pruning strategy is defined using functions in the torch.nn.utils.prune module, including the pruning scope (selecting the target networks and corresponding parameters), the pruning method, and the pruning proportion; pruning is executed, and the prune.remove function is called to remove the redundant parameters produced by the pruning operation; after pruning, the model is fine-tuned to recover model performance.
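Magnitude-based unstructured pruning, the removal of low-magnitude redundant parameters described above, can be sketched without the torch API (the 50% pruning proportion is illustrative):

```python
def prune_by_magnitude(weights, proportion):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * proportion)           # number of weights to remove
    threshold = flat[k - 1] if k > 0 else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

layer = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
pruned = prune_by_magnitude(layer, 0.5)
print(pruned)  # the three smallest-magnitude weights become zero
```

After such a step, the surviving weights would be fine-tuned on training data to recover any lost accuracy, as the embodiment describes.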
More preferably, the technique focuses on precise quantization of model reasoning parameters, minimizing the impact of model-structure pruning on accuracy. Through a fine-grained quantization strategy, floating-point parameters can be precisely mapped to lower-bit integer representations, reducing model computation and memory overhead. At the same time, to guarantee model performance and accuracy, the technique selects a suitable quantization method and parameters according to the characteristics of the model and the requirements of the task, ensuring that the model's precision loss stays within an acceptable range.
Further, the data preprocessing further includes:
The data of the model structure is converted according to the data format of the format model, and the converted data is transmitted to the reasoning engine model as input information. Specifically, in this embodiment, data type conversion is completed in the preprocessing part, together with data format conversion and image preprocessing (scaling, cropping, channel adjustment, and normalization).
Preferably, this embodiment can provide a multi-modal model reasoning acceleration service. The data preprocessing before large-model reasoning is an important step of deep learning model reasoning: it ensures that the input data meets the model's requirements and performs the necessary data transformation and normalization before reasoning.
Before reasoning, the data of the model structure must be converted into data types the reasoning engine model can process. Typically, a deep learning model requires data of a specific type, such as floating-point numbers, during reasoning; if the data type of the input does not match the type expected by the model, data type conversion is needed. A deep learning model also generally expects a specific data format as input; for example, an image model may require the input to be an image tensor. The data of the model structure therefore needs to be converted into the data format expected by the reasoning engine model, and if the model is image-related, image preprocessing is a key step.
This includes the following common operations. Scaling: the image is resized to the model's required input size, because a model's input size is usually fixed. Cropping: in some cases the image is cropped to extract a region of interest. Channel adjustment: image color channels exist in different orders (RGB or BGR), and during preprocessing the channel order must be adjusted to the order the model requires. Normalization: the image pixel values are mapped to a fixed range, [0, 1] or [-1, 1]; normalization helps the model train and infer better. Data enhancement: in some cases, data enhancement operations such as random rotation, flipping, and noise addition may be applied to the image data before reasoning; data enhancement improves the model's generalization ability and makes it more robust to input changes. Preferably, after the preprocessing steps are completed, the data can be transmitted to the model as input for reasoning. The preprocessing process ensures that the input data matches the model's expected input, providing the model with appropriate input and thus yielding accurate reasoning results.
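The channel adjustment, normalization, and data format conversion steps of this chain can be sketched with NumPy (the 4x4 image and the mean/std values are illustrative assumptions):

```python
import numpy as np

# Illustrative stand-in for a decoded 4x4 BGR uint8 image.
image_bgr = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)

# Channel adjustment: BGR -> RGB.
image_rgb = image_bgr[:, :, ::-1]

# Normalization: map pixel values to [0, 1], then standardize per channel.
x = image_rgb.astype(np.float32) / 255.0
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # hypothetical stats
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
x = (x - mean) / std

# Data format conversion: HWC -> CHW, then add a batch dimension (NCHW).
tensor = np.transpose(x, (2, 0, 1))[np.newaxis, ...]
print(tensor.shape)  # (1, 3, 4, 4)
```

Scaling and cropping would precede these steps and typically use an image library; the resulting NCHW float tensor is what an image model generally expects as input.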
Further, for the initializing model processing, further comprising:
The input requests are traversed and an input text is obtained from each request; the Pipeline data preprocessing is performed; the forward processing of the Pipeline is then performed using the text model, which mainly carries out encoding and decoding operations on the text; the data generated by the encoding and decoding operations undergo data format conversion to obtain the Pipeline data, on which post-processing is then performed. In this process, the Triton inference library functions are used.
Preferably, the model class instance is initialized using the initialize method. The method comprises the following steps: obtaining the output data type through self.output_dtype; initializing a CLIPTokenizer instance as the text processing tool; initializing an LMSDiscreteScheduler instance for scheduling the number of steps of the model run; and initializing a UNet2DConditionModel instance for image generation. The inference core is carried out in the execute method. In this method, the input requests are traversed, each request containing an input prompt (i.e., an input text prompt). For each request, the input text is acquired and tokenization is performed to obtain tokenized_text (conditional text) and tokenized_text_uncond (unconditional text). Then, an InferenceRequest is constructed, and the text_encoder model processes the input tokenized_text to obtain the text encoding vector text_embeddings. After the above process is completed, the scheduler is run to adjust the noise vector over a series of time steps to generate the latent vector; the latent vector is used for image generation. Decoding the latent_sample vector through the vae model generates the image decoded_image. The generated image is post-processed, including normalizing and converting the image, converting it into a NumPy array, and performing data type conversion on it. The processed result is packaged into an InferenceResponse containing the generated image data.
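The post-processing at the end of the execute method can be sketched as follows. The helper name and the assumed value range are illustrative: a VAE decoder conventionally emits an NCHW float tensor in roughly [-1, 1], but the exact layout depends on the deployed model.

```python
import numpy as np

def postprocess_image(decoded, assumed_range=(-1.0, 1.0)):
    """Hypothetical sketch of the post-processing step: take a decoded
    image tensor (NCHW float, assumed in [-1, 1]), normalize it to [0, 1],
    and convert it to HWC uint8 NumPy images ready for serialization."""
    lo, hi = assumed_range
    img = (decoded - lo) / (hi - lo)        # normalize: [-1, 1] -> [0, 1]
    img = np.clip(img, 0.0, 1.0)            # guard against decoder overshoot
    img = np.transpose(img, (0, 2, 3, 1))   # NCHW -> NHWC layout
    return (img * 255.0).round().astype(np.uint8)  # data type conversion

out = postprocess_image(np.full((1, 3, 64, 64), 1.0, dtype=np.float32))
```

The uint8 NumPy result is what would then be wrapped into the response object returned to the client.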
Preferably, the technique introduces pipeline reasoning, decomposing the reasoning process into multiple independent stages, so that other preprocessing operations, such as data format conversion and data enhancement, can be performed while the model is being loaded. In this way, the effective utilization of the cache during reasoning is optimized, and efficient, unified management of multiple model requests is achieved. The staged design not only improves the overall reasoning speed, but also allows the different stages to be processed in parallel, further improving the efficiency of model reasoning.
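The overlap between model loading and preprocessing described above can be illustrated with a small sketch; the stage bodies and timings are stand-ins, not the embodiment's actual stages.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_model():
    # Stand-in for the model-loading stage (hypothetical 0.2 s cost).
    time.sleep(0.2)
    return "model"

def preprocess(batch):
    # Stand-in for data format conversion / data enhancement (0.1 s cost).
    time.sleep(0.1)
    return [x * 2 for x in batch]

start = time.time()
with ThreadPoolExecutor(max_workers=2) as pool:
    model_future = pool.submit(load_model)         # stage 1: load weights
    data_future = pool.submit(preprocess, [1, 2])  # stage 2: runs concurrently
    model, data = model_future.result(), data_future.result()
elapsed = time.time() - start
# Overlapping the stages takes about max(0.2, 0.1) s instead of 0.3 s serial.
```

The same idea extends to a request queue: while one request's model weights load, the next request's inputs are already being converted, which is the staged parallelism the text refers to.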
Referring to fig. 3, the containerization tool deploys an inference acceleration service, further comprising:
The model structure, the data preprocessing, and the reasoning engine model are placed into the container; the container is configured, including the image, a monitoring system, and a logging mechanism; a request client is established; and communication and data sharing among the containers are realized through the container configuration. Specifically, in this embodiment, the reasoning acceleration service is deployed and executed flexibly and conveniently in a heterogeneous edge network environment, and the container is used to package the reasoning acceleration program. The operations of the model structure, the data preprocessing, and the reasoning engine model are all carried out in a container, and the reasoning service deployment is divided into two environments, namely:
(1) Deploying the model reasoning acceleration service with docker in a standalone environment, including:
Writing a Dockerfile describing how to construct the docker container; the content of the Dockerfile typically includes the base image, the working directory, dependency libraries, the model path (model repository) and the inference service script path, the service port, and the inference service run command. The model and configuration files are placed in a normalized layout, using the model configuration file of the model network; the storage structure of the model inference files and configuration files in the repository is shown with reference to fig. 4. The Dockerfile is used to run the model reasoning request server: first, enter the directory where the Dockerfile is located and build a docker image using the docker build command; then, using the constructed image, open the reasoning request port, map the mounted model repository folder into the container, and run the container after configuring these options. Finally, a model reasoning request client is constructed to test requests; under this environment, the Stable Diffusion v1-5 model is deployed, and a client is built through a script to request reasoning.
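A Dockerfile of the kind described might look like the following sketch. The base image tag, paths, and port are illustrative assumptions rather than the exact files of the embodiment:

```dockerfile
# Hypothetical base image with the inference runtime preinstalled
FROM nvcr.io/nvidia/tritonserver:23.08-py3

# Working directory
WORKDIR /workspace

# Dependency libraries needed by the preprocessing and model scripts
COPY requirements.txt .
RUN pip install -r requirements.txt

# Model repository (model files plus per-model configuration files)
COPY model_repository /models

# Inference service port
EXPOSE 8000

# Inference service run command
CMD ["tritonserver", "--model-repository=/models", "--http-port=8000"]
```

The image is then built and run from the directory containing the Dockerfile, e.g. `docker build -t inference-service .` followed by a `docker run` that publishes the port and bind-mounts the model repository.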
(2) Deploying the model reasoning acceleration service with kubernetes on an edge network architecture, including:
Using Kubernetes is an extension of the standalone environment. A Kubernetes cluster is set up that can contain multiple edge devices, each of which can be a Kubernetes node; the size and configuration of the cluster should suit the reasoning needs. A Kubernetes deployment description file (yaml format) is created using the image generated in the standalone environment, defining the deployment specification of the model reasoning service; the description file specifies the generated image to use, the resource allocation, and the environment variable configuration. The kubectl command is used to apply the description file to the Kubernetes cluster, which automatically creates the corresponding Pod and container instances. A load balancer is created using a Service object and associated with the deployed model reasoning service; this ensures that reasoning requests are evenly distributed across the nodes in the cluster and provides a service discovery mechanism so that other services can easily find the model reasoning service. By configuring Kubernetes resource limits, usage limits on resources such as CPU and memory can be specified for each model reasoning service to avoid resource contention and performance problems.
In addition, affinity rules are set to bind the model reasoning service to nodes of a specific type, so as to meet specific hardware requirements. Communication and data sharing between containers are achieved using service discovery or by configuring ConfigMap and PersistentVolume mechanisms. Appropriate security policies are configured to limit access to the model reasoning service, and identity verification and authorization mechanisms are used to protect the security of the models and data. The monitoring system and logging mechanism are configured to monitor the performance, running state, and resource usage of the model reasoning service in real time, so that potential problems can be quickly identified and eliminated; automatic scaling rules are set according to the monitoring indicators so that the number of deployed replicas adjusts automatically with the load.
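The Deployment and Service objects described above can be sketched in one manifest. The image name, label, node selector key, port numbers, and resource figures are illustrative assumptions, not values from the embodiment:

```yaml
# Hypothetical deployment of the inference image built in the standalone step
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-inference
spec:
  replicas: 2
  selector:
    matchLabels: {app: model-inference}
  template:
    metadata:
      labels: {app: model-inference}
    spec:
      nodeSelector:
        accelerator: gpu            # affinity-style binding to GPU nodes
      containers:
      - name: server
        image: registry.example.com/model-inference:latest  # assumed image
        ports:
        - containerPort: 8000       # inference request port
        resources:
          limits: {cpu: "4", memory: 8Gi}   # per-service resource limits
---
apiVersion: v1
kind: Service
metadata:
  name: model-inference-svc
spec:
  selector: {app: model-inference}
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer                # distributes requests across Pods
```

Applying this with `kubectl apply -f` creates the Pods and the load-balanced entry point; a HorizontalPodAutoscaler could then supply the automatic scaling rule mentioned above.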
More preferably, this embodiment deploys the reasoning service by introducing a containerization tool, which lowers the deployment threshold of model reasoning. Containerization packs the reasoning service and its dependency environment into independent containers, thereby realizing consistent deployment across different environments, simplifying the configuration and deployment process, increasing deployment flexibility, and adapting to the requirements of different hardware platforms and scenarios, with remarkable advantages especially for heterogeneous server architectures in edge computing environments.
In summary, the technique supports multi-modal models in model reasoning, enabling different types of data (e.g., images and text) to be jointly processed. By effectively feeding multi-modal data into the model, the technique further improves the data processing capability of the model. In multi-modal model reasoning acceleration tests, the technique shows leading performance and can better meet the requirements of real-world multi-modal data processing. This capability not only helps to improve the overall performance of the model, but also enables more comprehensive capture of the associations and information between data.
Finally, it should be noted that the above is only a preferred embodiment of the present application, and the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiment, those skilled in the art may still modify the technical solution described in the foregoing embodiment or substitute equivalents for some of its technical features. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (10)
1. The edge side model reasoning acceleration method for containerized deployment is characterized by comprising the following steps of:
s1: converting a training model according to a format model, outputting a conversion model, establishing a model network according to the structure of the conversion model, and outputting a model structure by carrying out quantization operation and pruning operation on the conversion model of the model network;
s2: finishing data preprocessing on the model structure through data type conversion and data format conversion;
s3: generating an initialization model by initializing the model structure, and obtaining a processing model and processing data after processing the initialization model to complete an inference engine model;
s4: an inference acceleration service is deployed through a containerization tool such that the model structure, the data preprocessing, and the inference engine model execute in the container.
2. The method for acceleration of edge-side model reasoning for a containerized deployment of claim 1, wherein the conversion model further comprises:
the training model is converted according to the format of the format model, and the format of the conversion model is the format of the format model.
3. The method for accelerating edge-side model reasoning for containerized deployments of claim 1, wherein step S1 further comprises:
the model structure is verified using a verification tool to check whether the output model structure, input information, and output information are correct.
4. The edge-side model inference acceleration method of a containerized deployment of claim 2, wherein the model network further comprises:
s21: analyzing the structure of the conversion model, checking the input information and the output information of the conversion model, and creating an input node and an output node in the model network;
s22: and establishing a network layer and connection in the model network according to the structure of the conversion model, and ensuring the correct transmission of the input information and the output information.
5. The edge-side model inference acceleration method of a containerized deployment of claim 4, wherein the network layer and connection establishment further comprises:
and setting the network layer parameters at the network layer by analyzing the parameter values in the conversion model, and connecting the network layers according to the connection relation of the conversion model.
6. The method of edge-side model inference acceleration for containerized deployments of claim 4, wherein the quantization operation further comprises:
after quantization pseudo nodes are inserted at the input nodes and the output nodes of the model network, the weights and the activation values of the model network are quantized, quantized inference is performed on the quantization model, an inverse quantization pseudo node is then inserted, and finally the quantization model is output.
7. The edge-side model inference acceleration method of a containerized deployment of claim 6, wherein the pruning operation further comprises:
pruning the quantization model to remove redundant parameters, and outputting the model structure after fine-tuning the pruned quantization model.
8. The edge-side model inference acceleration method of a containerized deployment of claim 6, wherein the data preprocessing further comprises:
and converting the data of the model structure according to the data format of the format model, and transmitting the converted data to the reasoning engine model as input information.
9. An edge-side model inference acceleration method for a containerized deployment of claims 1 and 8, wherein the inference engine model further comprises:
traversing the input requests, obtaining an input text from each request, performing the Pipeline data preprocessing, performing the forward processing of the Pipeline using the text model, wherein the forward processing mainly carries out encoding and decoding operations on the text, performing data format conversion on the data generated by the encoding and decoding operations to obtain the Pipeline data, and performing post-processing on the data.
10. The edge-side model inference acceleration method of containerized deployment of claim 1, wherein the containerized tool deploys inference acceleration services, further comprising:
and placing the model structure, the data preprocessing and the reasoning engine model into the container, configuring the container including mirroring, a monitoring system and a log mechanism, and establishing the client of the request, and realizing communication and data sharing among containers through container configuration.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311201460.6A CN117196000A (en) | 2023-09-18 | 2023-09-18 | Edge side model reasoning acceleration method for containerized deployment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117196000A true CN117196000A (en) | 2023-12-08 |
Family
ID=89001374
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311201460.6A Pending CN117196000A (en) | 2023-09-18 | 2023-09-18 | Edge side model reasoning acceleration method for containerized deployment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117196000A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117992078A (en) * | 2024-04-03 | 2024-05-07 | 山东浪潮科学研究院有限公司 | Automatic deployment method for reasoning acceleration service based on TensorRT-LLM model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111178517B (en) | Model deployment method, system, chip, electronic equipment and medium | |
CN114035937A (en) | Distributed training and reasoning method, system, equipment and readable storage medium based on artificial intelligence | |
US11580671B2 (en) | Hash-based attribute prediction for point cloud coding | |
US20220164666A1 (en) | Efficient mixed-precision search for quantizers in artificial neural networks | |
CN111191789B (en) | Model optimization deployment system, chip, electronic equipment and medium | |
CN117196000A (en) | Edge side model reasoning acceleration method for containerized deployment | |
CN111369430B (en) | Mobile terminal portrait intelligent background replacement method based on mobile deep learning engine | |
CN113570030A (en) | Data processing method, device, equipment and storage medium | |
US20240161474A1 (en) | Neural Network Inference Acceleration Method, Target Detection Method, Device, and Storage Medium | |
US20240013445A1 (en) | Coding of multiple-component attributes for point cloud coding | |
KR20220143792A (en) | Apparatus and Method for Convolutional Neural Network Quantization Inference | |
CN115934275A (en) | Task processing method and dialogue task processing method | |
WO2022246986A1 (en) | Data processing method, apparatus and device, and computer-readable storage medium | |
CN116644180A (en) | Training method and training system for text matching model and text label determining method | |
CN115496181A (en) | Chip adaptation method, device, chip and medium of deep learning model | |
US20210232891A1 (en) | Neural network model compression with structured weight unification | |
EP4042677A1 (en) | Multi-quality video super resolution with micro-structured masks | |
CN113269320A (en) | Processing unit, computing device, system on chip, data center and related methods | |
CN116755714B (en) | Method, device, equipment and storage medium for operating deep neural network model | |
CN116051964B (en) | Deep learning network determining method, image classifying method and device | |
CN115480745B (en) | Code generation method and device based on configuration file | |
CN116980423B (en) | Model scheduling method, device, computing system, equipment and readable storage medium | |
US20240143414A1 (en) | Load testing and performance benchmarking for large language models using a cloud computing platform | |
WO2024131170A1 (en) | Operator processing method and apparatus, and chip, computing device and storage medium | |
EP4220501A1 (en) | Runtime predictors for computation reduction in dependent computations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||