CN113902116A - Deep learning model-oriented reasoning batch processing optimization method and system - Google Patents


Info

Publication number
CN113902116A
CN113902116A CN202111151184.8A
Authority
CN
China
Prior art keywords
batch
batch processing
window
time
reasoning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111151184.8A
Other languages
Chinese (zh)
Inventor
刘杰
张衡
王帅
吴怀林
王宗成
叶丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202111151184.8A
Publication of CN113902116A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a deep learning model-oriented reasoning batch processing optimization method and system. The system comprises a load container batch processing analysis tool module, a batch processing merging tool module and an algorithm service calling module. The load container batch processing analysis tool module stores the inference input parameters of inference service requests into a preprocessing data set, generates training data to run performance tests on the model in each container, and then determines the optimal parameters according to the test result indexes. The batch processing merging tool module predicts the inference service requests of the next time window according to the optimal parameters, generates batch processing tasks and sends them to the corresponding containers for execution. The invention optimizes deep learning inference service performance under the serverless architecture, effectively utilizes multi-core parallel computing capability, and can efficiently handle high-concurrency inference service requests, thereby greatly improving resource utilization, task execution delay and throughput.

Description

Deep learning model-oriented reasoning batch processing optimization method and system
Technical Field
The invention relates to a deep learning model-oriented reasoning batch processing optimization method and system, and belongs to the field of computer artificial intelligence and cloud computing.
Background
With the rapid development of emerging technologies such as 5G, the Internet of Things, big data and cloud computing, artificial intelligence is rising vigorously and becoming a decisive force pushing humanity into the intelligent era. Industry worldwide has fully recognized the significance of artificial intelligence in leading a new round of industrial transformation, and is accelerating its transformation and strategic layout around the artificial intelligence ecosystem. With the rapid development of the artificial intelligence industry, the further maturing of the technology and the increasing investment from governments and industry, the migration of artificial intelligence applications to the cloud will keep accelerating and bring increasingly prominent social influence. As a core force of the new technological revolution and industrial transformation, artificial intelligence is driving the upgrading of traditional industries, promoting the rapid development of the unmanned economy, and producing positive effects in civil fields such as intelligent transportation, smart homes and intelligent medical care.
Machine Learning is a core method for implementing artificial intelligence. It analyzes and processes data with computers, discovers intrinsic patterns using algorithms such as classification, regression and clustering, and then makes predictions or decisions about events. Deep Learning is a technical means of realizing machine learning that shows excellent performance in feature extraction: by constructing a neural network with multiple hidden layers and training it on massive data, useful features are learned and the accuracy of classification or prediction is ultimately improved. With the continuous maturing of deep learning technology, a series of classical and general algorithms and models have formed in fields such as computer vision, image processing and natural language processing, for example AlexNet and VGGNet in image processing, and BERT and Transformer in natural language processing. As a result, more and more algorithm developers deploy trained deep learning models to publish reasoning services, which has become the basis of much scientific research work.
Meanwhile, with the development of cloud computing, the serverless architecture has become a new cloud architecture paradigm. It provides a service mode in which users invoke resources on demand, hides instance and task management details from the user, and lets the user publish a service simply by providing a function and its trigger event. A large number of applications are already deployed on serverless platforms, such as intelligent transportation systems, Internet of Things frameworks, subscription services, video/image processing, and deep learning related services.
Generally, a function in a serverless architecture tends to be a short, reusable code block with a short life cycle, no state, and no long-lived connections. In deep learning, a trained algorithm can be exported as a model file, so the function code only needs to implement model loading and the inference routine, and the amount of code is relatively small. The inference process itself is stateless, its execution time is often at the second level for low-latency inference tasks, and no long-lived connection service is required; in typical usage scenarios the deep learning algorithm therefore satisfies the requirements of a serverless function. This creates a great opportunity for deploying deep learning algorithms on a serverless architecture: the trained deep learning model is deployed on the serverless architecture and published as a service that provides inference prediction, reducing repeated development and deployment work. The following challenges mainly remain:
in the process of deploying deep learning model inference services on a serverless architecture, three challenges are faced: (1) how to respond well to bursts of inference service requests; (2) how to perform inference quickly and with low latency while ensuring the overall throughput of the system; (3) how to automatically identify the optimal resource allocation for a deep learning model, so that system utilization is maximized under reasonable resource allocation and the overall reasoning overhead of the system is reduced.
Disclosure of Invention
Aiming at the above deep learning reasoning problems, the invention provides a deep learning model-oriented reasoning batch processing optimization method and system. Because a deep learning algorithm model can accelerate batch processing according to the hardware computing resources during inference computation, a load container batch processing parameter analysis method is proposed to obtain the optimal batch processing parameters of a service, and a batch processing merging algorithm with an adaptive sliding window is proposed to improve the performance of real-time online reasoning. The system architecture of the present invention is shown in fig. 1.
The technical scheme adopted by the invention is as follows:
a deep learning model-oriented reasoning batch processing optimization method comprises the following steps:
1) acquiring an online reasoning service request through a request interceptor, and storing reasoning input parameters in the reasoning service request into a preprocessing data set;
2) dividing the inference service request into a CPU type task and a GPU type task according to the type of resources occupied by the inference service request; setting a plurality of containers aiming at the CPU type task, wherein different containers have different memory sizes and CPU core numbers; aiming at the GPU type task, generating a plurality of GPU type task containers with set memory size and GPU kernel number;
3) aiming at the reasoning service of the same model, if the reasoning service is a CPU type task, generating training data sets with different batch processing sizes aiming at the model according to the reasoning input parameters in the preprocessing data set, inputting the training data sets into each container to perform performance test on the model, and then generating a load performance table according to the test result index; then, comparing the batch processing combined execution time Batchtime with the uncombined execution time NoBatchtime in the load performance table, and screening records in the load performance table according to a comparison result of the ratio and a set threshold value delta; then selecting a plurality of records with the highest batch processing frequency from the screened records, and selecting the record with the smallest memory from the records; if the inference service is a GPU type task, generating training data sets with different batch processing sizes according to inference input parameters in the preprocessing data set, inputting the training data sets into a GPU type task container to perform performance test on the model, and selecting a record with the largest batch processing size under the condition of meeting a threshold index;
4) determining the optimal parameters from the record selected in step 3), namely the batch size Y_batch and the batch execution time T_batch in the selected record;
5) caching the inference service requests received in real time into a task cache queue; then a workload aggregator determines a time interval according to the batch execution time T_batch, and counts the number of inference service requests in each time interval in the task cache queue to obtain a time series;
6) intercepting the inference service requests of the latest period from the time series for prediction to obtain the number of service requests in the next time interval, and inputting it into an adaptive window algorithm model; the adaptive window algorithm model adjusts the left and right boundaries of the current window according to the input information, then the inference service requests within the window are taken from the task cache queue to generate batch processing tasks, and the batch processing tasks are sent to the corresponding containers for execution according to the type of resources they occupy.
Further, the method for predicting the number of service requests in the next time interval, given the optimal batch size and the optimal batch execution time, is as follows: set the time series as {X1, X2, X3, … Xt}, where Xt is the number of inference service requests in the t-th time interval, and set α as the smoothing coefficient; according to the recurrence relations

S_i^(1) = α×Xi + (1-α)×S_(i-1)^(1), S_i^(2) = α×S_i^(1) + (1-α)×S_(i-1)^(2)

the quadratic exponential smoothing value is calculated, where S_i^(2) is the quadratic exponential smoothing value of Xi and S_i^(1) is the first exponential smoothing value of Xi; then the number of service requests in the next time interval is predicted as Y_(t+1) = a_t + b_t, where

a_t = 2×S_t^(1) - S_t^(2), b_t = (α/(1-α))×(S_t^(1) - S_t^(2)).
Further, the method by which the adaptive window algorithm model adjusts the left and right boundaries of the current window according to the input information is as follows: first, a window discriminant function is determined,

P(t) = 1 if b_t > 0, and P(t) = -1 if b_t ≤ 0,

which judges whether the window is enlarged or reduced; then the adjusted window is determined according to the formula T_(length+1) = T_length + (P(t)×σ(Z)×T_batch), wherein

σ(Z) = 1/(1+e^(-Z)), Z = ΔF/Y_batch, ΔF = |Y_(t+1) - y_t|;

T_(length+1) is the adjusted window length, and y_t represents the real-time request quantity of the t-th time interval.
Further, the strategy for adjusting the left and right boundaries of the current window is: W_(left+1) = W_left + T_finished, W_(right+1) = W_(left+1) + T_(length+1); wherein W_(left+1), W_(right+1) are the adjusted left and right boundaries, and T_finished is the length of time that batch requests within the window have already completed.
Further, if the ratio of the batch-merged execution time Batchtime to the un-merged execution time NoBatchtime of a record satisfies Batchtime/NoBatchtime ≤ δ, the record is retained.
A deep learning model-oriented reasoning batch optimization system is characterized by comprising a load container batch analysis tool module and a batch merging tool module; wherein
The load container batch processing analysis tool module is used for acquiring online reasoning service requests through the request interceptor and storing the reasoning input parameters of the reasoning service requests into a preprocessing data set; then dividing the inference service requests into CPU type tasks and GPU type tasks according to the type of resources they occupy; setting a plurality of containers for the CPU type task, wherein different containers have different memory sizes and CPU core numbers; and generating, for the GPU type task, a plurality of GPU type task containers with set memory size and GPU core number. Then, for the reasoning service of the same model, if it is a CPU type task, training data sets with different batch sizes are generated for the model according to the reasoning input parameters in the preprocessing data set and input into each container to run performance tests on the model, and a load performance table is generated according to the test result indexes; the ratio of the batch-merged execution time Batchtime to the un-merged execution time NoBatchtime in the load performance table is then compared with the set threshold δ, and the records in the load performance table are screened according to the comparison result; from the screened records, the records with the highest batch processing frequency are selected, and among them the record with the smallest memory is selected. If it is a GPU type task, training data sets with different batch sizes are generated according to the reasoning input parameters in the preprocessing data set and input into the GPU type task container to run performance tests on the model, and the record with the largest batch size that meets the threshold index is selected. The optimal parameters are then determined from the selected record, namely the batch size Y_batch and the batch execution time T_batch in the selected record.
The batch processing merging tool module is used for caching the inference service requests received in real time into a task cache queue; then the workload aggregator determines a time interval according to the batch execution time T_batch and counts the number of inference service requests in each time interval in the task cache queue to obtain a time series; then the inference service requests of the latest period are intercepted from the time series for prediction, the number of service requests in the next time interval is obtained and input into the adaptive window algorithm model, the adaptive window algorithm model adjusts the left and right boundaries of the current window according to the input information, then the inference service requests within the window are taken from the task cache queue to generate batch processing tasks, and the batch processing tasks are sent to the corresponding containers for execution according to the type of resources they occupy.
The load container batch processing parameter analysis method adopts a "serve first, then strategize" mode: the algorithm model is first put online for a period of time, the service request data of the service is obtained through request interception, service requests for different container environments and different batch sizes are generated automatically, a load performance test is then run to generate the load performance table of the algorithm model, and the appropriate container resource allocation, batch size and batch execution time are selected through a parameter analysis method. As shown in fig. 2, the method comprises the following steps:
intercepting a deep learning inference service request: the HTTP request (i.e. the inference service request) which has been served online is forwarded by the request interceptor, thus obtaining the parameters of this request. The HTTP parameters of the inference service request are stored in a request parameter data set, such as request sending time and response time, and the inference input parameters are stored in a preprocessing data set to generate a training data set.
Generating containers of different resource sizes with a container generator: the tasks are divided into CPU type tasks and GPU type tasks according to the task type of the inference service request and the type of resources occupied. For the CPU type task, containers are divided into several groups according to memory size and CPU core number, following a set of preset specifications; for the GPU type task, a GPU type task container with a specific memory size and GPU core number is generated directly.
Generating training data sets with different batch sizes from the reasoning input parameters: a plurality of data sets are generated for the same model by specifying batches of different sizes; these data sets are used as input parameters and placed into the corresponding containers for execution.
Further, the method for performing the load performance test: for the CPU type task, performance tests are run in the different containers on batch data sets of different sizes for the reasoning service of the same model; test result indexes such as memory size, batch execution time and un-merged execution time are then stored, and the load performance table is generated from these indexes. For the GPU type task, training data sets with different batch sizes are generated according to the reasoning input parameters in the preprocessing data set and input into the GPU type task container to run performance tests on the model, and the record with the largest batch size that meets the threshold index is selected; the optimal parameters are then determined from the selected record, namely the batch size Y_batch and the batch execution time T_batch.
Further, the batch parameter analysis method comprises the following steps:
the ratio of the batch-merged execution time Batchtime to the un-merged execution time NoBatchtime in the load performance table is compared with the set threshold δ, i.e. records satisfying Batchtime/NoBatchtime ≤ δ are kept, in order to select the appropriate records. The selection rule is as follows:
and (3) for the CPU type task, the records meeting the threshold are screened further. At this point several groups of container records may still satisfy the condition, so selection continues among the remaining records: the candidate set with the largest batch size and the highest occurrence frequency is chosen first, and then the record with the smallest memory is selected as the final result. For the GPU type task, only the record with the largest batch size that meets the threshold index needs to be selected. The resulting batch parameters serve as important parameters in the subsequent adaptive sliding window algorithm, as sketched below.
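For illustration, a minimal sketch of this selection rule follows, assuming the load performance table is a list of records with fields memory, batch_size, batch_time (Batchtime) and nobatch_time (NoBatchtime); all names are illustrative rather than taken from the patent:

    from collections import Counter

    def select_batch_params(records, delta, task_type="cpu"):
        # Keep only records whose merged/un-merged execution-time ratio meets the threshold.
        kept = [r for r in records if r["batch_time"] / r["nobatch_time"] <= delta]
        if not kept:
            return None
        if task_type == "gpu":
            # GPU task: take the record with the largest batch size.
            best = max(kept, key=lambda r: r["batch_size"])
        else:
            # CPU task: pick the batch size that occurs most often (largest size breaks ties),
            # then among those records choose the one with the smallest memory.
            freq = Counter(r["batch_size"] for r in kept)
            top_size = max(freq, key=lambda s: (freq[s], s))
            candidates = [r for r in kept if r["batch_size"] == top_size]
            best = min(candidates, key=lambda r: r["memory"])
        # Y_batch and T_batch later guide the adaptive sliding window merging.
        return {"memory": best["memory"], "Y_batch": best["batch_size"], "T_batch": best["batch_time"]}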
The batch merging algorithm based on the adaptive sliding window performs merging optimization on deep learning inference service requests and mainly comprises a time series prediction algorithm for inference service requests and an adaptive sliding window adjustment algorithm, as shown in fig. 3.
The inference service time series prediction algorithm provides the conditions for the subsequent merging of inference services and consists of the following parts: the task cache queue, the request time series, time series prediction, and prediction result evaluation.
Task cache queue: the task cache queue caches requests in the order in which inference service requests arrive, and the queue needs to record the arrival time of each request. Because the existing serverless platform does not support caching of service requests, the task cache queue needs to be set up outside the serverless platform to enable batch processing support for the inference service.
Further, for the request time series: generating the time series requires counting the amount of service request data within a fixed time interval. The size of this interval determines the granularity of the time series statistics, reflects different regular characteristics, and directly determines the time span of the prediction for the algorithm service. Because different algorithm services have different states and the statistical granularity of their time series differs, the batch execution time recorded in the load container batch parameter analysis stage is selected as the reference standard for generating the request time series.
Further, for time series prediction, the quadratic (double) exponential smoothing algorithm is selected for service prediction. In essence, historical data are weighted and averaged as the prediction for a future time, with data at different times given unequal weights so that the prediction stays close to the real data; this yields both the predicted value of the service and the magnitude of its trend.
Further, the prediction evaluation indexes RMSE and MAE are used to evaluate the prediction result, and the prediction quality is judged from the error between the predicted and observed values.
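For reference, the two evaluation indexes follow their standard definitions; a minimal sketch:

    import math

    def rmse(y_true, y_pred):
        # Root mean squared error between observed and predicted request counts.
        return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

    def mae(y_true, y_pred):
        # Mean absolute error between observed and predicted request counts.
        return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)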
Adaptive sliding window adjustment algorithm: the load container parameter analysis method and the time series prediction complete the basic work needed to dynamically merge the requests of a model. Under a real workload scenario, the adaptive sliding window adjustment algorithm dynamically adjusts the size of the sliding window according to the trend of the service request quantity, so that requests are merged according to the traffic size under different window sizes and the performance of the reasoning service is improved. It mainly comprises the following steps:
judging the trend change of the service request, and judging whether the service request data is increased or decreased at the current time point through a P (t) window discrimination function, wherein if the service request data is increased, the corresponding window is enlarged; if it is a decrease, the corresponding window becomes smaller.
Further, the magnitude of the window adjustment is determined by introducing the batch size and batch execution time parameters obtained in the load container batch parameter analysis stage.
Further, the increment ΔF of the predicted traffic is determined, and then the ratio Z of the predicted traffic increment ΔF to the optimal batch size is calculated.
Further, Z is mapped through a Sigmoid function to obtain σ(Z).
Further, the window discriminant function P(t) and σ(Z) are multiplied by the batch execution time and added to the current time window size T_length to obtain the new window size T_(length+1).
Further, the left and right boundaries of the window are determined: the left boundary moving strategy adds the completed service time to the current left boundary to form the new left boundary, and the right boundary is the new left boundary plus T_(length+1), giving the new window extent.
Compared with the prior art, the invention has the following advantages. Since the selection of windows is time sensitive, the window size directly affects the efficiency of batch merging. Conventional methods for window size selection are the sliding window based on a fixed number of requests and the sliding window based on a fixed time; the window sizes of both methods are limited to specific conditions and cannot meet the short-term dynamic changes of time series data. The adaptive window in this method scales automatically with the traffic size, and the adjusted time window serves as the left and right boundary for merging service requests, so that different traffic sizes lead to different window sizes at different request moments. The biggest difference from merely adapting the window size is that the batch merging algorithm of the adaptive sliding window allows multiple batches within one window: when the window is enlarged and the requests in the window exceed the maximum batch size, the requests are divided into several batches according to a greedy algorithm and invoked; when requests remain in the window, they are merged into the next window.
Drawings
FIG. 1 is a diagram of the overall architecture of the batch optimization system of the present invention.
FIG. 2 is a flow chart of batch parameter analysis in accordance with the present invention.
FIG. 3 is a flow chart of batch process consolidation.
FIG. 4 is a diagram of a batch merge architecture.
FIG. 5 is a flow chart of platform algorithm release.
Fig. 6 is a flow chart of OpenFaaS deployment invocation.
Detailed Description
The technical scheme of the invention is shown in figure 1 and mainly comprises: a load container batch processing analysis tool module, a batch processing merging tool module and an algorithm service calling module. Through the cooperative work of these modules, the deep learning model reasoning batch processing optimization technique and system provided by the invention can be realized.
Among the above modules, the load container batch analysis tool module:
as shown in the left branch of fig. 4, the load container batch analysis tool solves the problem of best matching container performance to service performance. The method comprises the steps of storing historical service requests to generate a service request data set, dividing input parameters in the service request data set into sets with different Batch processing sizes by a Batch generator, constructing containers with different memory sizes by the container generator, automatically adjusting the parameters of spec, contacts, resources, request, memory in an OpenFaaS configuration file pod.yaml, and deploying functions through faas-cli deployment-f.
Then the load trainer sends simulated service requests to run inference on the data of the different batches in the Batch training set. After each batch inference finishes, the load trainer also starts multi-threaded asynchronous inference for every record in the batch data to test the non-batched service performance, and collects the service performance indexes. When an environment container has finished testing all training sets, the container is destroyed and the container with the next memory size is generated to continue these steps. Finally a load performance table is generated, containing the performance parameters of the different batch sizes. The parameter selector selects the batch size and batch execution time of the algorithm model under a specific memory size through a user-defined algorithm rule; these two parameters serve as key data to guide the batch merging algorithm of the adaptive sliding window in the subsequent real-time traffic scenario, and the memory size is used as the online container environment of the algorithm model. The concrete implementation is as follows:
The load performance table and the parameter selection strategy are given as figures in the original publication and are not reproduced here.
For this module implementation, the following interface configuration is provided.
    • getArrayList(<request[input]>): returns the data set of service requests
    • outMultMemoryPod(merory,core): returns a created container
    • outMultSizeBatch(size): returns a batch data set of fixed size
    • collectTrainResult(pod,size): returns the execution results for a specific container and batch size
    • selectBestParam(result): returns the optimal batch size and memory size
Among the above modules, the batch process merge tool module:
as shown in the right branch of fig. 4, the batch merging tool solves the problem of how to perform batch merging in a real-time scenario, so as to improve service performance. For a real-time service request, Nginx configuration is started to forward the service request to a batch processing merging tool, a task buffer queue performs buffer, the buffer queue is realized by a LinkedBlockingQueue blocking queue, then a workload aggregator counts request tasks in the queue according to a time interval, the time interval selects batch processing execution time generated in a load stage of a service container serving by the algorithm as a reference standard, a time sequence of the service request is generated, and the sequence stores the flow size of current and historical time intervals.
Finally, the time series prediction algorithm intercepts the recent inference service requests from the time series and starts prediction. The prediction result, namely the number of service requests in the next time interval, together with the batch parameters (the optimal batch size and the optimal batch execution time), is input into the adaptive window algorithm model, which adjusts the left and right boundaries of the current window; the inference service requests within the window are then taken from the request cache queue to generate a batch processing task, which is sent to the corresponding OpenFaaS cluster algorithm container for execution according to the type of resources occupied by the inference service requests.
The specific implementation process is as follows. First, the prediction method of the service is implemented. The first-order exponential smoothing value is calculated over the time series {X1, X2, X3, … Xt} of inference service requests, where S_i^(1) is the first-order smoothed value of the i-th period, Xi is the actual value of the i-th period, and α is the smoothing coefficient (between 0 and 1; the larger α is, the more weight the most recent data receive). The recurrence relation is:

S_i^(1) = α×Xi + (1-α)×S_(i-1)^(1)

A quadratic exponential smoothing value is then calculated, which performs a weighted average again over the first-order smoothed sequence so that the trend can be predicted. Here S_i^(2) is the second-order exponential smoothing value of the i-th period and S_i^(1) is the first-order exponential smoothing value of the i-th period; the formula is:

S_i^(2) = α×S_i^(1) + (1-α)×S_(i-1)^(2)

When the existing data cover t periods in total, the value T periods ahead can be predicted; for the t+1 period, T is taken as 1:

Y_(t+T) = a_t + b_t×T, a_t = 2×S_t^(1) - S_t^(2), b_t = (α/(1-α))×(S_t^(1) - S_t^(2))
The future trend of the service requests is analyzed through the parameter b_t, the future traffic size is judged from the predicted value Y_(t+T), and the prediction result directly determines the size change of the adaptive window.
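For illustration, a compact sketch of this double exponential smoothing prediction follows; the function name and the choice of α are assumptions for demonstration, not values fixed by the patent:

    # Double (quadratic) exponential smoothing over the request-count series.
    # Returns the predicted request count Y_(t+1) and the trend term b_t.
    def predict_next(series, alpha=0.5):
        s1 = series[0]   # first-order smoothed value, initialized with X1
        s2 = series[0]   # second-order smoothed value
        for x in series[1:]:
            s1 = alpha * x + (1 - alpha) * s1
            s2 = alpha * s1 + (1 - alpha) * s2
        a_t = 2 * s1 - s2
        b_t = alpha / (1 - alpha) * (s1 - s2)
        return a_t + b_t, b_t    # Y_(t+1) = a_t + b_t (T = 1); b_t > 0 means growing traffic

Feeding in the per-interval request counts produced by the workload aggregator gives both the predicted request count for the next interval and the trend sign used by the window discriminant function P(t).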
Then the adaptive sliding window adjustment algorithm is implemented. Its first step is to determine the window discriminant function, which decides whether the window is enlarged or reduced. As shown in the formula below, t denotes the current predicted time interval; when b_t is greater than 0 the traffic is increasing, and vice versa:

P(t) = 1 if b_t > 0, and P(t) = -1 if b_t ≤ 0
Next, the magnitude of the window adjustment needs to be determined, which introduces the acquisition batch size and the batch execution time parameter in the batch parameter analysis stage of the load container, and the formula is as follows:
Tlength+1=Tlength+(P(t)×σ(Z)×Tbatch)
Figure BDA0003287161210000093
Figure BDA0003287161210000094
ΔF=|Yt+1-yt|
Here T_(length+1) refers to the length of the window after adjustment; Y_batch and T_batch are the optimal batch size and execution time obtained by the batch parameter analysis; Z is the ratio of the predicted traffic increment ΔF to the optimal batch size Y_batch; y_t represents the real-time request quantity of the previous time interval; σ(Z) is the activation function, here the Sigmoid function. To prevent Z from becoming too large when the service traffic grows rapidly, Z is mapped into the range (0,1) by the Sigmoid activation function. Finally, the moving strategy for the left and right boundaries of the window is as follows:
W_(left+1) = W_left + T_finished

W_(right+1) = W_(left+1) + T_(length+1)

Here W_(left+1) and W_(right+1) refer to the adjusted left and right boundaries, and T_finished is the length of time that batch requests within the window have already completed. From the above formulas, the maximum step size of each window adjustment is the execution time of the optimal batch size obtained in the load container batch parameter analysis. When the traffic keeps increasing, the window is extended each time, so the number of requests in the window may exceed the optimal batch size; in that case the requests in the window are composed into several batches for execution. When the service traffic decreases, the window is reduced until its left and right boundaries coincide, at which point no batching is performed and each service request is called independently.
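A minimal sketch of this adjustment step follows, combining the formulas above; variable names are illustrative, and clamping the window length at zero reflects the boundary-coincidence case described above:

    import math

    def adjust_window(t_length, w_left, b_t, y_pred, y_real, y_batch, t_batch, t_finished):
        p = 1.0 if b_t > 0 else -1.0              # window discriminant P(t)
        delta_f = abs(y_pred - y_real)            # predicted traffic increment ΔF
        z = delta_f / y_batch                     # increment relative to optimal batch size
        sigma = 1.0 / (1.0 + math.exp(-z))        # Sigmoid maps Z into (0,1)
        t_length_next = max(0.0, t_length + p * sigma * t_batch)
        w_left_next = w_left + t_finished         # advance past completed batch requests
        w_right_next = w_left_next + t_length_next
        return t_length_next, w_left_next, w_right_next

When the requests inside the adjusted window exceed Y_batch, the caller splits them greedily into several batches, as described above.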
For this module implementation, the following interface configuration is provided.
    • addQueue(<request[input]>): returns the saved service request queue
    • countTimeInterval(<request[input]>): returns the request time queue
    • initAndResizeWindow(ArrayList): returns the initialized and readjusted window size
    • predictTimeSerires(ArrayList): returns the parameters of the time series prediction
    • Invoking(batch): returns the batch execution results
Among the above modules, the algorithm service calling module:
as shown in fig. 5 and 6, when the release process enters a stage without a server platform, the method first forwards a request for deploying the algorithm to an OpenFaaS Provider through a Gateway, where the Provider deploys through faas-nets according to a function template written by a user, where the Deployment includes configuring a Docker mirror image, generating resource components such as delivery, Service, and Secret of kubernets, deploying the algorithm to a Pod container, then releasing the function as a Service, and performing CRD management on the resource components through an OpenFaaS Operator. When the algorithm calls the service, the request route firstly reaches the Gateway, the request is forwarded to the Watchdog monitor of the 8080 port in the container, and HTTP request information is transmitted through stdin and stdout to complete the calling of the service.
Although specific embodiments of the invention have been disclosed for purposes of illustration, and for purposes of aiding in the understanding of the contents of the invention and its implementation, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. A deep learning model-oriented reasoning batch processing optimization method comprises the following steps:
1) acquiring an online reasoning service request through a request interceptor, and storing reasoning input parameters in the reasoning service request into a preprocessing data set;
2) dividing the inference service request into a CPU type task and a GPU type task according to the type of resources occupied by the inference service request; setting a plurality of containers aiming at the CPU type task, wherein different containers have different memory sizes and CPU core numbers; aiming at the GPU type task, generating a plurality of GPU type task containers with set memory size and GPU kernel number;
3) aiming at the reasoning service of the same model, if the reasoning service is a CPU type task, generating training data sets with different batch processing sizes aiming at the model according to the reasoning input parameters in the preprocessing data set, inputting the training data sets into each container to perform performance test on the model, and then generating a load performance table according to the test result index; then, comparing the batch processing combined execution time Batchtime with the uncombined execution time NoBatchtime in the load performance table, and screening records in the load performance table according to a comparison result of the ratio and a set threshold value delta; then selecting a plurality of records with the highest batch processing frequency from the screened records, and selecting the record with the smallest memory from the records; if the inference service is a GPU type task, generating training data sets with different batch processing sizes according to inference input parameters in the preprocessing data set, inputting the training data sets into a GPU type task container to perform performance test on the model, and selecting a record with the largest batch processing size under the condition of meeting a threshold index;
4) determining the optimal parameters based on the record selected in step 3), namely the batch size Y_batch and the batch execution time T_batch in the selected record;
5) caching the inference service requests received in real time into a task cache queue; then a workload aggregator determines a time interval according to the batch execution time T_batch, and counts the number of inference service requests in each time interval in the task cache queue to obtain a time series;
6) intercepting the inference service requests of the latest period from the time series for prediction to obtain the number of service requests in the next time interval, and inputting it into an adaptive window algorithm model; the adaptive window algorithm model adjusts the left and right boundaries of the current window according to the input information, then the inference service requests within the window are taken from the task cache queue to generate batch processing tasks, and the batch processing tasks are sent to the corresponding containers for execution according to the type of resources they occupy.
2. The method of claim 1, wherein the method for predicting the number of service requests in the next time interval, given the optimal batch size and the optimal batch execution time, comprises: setting the time series as {X1, X2, X3, … Xt}, wherein Xt is the number of inference service requests in the t-th time interval, and setting α as the smoothing coefficient; according to the recurrence relations

S_i^(1) = α×Xi + (1-α)×S_(i-1)^(1), S_i^(2) = α×S_i^(1) + (1-α)×S_(i-1)^(2)

calculating the quadratic exponential smoothing value, wherein S_i^(2) is the quadratic exponential smoothing value of Xi and S_i^(1) is the first exponential smoothing value of Xi; then predicting the number of service requests in the next time interval as Y_(t+1) = a_t + b_t, wherein

a_t = 2×S_t^(1) - S_t^(2), b_t = (α/(1-α))×(S_t^(1) - S_t^(2)).
3. The method of claim 2, wherein the adaptive window algorithm model adjusts the left and right boundaries of the current window according to the input information by: first determining the window discriminant function

P(t) = 1 if b_t > 0, and P(t) = -1 if b_t ≤ 0,

which judges whether the window is enlarged or reduced; then determining the adjusted window according to the formula T_(length+1) = T_length + (P(t)×σ(Z)×T_batch); wherein

σ(Z) = 1/(1+e^(-Z)), Z = ΔF/Y_batch, ΔF = |Y_(t+1) - y_t|;

T_(length+1) is the adjusted window length, and y_t represents the real-time request quantity of the t-th time interval.
4. The method of claim 1, 2 or 3, wherein the strategy for adjusting the left and right boundaries of the current window is: W_(left+1) = W_left + T_finished, W_(right+1) = W_(left+1) + T_(length+1); wherein W_(left+1), W_(right+1) are the adjusted left and right boundaries, and T_finished is the length of time that batch requests within the window have already completed.
5. The method according to claim 1, 2 or 3, characterized in that if the ratio of the batch-merged execution time Batchtime to the un-merged execution time NoBatchtime of a record satisfies Batchtime/NoBatchtime ≤ δ, the record is retained.
6. A deep learning model-oriented reasoning batch optimization system is characterized by comprising a load container batch analysis tool module and a batch merging tool module; wherein
the load container batch processing analysis tool module is used for acquiring online reasoning service requests through the request interceptor and storing the reasoning input parameters of the reasoning service requests into a preprocessing data set; then dividing the inference service requests into CPU type tasks and GPU type tasks according to the type of resources they occupy; setting a plurality of containers for the CPU type task, wherein different containers have different memory sizes and CPU core numbers; generating, for the GPU type task, a plurality of GPU type task containers with set memory size and GPU core number; then, for the reasoning service of the same model, if it is a CPU type task, generating training data sets with different batch sizes for the model according to the reasoning input parameters in the preprocessing data set, inputting them into each container to run performance tests on the model, and then generating a load performance table according to the test result indexes; then comparing the ratio of the batch-merged execution time Batchtime to the un-merged execution time NoBatchtime in the load performance table with the set threshold δ, and screening the records in the load performance table according to the comparison result; then selecting the records with the highest batch processing frequency from the screened records, and selecting the record with the smallest memory from these; if it is a GPU type task, generating training data sets with different batch sizes according to the reasoning input parameters in the preprocessing data set, inputting them into the GPU type task container to run performance tests on the model, and selecting the record with the largest batch size that meets the threshold index; and then determining the optimal parameters based on the selected record, namely the batch size Y_batch and the batch execution time T_batch in the selected record;
the batch processing merging tool module is used for caching the inference service requests received in real time into a task cache queue; then the workload aggregator determines a time interval according to the batch execution time T_batch and counts the number of inference service requests in each time interval in the task cache queue to obtain a time series; then the inference service requests of the latest period are intercepted from the time series for prediction, the number of service requests in the next time interval is obtained and input into the adaptive window algorithm model, the adaptive window algorithm model adjusts the left and right boundaries of the current window according to the input information, then the inference service requests within the window are taken from the task cache queue to generate batch processing tasks, and the batch processing tasks are sent to the corresponding containers for execution according to the type of resources they occupy.
7. The system of claim 6, wherein the method for predicting the number of service requests in the next time interval, given the optimal batch size and the optimal batch execution time, comprises: setting the time series as {X1, X2, X3, … Xt}, wherein Xt is the number of inference service requests in the t-th time interval, and setting α as the smoothing coefficient; according to the recurrence relations

S_i^(1) = α×Xi + (1-α)×S_(i-1)^(1), S_i^(2) = α×S_i^(1) + (1-α)×S_(i-1)^(2)

calculating the quadratic exponential smoothing value, wherein S_i^(2) is the quadratic exponential smoothing value of Xi and S_i^(1) is the first exponential smoothing value of Xi; then predicting the number of service requests in the next time interval as Y_(t+1) = a_t + b_t, wherein

a_t = 2×S_t^(1) - S_t^(2), b_t = (α/(1-α))×(S_t^(1) - S_t^(2)).
8. The system of claim 7, wherein the adaptive window algorithm model adjusts the left and right boundaries of the current window according to the input information by: first determining the window discriminant function

P(t) = 1 if b_t > 0, and P(t) = -1 if b_t ≤ 0,

which judges whether the window is enlarged or reduced; then determining the adjusted window according to the formula T_(length+1) = T_length + (P(t)×σ(Z)×T_batch); wherein

σ(Z) = 1/(1+e^(-Z)), Z = ΔF/Y_batch, ΔF = |Y_(t+1) - y_t|;

T_(length+1) is the adjusted window length, and y_t represents the real-time request quantity of the t-th time interval.
9. The system of claim 6, 7 or 8, wherein the strategy for adjusting the left and right boundaries of the current window is: W_(left+1) = W_left + T_finished, W_(right+1) = W_(left+1) + T_(length+1); wherein W_(left+1), W_(right+1) are the adjusted left and right boundaries, and T_finished is the length of time that batch requests within the window have already completed.
10. The system according to claim 6, 7 or 8, characterized in that if the ratio of the batch-merged execution time Batchtime to the un-merged execution time NoBatchtime of a record satisfies Batchtime/NoBatchtime ≤ δ, the record is retained.
CN202111151184.8A 2021-09-29 2021-09-29 Deep learning model-oriented reasoning batch processing optimization method and system Pending CN113902116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111151184.8A CN113902116A (en) 2021-09-29 2021-09-29 Deep learning model-oriented reasoning batch processing optimization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111151184.8A CN113902116A (en) 2021-09-29 2021-09-29 Deep learning model-oriented reasoning batch processing optimization method and system

Publications (1)

Publication Number Publication Date
CN113902116A true CN113902116A (en) 2022-01-07

Family

ID=79189168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111151184.8A Pending CN113902116A (en) 2021-09-29 2021-09-29 Deep learning model-oriented reasoning batch processing optimization method and system

Country Status (1)

Country Link
CN (1) CN113902116A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115086189A (en) * 2022-05-20 2022-09-20 中国科学院软件研究所 Server-free computing oriented service resource elastic expansion method and system
CN115086189B (en) * 2022-05-20 2023-11-07 中国科学院软件研究所 Service resource elastic expansion method and system oriented to serverless computing
CN116610960A (en) * 2023-07-20 2023-08-18 北京万界数据科技有限责任公司 Monitoring management system for artificial intelligence training parameters
CN116610960B (en) * 2023-07-20 2023-10-13 北京万界数据科技有限责任公司 Monitoring management system for artificial intelligence training parameters
CN116702907A (en) * 2023-08-02 2023-09-05 北京大学 Server-unaware large language model reasoning system, method and equipment
CN116702907B (en) * 2023-08-02 2023-11-14 北京大学 Server-unaware large language model reasoning system, method and equipment

Similar Documents

Publication Publication Date Title
US20220391771A1 (en) Method, apparatus, and computer device and storage medium for distributed training of machine learning model
CN113902116A (en) Deep learning model-oriented reasoning batch processing optimization method and system
US20220351019A1 (en) Adaptive Search Method and Apparatus for Neural Network
CN109885397B (en) Delay optimization load task migration algorithm in edge computing environment
WO2019184836A1 (en) Data analysis device, and multi-model co-decision system and method
CN111966484A (en) Cluster resource management and task scheduling method and system based on deep reinforcement learning
CN113037877B (en) Optimization method for time-space data and resource scheduling under cloud edge architecture
Wolfrath et al. Haccs: Heterogeneity-aware clustered client selection for accelerated federated learning
CN112101525A (en) Method, device and system for designing neural network through NAS
CN106100922B (en) The prediction technique and device of the network flow of Train Communication Network
CN108121312A (en) ARV SiteServer LBSs and method based on integrated water electricity control platform
CN116340006A (en) Computing power resource idle prediction method based on deep learning and storage medium
Zhang et al. Learning-driven interference-aware workload parallelization for streaming applications in heterogeneous cluster
CN109558248A (en) A kind of method and system for the determining resource allocation parameters calculated towards ocean model
Song et al. Adaptive and collaborative edge inference in task stream with latency constraint
CN115941696A (en) Heterogeneous Big Data Distributed Cluster Storage Optimization Method
CN117971475A (en) Intelligent management method and system for GPU computing force pool
CN117827434A (en) Mixed elastic telescoping method based on multidimensional resource prediction
Zhang et al. A locally distributed mobile computing framework for DNN based android applications
CN114650321A (en) Task scheduling method for edge computing and edge computing terminal
CN115412401B (en) Method and device for training virtual network embedding model and virtual network embedding
Shuang et al. Task scheduling based on Grey Wolf optimizer algorithm for smart meter embedded operating system
CN106874215B (en) Serialized storage optimization method based on Spark operator
Prado et al. On providing quality of service in grid computing through multi-objective swarm-based knowledge acquisition in fuzzy schedulers
CN114281474A (en) Resource adjusting method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination