CN115952866A - Inference method, computer equipment and medium for artificial intelligence inference framework


Info

Publication number
CN115952866A
Authority
CN
China
Prior art keywords
inference
model
request
batch size
instance
Prior art date
Legal status
Pending
Application number
CN202310002237.2A
Other languages
Chinese (zh)
Inventor
祖春山
Current Assignee
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd
Priority to CN202310002237.2A
Publication of CN115952866A

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the invention discloses an inference method, a computer device and a medium for an artificial intelligence inference framework. In one embodiment, the method comprises: acquiring inference requests; performing inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information contained in the inference requests and the computing resource occupancy rate of the framework, and configuring the number of instances of each inference model and the maximum batch size of each instance according to the evaluation result; and loading the inference model onto the instances according to the number of inference requests, the number of instances of the inference model and the maximum batch size of each instance, so as to perform inference processing on the inference requests. In this way, inference configuration and inference scheduling are continuously and dynamically optimized in cloud computing AI service and edge computing AI service scenarios, realizing dynamic inference performance optimization of the AI inference framework and improving inference efficiency.

Description

Inference method, computer equipment and medium for artificial intelligence inference framework
Technical Field
The invention relates to the technical field of artificial intelligence, and more particularly to an inference method, a computer device and a medium for an artificial intelligence inference framework.
Background
At present, in scenarios such as cloud computing artificial intelligence (AI) services and edge computing AI services, the inference efficiency of an AI inference framework is difficult to guarantee because computing resources are limited. In particular, when a large number of inference requests arrive within a short time or the maximum allowable delay of the inference requests is small, simply processing the inference requests one by one in their order of arrival makes it difficult to meet requirements such as latency.
Disclosure of Invention
An object of the present invention is to provide an inference method, a computer device and a medium for an artificial intelligence inference framework, so as to solve at least one of the problems in the prior art.
To achieve this object, the invention adopts the following technical solutions.
A first aspect of the invention provides an inference method for an artificial intelligence inference framework, which comprises the following steps:
Acquiring an inference request;
performing inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information contained in the inference request and the computing resource occupancy rate of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result; and
loading the inference model onto the instances according to the number of inference requests, the number of instances of the inference model and the maximum batch size of each instance, so as to perform inference processing on the inference requests.
Optionally, the performing inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information included in the inference request and the computing resource occupancy rate of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result includes:
judging whether the delay requirement satisfaction rate of the current overall inference request is greater than or equal to a preset satisfaction rate threshold or not according to the maximum allowable delay information contained in the inference request:
if so, reducing the number of instances of the inference model corresponding to the inference request meeting the maximum allowable delay and/or the maximum batch size of the instances;
and if not, increasing the number of the instances of the inference model corresponding to the inference request which does not meet the maximum allowable delay and/or the maximum batch size of the instances.
Optionally, the performing inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information included in the inference request and the computing resource occupancy rate of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result includes:
when it is judged according to the acquired inference request that a new inference model needs to be loaded, judging whether the idle computing resources of the artificial intelligence inference framework meet the computing resource requirements of the new inference model:
if yes, starting a new inference model;
if not, calculating the difference between the idle computing resources of the artificial intelligence inference framework and the computing resource requirements of the new inference model, and reducing, according to the difference, the number of instances and/or the maximum batch size of the instances of the inference model corresponding to the inference requests that meet the maximum allowable delay.
Optionally, the performing inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information included in the inference request and the computing resource occupancy rate of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result includes:
deactivating the inference model that has completed the inference process upon determining that there is an inference model that has completed the inference process.
Optionally, the performing inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information included in the inference request and the computing resource occupancy rate of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result includes: performing inference performance evaluation on the artificial intelligence inference framework at set time intervals according to the maximum allowable delay information included in the inference request and the computing resource occupancy rate of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result.
Optionally, said loading the inference model to the instances according to the number of inference requests, the number of instances of the inference model, and the maximum batch size of each instance includes:
judging whether the inference request quantity corresponding to each inference model is larger than a preset maximum inference request quantity threshold value:
if so, setting the batch size of each instance of the inference model as the maximum batch size, and loading the inference model to the instance to perform inference processing on the inference request;
if not, the batch size of each instance of the inference model is set to be a preset batch size, and the inference model is loaded to the instances so as to perform inference processing on the inference request.
Optionally, the loading the inference model to the instances according to the number of inference requests, the number of instances of the inference model, and the maximum batch size of each instance includes:
judging whether the inference request quantity corresponding to each inference model is smaller than a preset minimum inference request quantity threshold value:
if yes, the batch size of each instance of the inference model is set to be a preset minimum batch size, and the inference model is loaded to the instances so as to carry out inference processing on inference requests;
if not, the batch size of each instance of the inference model is set to be a preset batch size, and the inference model is loaded to the instances so as to perform inference processing on the inference request.
Optionally, said loading the inference model to the instances according to the number of inference requests, the number of instances of the inference model, and the maximum batch size of each instance includes:
judging the relation between the inference request quantity corresponding to each inference model and a preset maximum inference request quantity threshold value and a preset minimum inference request quantity threshold value:
if the inference request number corresponding to the inference model is larger than a preset maximum inference request number threshold value, setting the batch size of each instance of the inference model as the maximum batch size, and loading the inference model to the instance to perform inference processing on the inference request;
if the inference request quantity corresponding to the inference model is less than or equal to a preset maximum inference request quantity threshold value and more than or equal to a preset minimum inference request quantity threshold value, setting the batch size of each instance of the inference model as a preset batch size, and loading the inference model to the instance to perform inference processing on the inference request;
and if the reasoning request number corresponding to the reasoning model is smaller than a preset minimum reasoning request number threshold value, setting the batch size of each instance of the reasoning model as a preset minimum batch size, and loading the reasoning model to the instance to perform reasoning processing on the reasoning request.
Optionally, the preset batch size of the example is half of the maximum batch size of the example.
Optionally, the preset minimum batch size of the example is 1.
A second aspect of the invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method provided by the first aspect of the invention when executing the program.
A third aspect of the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method provided by the first aspect of the invention.
The invention has the following beneficial effects:
according to the technical scheme, dynamic reasoning performance optimization of an AI reasoning framework can be realized through continuous dynamic optimization reasoning configuration and reasoning scheduling under the scenes of cloud computing AI service and edge computing AI service, and reasoning efficiency is improved.
Drawings
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
Fig. 1 is a flow chart illustrating an inference method of an artificial intelligence inference framework according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart illustrating inference performance evaluation and inference configuration in the inference method of the artificial intelligence inference framework according to the embodiment of the present invention.
Fig. 3 is a schematic flow chart illustrating inference scheduling for an inference model in the inference method of the artificial intelligence inference framework according to the embodiment of the present invention.
FIG. 4 illustrates a diagram of inference requests and corresponding inference scheduling results.
FIG. 5 illustrates another schematic diagram of inference requests and corresponding inference scheduling results.
FIG. 6 illustrates another diagram of inference requests and corresponding inference scheduling results.
Fig. 7 is another flow chart of the inference method of the artificial intelligence inference framework provided by the embodiment of the invention.
Fig. 8 is a schematic structural diagram of a computer system for executing the inference method of the artificial intelligence inference framework provided by the embodiment of the invention.
Detailed Description
In order to more clearly illustrate the present invention, the present invention will be further described with reference to the following examples and the accompanying drawings. Similar parts in the figures are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.
Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline that involves a wide range of technologies at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, edge computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
At present, in scenarios such as cloud computing AI services and edge computing AI services, the inference efficiency of an AI inference framework is difficult to guarantee because computing resources are limited. In particular, when a large number of inference requests arrive within a short time or the maximum allowable delay of the inference requests is small, processing the inference requests one by one in their order of arrival makes it difficult to meet requirements such as latency.
In view of this, the embodiment of the present invention provides an inference method of an artificial intelligence inference framework.
It should be noted that the inference method of the artificial intelligence inference framework provided in this embodiment is generally executed by an inference server, which may be hardware or software. When the inference server is hardware, it may be implemented as a distributed cluster composed of multiple servers or as a single server. When the inference server is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.
As shown in fig. 1, the inference method of the artificial intelligence inference framework provided in this embodiment includes the following steps:
and S110, acquiring an inference request.
In a specific example, the inference requests (Requests) are obtained from, for example, a video server. The video server is connected to, for example, a large number of cameras; after obtaining the video images captured by the cameras, it issues inference requests for tasks such as target detection, face key point localization, face feature extraction, super-resolution and noise reduction, and the inference server executing the inference method of the artificial intelligence inference framework provided in this embodiment acquires these inference requests.
The inference models deployed in the artificial intelligence inference framework include, for example, a target detection model, a face key point localization model and a face feature extraction model commonly used in the computer vision field, as well as a super-resolution model, a noise reduction model and the like in the image processing field. Each inference model exists in the form of a model file. When the inference server executing the inference method of the artificial intelligence inference framework provided in this embodiment acquires an inference request, the inference model of the corresponding type is loaded into one or more instances through the subsequent inference scheduling and inference processing steps to perform inference processing on the inference request. The artificial intelligence inference framework includes various types of hardware, such as a CPU (central processing unit), a GPU (graphics processing unit) and an NPU (neural-network processing unit), which provide the computing resources to run the instances. An instance may group multiple inference requests of the same type together for parallel inference according to a set batch size (Batch Size).
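To make the relationship between inference models, instances, batch sizes and instance types concrete, the following sketch shows one possible in-memory representation; the class and field names are illustrative assumptions and do not appear in the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InferenceInstance:
    instance_id: int
    device: str                 # instance type: "CPU", "GPU" or "NPU"
    max_batch_size: int         # upper bound set by the inference configuration (step S120)
    batch_size: int             # value actually used by the inference scheduling (step S130)
    max_latency_ms: float       # maximum allowable delay of the instance

@dataclass
class InferenceModel:
    name: str                                   # e.g. "super_resolution" or "face_detection"
    model_file: str                             # path of the model file to be loaded
    instances: List[InferenceInstance] = field(default_factory=list)
    request_queue: list = field(default_factory=list)   # pending inference requests of this type
```

Here max_batch_size and max_latency_ms are the quantities reconfigured by step S120, while batch_size is the value the scheduler of step S130 chooses for the next batch.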
And S120, carrying out inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information contained in the inference request and the computing resource occupancy rate of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to an inference performance evaluation result.
In one possible implementation, step S120 further includes:
judging whether the delay requirement satisfaction rate of the current overall inference request is greater than or equal to a preset satisfaction rate threshold value or not according to the maximum allowable delay information contained in the inference request:
if so, reducing the number of instances of the inference model corresponding to the inference request meeting the maximum allowable delay and/or the maximum batch size of the instances;
and if not, increasing the number of the instances of the inference model corresponding to the inference request which does not meet the maximum allowable delay and/or the maximum batch size of the instances.
In this way, when the current overall inference request delay requirement satisfaction rate is greater than or equal to the preset satisfaction rate threshold, computing resources are released by reducing the number of instances and/or the maximum batch size of the instances of the inference models corresponding to the inference requests that meet the maximum allowable delay. When the current overall inference request delay requirement satisfaction rate is less than the preset satisfaction rate threshold, the overall satisfaction rate is improved by increasing the number of instances and/or the maximum batch size of the instances of the inference models corresponding to the inference requests that do not meet the maximum allowable delay. The inference configuration is thus continuously and dynamically optimized according to the current overall inference request delay requirement satisfaction rate, realizing dynamic inference performance optimization of the AI inference framework.
In one possible implementation, step S120 further includes:
when it is judged according to the acquired inference request that a new inference model needs to be loaded, judging whether the idle computing resources of the artificial intelligence inference framework meet the computing resource requirements of the new inference model:
if yes, starting a new inference model;
if not, calculating the difference between the idle computing resources of the artificial intelligence inference framework and the computing resource requirements of the new inference model, and reducing, according to the difference, the number of instances and/or the maximum batch size of the instances of the inference models corresponding to the inference requests that meet the maximum allowable delay.
In this way, when the idle computing resources of the artificial intelligence inference framework cannot meet the computing resource requirements of a new inference model, computing resources are released by calculating the difference between the idle computing resources and the computing resource requirements of the new inference model, and reducing, according to the difference, the number of instances and/or the maximum batch size of the instances of the inference models corresponding to the inference requests that meet the maximum allowable delay. The configuration is thus dynamically optimized according to this difference whenever it is determined from the acquired inference requests that a new inference model needs to be loaded but the idle computing resources cannot meet its requirements, realizing dynamic inference performance optimization of the AI inference framework.
In one possible implementation, step S120 further includes:
and when judging that the inference model which has completed the inference process exists, deactivating the inference model which has completed the inference process.
In this way, when there is an inference model that has completed its inference processing, that inference model is deactivated to release computing resources in a timely manner, which also contributes to the dynamic inference performance optimization of the AI inference framework.
In combination with the three implementation manners of step S120 above, in a specific example, as shown in fig. 2, the inference performance evaluation and inference configuration of step S120 further include the following sub-steps (a code sketch of this flow follows the sub-steps):
S1201. After the process starts, judge whether there is an inference model that has completed its inference processing: if yes, go to sub-step S1202; if not, go to sub-step S1203.
S1202. Deactivate the inference model that has completed its inference processing to release computing resources, and end the process.
S1203. Judge whether a new inference model needs to be loaded: if yes, go to sub-step S1204; if not, go to sub-step S1205.
S1204. Judge whether the idle computing resources of the artificial intelligence inference framework meet the computing resource requirements of the new inference model: if yes, go to sub-step S1206; if not, go to sub-step S1207.
S1206. Start the new inference model, and end the process.
S1207. Calculate the difference between the idle computing resources of the artificial intelligence inference framework and the computing resource requirements of the new inference model, and go to sub-step S1208.
S1208. Reduce, according to the difference, the number of instances and/or the maximum batch size of the instances of the inference models corresponding to the inference requests that meet the maximum allowable delay, so as to release computing resources, and end the process.
S1205. Judge whether the current overall inference request delay requirement satisfaction rate lr is greater than or equal to the preset satisfaction rate threshold Tlr: if yes, go to sub-step S1209; if not, go to sub-step S1210.
S1209. Reduce the number of instances and/or the maximum batch size of the instances of the inference models corresponding to the inference requests that meet the maximum allowable delay, so as to release computing resources, and end the process.
S1210. Increase the number of instances and/or the maximum batch size of the instances of the inference models corresponding to the inference requests that do not meet the maximum allowable delay, so as to improve the overall inference request delay requirement satisfaction rate, and end the process.
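The flow of sub-steps S1201-S1210 can be condensed into the following sketch. All framework methods (for example, find_finished_models or shrink_satisfied_models) are placeholders assumed for illustration; the patent does not prescribe their implementation.

```python
def evaluate_and_configure(framework, lr_threshold):
    """One round of inference performance evaluation and configuration (fig. 2 sketch)."""
    # S1201/S1202: deactivate inference models that have completed their inference processing.
    finished = framework.find_finished_models()
    if finished:
        for model in finished:
            framework.deactivate(model)              # release computing resources
        return

    # S1203/S1204: a new inference model is required by the incoming requests.
    new_model = framework.pending_new_model()
    if new_model is not None:
        if framework.free_resources() >= new_model.required_resources():
            framework.start(new_model)               # S1206: enough idle resources
        else:
            gap = new_model.required_resources() - framework.free_resources()   # S1207
            framework.shrink_satisfied_models(amount=gap)                       # S1208
        return

    # S1205: compare the overall delay requirement satisfaction rate lr with the threshold Tlr.
    if framework.overall_satisfaction_rate() >= lr_threshold:
        framework.shrink_satisfied_models()          # S1209: fewer instances / smaller max batch size
    else:
        framework.grow_unsatisfied_models()          # S1210: more instances / larger max batch size
```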
Referring to fig. 2, the inference performance evaluation and inference configuration implemented in step S120 is to perform configuration of the number of instances, the type of instances, the maximum batch size of the instances, and the maximum allowable delay of the instances on each inference model after dynamically evaluating inference performance and inference demand conditions.
The number of inference models can be dynamically adjusted through inference configuration, each inference model can correspond to one or more instances, the number of instances and the type of the instances of each inference model can also be dynamically adjusted through inference configuration, and the type of the instances is hardware corresponding to the instances, such as a CPU, a GPU, an NPU and the like.
Further, for each instance, the maximum batch size (Max Batch Size) and the maximum allowable delay (Max Latency) of the instance may be dynamically adjusted by the inference configuration.
In this embodiment, the inference performance evaluation implemented in step S120 evaluates the inference performance in combination with the inference requirements. The inference performance includes the actual delay of each inference request and the occupancy rate of the computing resources (such as CPU, GPU, NPU and memory) of the artificial intelligence inference framework, i.e., the computing resource occupancy of all inference models as a whole. The inference requirements include the maximum allowable delay information of each inference request, whether a new inference model needs to be loaded, and whether there is an inference model that has completed its inference processing.
For example, indicators of inference performance evaluation include:
(1) The overall inference request delay requirement satisfaction rate lr (the larger the value of lr, the better):

lr = NumLatencySatisfied / NumRequest

where NumRequest denotes the number of inference requests and NumLatencySatisfied denotes the number of inference requests whose actual delay is less than or equal to the maximum allowable delay.
(2) Occupancy of computational resources of an artificial intelligence reasoning framework.
The larger the overall inference request delay requirement satisfaction rate lr, the better; and the smaller the occupancy rate of the computing resources of the artificial intelligence inference framework, the better.
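A minimal illustration of the lr indicator, assuming each completed request records its actual delay and its maximum allowable delay (the attribute names are assumptions):

```python
def delay_satisfaction_rate(finished_requests):
    """lr = NumLatencySatisfied / NumRequest over the inference requests observed so far."""
    if not finished_requests:
        return 1.0   # no requests to violate; treated as fully satisfied (an assumption)
    satisfied = sum(1 for r in finished_requests
                    if r.actual_delay <= r.max_allowed_delay)
    return satisfied / len(finished_requests)

# Example: if 98 of 100 requests met their maximum allowable delay, lr = 0.98,
# which satisfies a preset threshold Tlr of 98% but not one of 99%.
```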
In this embodiment, the configuration objects of the inference configuration implemented in step S120 include the number of instances and the instance type of each inference model, as well as the maximum batch size and the maximum allowable delay of each instance. The goal of the inference configuration is to increase the overall inference request delay requirement satisfaction rate lr while reducing the occupancy rate of the computing resources of the artificial intelligence inference framework, for example, reducing the computing resource occupancy as much as possible on the premise that lr is greater than or equal to the preset satisfaction rate threshold Tlr. The preset satisfaction rate threshold Tlr may be set in advance according to the actual scenario: if all inference requests are required to be completed within their maximum allowable delay, Tlr may be set to 100%; if a small amount of delay violation is allowed, Tlr may be set to 99%, 98%, and so on.
In a possible implementation manner, step S120 is performed at set time intervals, for example, step S120 is performed every N seconds, where N is greater than or equal to 1 second and less than or equal to 5 seconds. Therefore, the real-time performance of the inference performance evaluation and the inference configuration realized in step S120 can be further ensured.
And S130, loading the inference model to the instances according to the number of the inference requests, the number of the instances of the inference model and the maximum batch size of each instance so as to perform inference processing on the inference requests.
Step S130 implements inference scheduling and inference processing. The inference scheduling is used to efficiently allocate the inference requests in the inference request queue corresponding to each inference model to the corresponding instances, taking into account the status of the instances and the processing status of the inference requests (e.g., whether an instance is currently performing inference processing, the batch size set for an instance, etc.).
In a first possible implementation manner, step S130 further includes:
judging whether the inference request quantity corresponding to each inference model is larger than a preset maximum inference request quantity threshold value:
if so, setting the batch size of each instance of the inference model as the maximum batch size, and loading the inference model to the instance to perform inference processing on the inference request;
if not, the batch size of each instance of the inference model is set to be a preset batch size, and the inference model is loaded to the instances so as to perform inference processing on the inference request.
In this way, when the number of inference requests corresponding to an inference model is greater than the preset maximum inference request number threshold, i.e., the model has a large number of pending inference requests, setting the batch size of each instance of the inference model to the maximum batch size gives priority to improving throughput (i.e., the number of inference requests completed per second). When the number of inference requests corresponding to the inference model is less than or equal to the preset maximum inference request number threshold, setting the batch size of each instance to the preset batch size, which serves as the optimal batch size, balances improving throughput with reducing the delay of inference requests. This inference scheduling further contributes to the dynamic inference performance optimization of the AI inference framework.
In a second possible implementation manner, step S130 further includes:
judging whether the inference request quantity corresponding to each inference model is smaller than a preset minimum inference request quantity threshold value:
if so, setting the batch size of each instance of the inference model as a preset minimum batch size, and loading the inference model to the instance to perform inference processing on the inference request;
if not, the batch size of each instance of the inference model is set to be a preset batch size, and the inference model is loaded to the instances so as to perform inference processing on the inference request.
In this way, when the number of inference requests corresponding to an inference model is smaller than the preset minimum inference request number threshold, i.e., the model has only a few pending inference requests, setting the batch size of each instance of the inference model to the minimum batch size gives priority to reducing the delay of the inference requests. When the number of inference requests corresponding to the inference model is greater than or equal to the preset minimum inference request number threshold, setting the batch size of each instance to the preset batch size, which serves as the optimal batch size, balances improving throughput with reducing the delay of inference requests. This inference scheduling further contributes to the dynamic inference performance optimization of the AI inference framework.
In a third possible implementation manner, step S130 further includes:
judging the relation between the inference request quantity corresponding to each inference model and a preset maximum inference request quantity threshold value and a preset minimum inference request quantity threshold value:
if the inference request number corresponding to the inference model is larger than a preset maximum inference request number threshold value, the batch size of each instance of the inference model is set as the maximum batch size, and the inference model is loaded to the instance so as to perform inference processing on the inference request;
if the inference request quantity corresponding to the inference model is less than or equal to a preset maximum inference request quantity threshold value and greater than or equal to a preset minimum inference request quantity threshold value, setting the batch size of each instance of the inference model as a preset batch size, and loading the inference model to the instance to perform inference processing on the inference request;
and if the reasoning request number corresponding to the reasoning model is smaller than a preset minimum reasoning request number threshold value, setting the batch size of each instance of the reasoning model as a preset minimum batch size, and loading the reasoning model to the instance to perform reasoning processing on the reasoning request.
In this way, when the number of inference requests corresponding to an inference model is greater than the preset maximum inference request number threshold, i.e., the model has a large number of pending inference requests, setting the batch size of each instance of the inference model to the maximum batch size gives priority to improving throughput. When the number of inference requests is less than or equal to the preset maximum inference request number threshold and greater than or equal to the preset minimum inference request number threshold, setting the batch size of each instance to the preset batch size, which serves as the optimal batch size, balances improving throughput with reducing the delay of inference requests. When the number of inference requests is smaller than the preset minimum inference request number threshold, i.e., the model has only a few pending inference requests, setting the batch size of each instance to the minimum batch size gives priority to reducing the delay of the inference requests. This inference scheduling further contributes to the dynamic inference performance optimization of the AI inference framework.
Further, for the first to third implementation manners of step S130, the preset batch size of the example is one half of the maximum batch size of the example. For example, if the maximum batch size of an instance is 8, then the preset batch size of the instance as the optimal batch size is 4, i.e., the instance pieces together 4 inference requests and processes them in parallel.
Further, for the second and third implementation manners of step S130, the preset minimum batch size of the example is 1.
It can be understood that the third implementation manner of step S130 is a combination of the first and second implementation manners; it is exemplified below. In a specific example, for one inference model, as shown in fig. 3, the inference scheduling process in the third implementation manner of step S130 includes the following sub-steps (a code sketch follows the sub-steps):
S1301. After the process starts, judge whether the length of the inference request queue (i.e., the number of inference requests) is greater than the preset maximum inference request number threshold Max Thr (for example, Max Thr = 20): if yes, go to sub-step S1302; if not, go to sub-step S1303.
S1302. Set the batch size of each instance of the inference model to the maximum batch size, so as to give priority to improving throughput, and end the process (after the inference scheduling process ends, the inference model is subsequently loaded onto the instances to perform inference processing on the inference requests).
S1303. Judge whether the length of the inference request queue (i.e., the number of inference requests) is smaller than the preset minimum inference request number threshold Min Thr (for example, Min Thr = 5): if yes, go to sub-step S1304; if not, go to sub-step S1305.
S1304. Set the batch size of each instance of the inference model to the preset minimum batch size, so as to give priority to reducing the delay of the inference requests, and end the process.
S1305. Set the batch size of each instance of the inference model to the preset batch size serving as the optimal batch size (for example, one half of the maximum batch size of the instance), so as to balance improving throughput with reducing the delay of the inference requests, and end the process.
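Sub-steps S1301-S1305 amount to a three-way comparison of the queue length against Max Thr and Min Thr. The sketch below uses the example values Max Thr = 20 and Min Thr = 5 and the "half of the maximum batch size" preset from the text; the parameter names are assumptions.

```python
def select_batch_size(queue_length, max_batch_size,
                      max_thr=20, min_thr=5, min_batch_size=1):
    """Choose the batch size of one instance from the model's queue length (fig. 3 sketch)."""
    if queue_length > max_thr:                # S1301/S1302: long queue, favour throughput
        return max_batch_size
    if queue_length < min_thr:                # S1303/S1304: short queue, favour low latency
        return min_batch_size
    return max(max_batch_size // 2, 1)        # S1305: preset (optimal) batch size

# For instance 2 of fig. 4 (max_batch_size = 8): a queue of 21 requests gives 8,
# a queue of 13 gives 4, and a queue of 3 gives 1, matching figs. 4-6.
```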
An example of the inference scheduling process in the third implementation of the above step S130 shown in fig. 3 is as follows.
As shown in fig. 4, for the super-resolution model, at the current time and subsequent times, based on the inference performance evaluation and inference configuration implemented in step S120, the super-resolution model corresponds to two instances, instance 1 and instance 2; the maximum batch size of instance 1 is 4 and the maximum batch size of instance 2 is 8.
As shown in fig. 4, 21 super-resolution requests numbered 1-21 are successively acquired at time T1. The length of the super-resolution request queue at time T1 is 21, which is greater than the preset maximum inference request number threshold Max Thr = 20. Therefore, from time T1 to time T2, the batch size of instance 1 is set to its maximum batch size 4 and the batch size of instance 2 is set to its maximum batch size 8; instance 1 groups super-resolution requests 1-4 together for parallel inference, and instance 2 groups super-resolution requests 5-12 together for parallel inference.
As shown in fig. 4 and fig. 5, 4 super-resolution requests numbered 22-25 are acquired at time T2. Together with the 9 super-resolution requests numbered 13-21 that have not yet been processed, the length of the super-resolution request queue is 13, which is less than or equal to the preset maximum inference request number threshold Max Thr = 20 and greater than or equal to the preset minimum inference request number threshold Min Thr = 5. Therefore, from time T2 to time T3, the batch size of instance 1 is set to the preset batch size 2 (2 = 4/2) and the batch size of instance 2 is set to the preset batch size 4 (4 = 8/2); instance 1 groups super-resolution requests 13-14 together for parallel inference, and instance 2 groups super-resolution requests 15-18 together for parallel inference.
As shown in fig. 5, 1 super-resolution request numbered 26 is acquired at time T3. Together with the 7 super-resolution requests numbered 19-25 that have not yet been processed, the length of the super-resolution request queue is 8, which is less than or equal to the preset maximum inference request number threshold Max Thr = 20 and greater than or equal to the preset minimum inference request number threshold Min Thr = 5. Therefore, from time T3 to time T4, the batch size of instance 1 is set to the preset batch size 2 (2 = 4/2) and the batch size of instance 2 is set to the preset batch size 4 (4 = 8/2); instance 1 groups super-resolution requests 19-20 together for parallel inference, and instance 2 groups super-resolution requests 21-24 together for parallel inference.
As shown in fig. 5 and fig. 6, 1 super-resolution request numbered 27 is acquired at time T4. Together with the 2 super-resolution requests numbered 25-26 that have not yet been processed, the length of the super-resolution request queue is 3, which is smaller than the preset minimum inference request number threshold Min Thr = 5. Therefore, from time T4 to time T5, the batch sizes of instance 1 and instance 2 are both set to the minimum batch size 1; instance 1 performs inference on super-resolution request 25 and instance 2 performs inference on super-resolution request 26.
As shown in fig. 6, 1 super-resolution request numbered 28 is acquired at time T5. Together with the 1 super-resolution request numbered 27 that has not yet been processed, the length of the super-resolution request queue is 2, which is smaller than the preset minimum inference request number threshold Min Thr = 5. Therefore, from time T5 to time T6, the batch sizes of instance 1 and instance 2 are both set to the minimum batch size 1; instance 1 performs inference on super-resolution request 27 and instance 2 performs inference on super-resolution request 28.
In addition, the maximum allowable delay of an instance configured in step S120 bounds the batch assembly in step S130. If the waiting time for assembling a batch of the set size exceeds the maximum allowable delay of the instance, inference processing is performed immediately even if the batch cannot be filled to the desired size given the current inference request queue length. For example, in the example above, if at some moment the length of the super-resolution request queue is 5 and at least one of instances 1 and 2 has reached its maximum allowable delay, the instance that has reached the maximum allowable delay starts inference processing immediately, even though the batch size of instance 1 is set to 2 and the batch size of instance 2 is set to 4.
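The timeout behaviour described above can be sketched as a batch-collection helper that dispatches a partial batch once the maximum allowable delay of the instance has elapsed; the function and its parameters are illustrative assumptions.

```python
import time

def collect_batch(request_queue, target_batch_size, max_latency_s, poll_s=0.001):
    """Gather up to target_batch_size requests of the same type for one instance,
    but return a partial batch once max_latency_s has elapsed."""
    batch = []
    deadline = time.monotonic() + max_latency_s
    while len(batch) < target_batch_size and time.monotonic() < deadline:
        if request_queue:
            batch.append(request_queue.pop(0))    # take the oldest pending request
        else:
            time.sleep(poll_s)                    # briefly wait for further requests
    return batch                                  # may hold fewer than target_batch_size requests
```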
It should be understood by those skilled in the art that although the above steps are described in the order S110-S130, they are not necessarily performed in that order. The inference performance evaluation and inference configuration of step S120 are performed continuously at set time intervals, whereas the inference scheduling and inference processing of step S130 are performed whenever unprocessed inference requests exist in the inference request queue or new inference requests are acquired after the inference model completes the inference processing of the requests already assigned to it. In addition, step S110 may be performed in real time. Therefore, step S120 may be performed in parallel with step S130, step S130 may be performed before step S120, and so on. If step S120 performs inference configuration while step S130 is carrying out inference scheduling, and the number of instances and/or the maximum batch size of the instances of an inference model is updated, the updated values take effect in the next round of inference scheduling. Based on the above, the flow of the inference method of the artificial intelligence inference framework provided in this embodiment may also be described with reference to fig. 7 as follows: first, after the whole process starts, the artificial intelligence inference framework loads default parameters for initialization; then, the artificial intelligence inference framework starts the inference service, whose main body is the inference scheduling and inference processing of a group of inference models, while inference performance evaluation and inference configuration are performed at set time intervals; the inference service continues until an end signal is received, after which the computing resources are released and the process ends.
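Putting the pieces together, the overall service flow of fig. 7 might look like the following sketch, with periodic evaluation/configuration (step S120) in a background thread and scheduling/processing (step S130) in the main loop. All framework methods are assumed placeholders, and evaluate_and_configure refers to the earlier sketch.

```python
import threading
import time

def run_inference_service(framework, lr_threshold, eval_interval_s=2.0):
    """Overall service flow sketched from fig. 7 (all framework methods are assumed)."""
    framework.load_default_parameters()                # initialization with default parameters
    stop = threading.Event()

    def periodic_evaluation():
        # Step S120: inference performance evaluation and configuration at set intervals,
        # e.g. every 1-5 seconds.
        while not stop.is_set():
            evaluate_and_configure(framework, lr_threshold)
            time.sleep(eval_interval_s)

    threading.Thread(target=periodic_evaluation, daemon=True).start()

    # Step S130: inference scheduling and processing for each model with pending requests,
    # repeated until an end signal is received.
    while not framework.end_signal_received():
        framework.schedule_and_infer()

    stop.set()
    framework.release_resources()                      # release computing resources and finish
```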
In summary, the inference method of the artificial intelligence inference framework provided by this embodiment can implement dynamic inference performance optimization of the AI inference framework through continuous dynamic optimization inference configuration and inference scheduling in the cloud computing AI service and edge computing AI service scenarios, thereby improving inference efficiency.
As shown in fig. 8, a computer system suitable for executing the inference method of the artificial intelligence inference framework provided by the above-described embodiments includes a central processing unit (CPU), which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) or a program loaded from a storage section into a random access memory (RAM). The RAM also stores various programs and data necessary for the operation of the computer system. The CPU, the ROM and the RAM are connected to one another via a bus, and an input/output (I/O) interface is also connected to the bus.
The following components are connected to the I/O interface: an input section including a keyboard, a mouse and the like; an output section including a liquid crystal display (LCD), a speaker and the like; a storage section including a hard disk and the like; and a communication section including a network interface card such as a LAN card or a modem. The communication section performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface as needed. A removable medium, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive as needed, so that the computer program read from it can be installed into the storage section as needed.
In particular, the processes described in the above flowcharts may be implemented as computer software programs according to the present embodiment. For example, the present embodiments include a computer program product comprising a computer program tangibly embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium.
The flowchart and schematic diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to the present embodiments. In this regard, each block in the flowchart or schematic diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the schematic and/or flowchart illustration, and combinations of blocks in the schematic and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
On the other hand, the present embodiment also provides a non-volatile computer storage medium, which may be the non-volatile computer storage medium included in the apparatus of the foregoing embodiments, or may be a non-volatile computer storage medium that exists separately and is not assembled into a terminal. The non-volatile computer storage medium stores one or more programs that, when executed by a device, cause the device to: acquire an inference request; perform inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information contained in the inference request and the computing resource occupancy rate of the artificial intelligence inference framework, and configure the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result; and load the inference model onto the instances according to the number of inference requests, the number of instances of the inference model and the maximum batch size of each instance, so as to perform inference processing on the inference requests.
In the description of the present invention, it should be noted that the terms "upper", "lower", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, which are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and operate, and thus, should not be construed as limiting the present invention. Unless expressly stated or limited otherwise, the terms "mounted," "connected," and "connected" are intended to be inclusive and mean, for example, that they may be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood according to specific situations by those of ordinary skill in the art.
It is further noted that, in the description of the present invention, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the invention and are not intended to limit the embodiments of the present invention, and that various other modifications and variations can be made by one skilled in the art in light of the above description.

Claims (12)

1. An inference method of an artificial intelligence inference framework, characterized by comprising:
Acquiring an inference request;
performing inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information contained in the inference request and the computing resource occupancy rate of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result; and
loading the inference model onto the instances according to the number of inference requests, the number of instances of the inference model and the maximum batch size of each instance, so as to perform inference processing on the inference requests.
2. The method of claim 1, wherein the performing inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information included in the inference request and the computing resource occupancy of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result comprises:
judging whether the delay requirement satisfaction rate of the current overall inference request is greater than or equal to a preset satisfaction rate threshold value or not according to the maximum allowable delay information contained in the inference request:
if so, reducing the number of instances of the inference model corresponding to the inference request meeting the maximum allowable delay and/or the maximum batch size of the instances;
and if not, increasing the number of the instances of the inference model corresponding to the inference request which does not meet the maximum allowable delay and/or the maximum batch size of the instances.
3. The method of claim 2, wherein the performing inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information included in the inference request and the computing resource occupancy of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result comprises:
when judging, according to the acquired inference request, that a new inference model needs to be loaded, judging whether the idle computing resources of the artificial intelligence inference framework meet the computing resource requirements of the new inference model:
if yes, starting a new inference model;
if not, calculating a difference between the idle computing resources of the artificial intelligence inference framework and the computing resource requirements of the new inference model, and reducing, according to the difference, the number of instances of the inference model corresponding to the inference requests meeting the maximum allowable delay and/or the maximum batch size of the instances.
4. The method of claim 1, wherein the performing inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information included in the inference request and the computing resource occupancy of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result comprises:
upon determining that there is an inference model that has completed inference processing, deactivating the inference model that has completed inference processing.
5. The method of claim 1, wherein the performing inference performance evaluation on the artificial intelligence inference framework according to the maximum allowable delay information included in the inference request and the computing resource occupancy rate of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to the inference performance evaluation result comprises: carrying out inference performance evaluation on the artificial intelligence inference framework at a set time interval according to the maximum allowable delay information contained in the inference request and the computing resource occupancy rate of the artificial intelligence inference framework, and configuring the number of instances of the inference model and the maximum batch size of each instance according to an inference performance evaluation result.
6. The method of claim 1, wherein the loading the inference model to the instances according to the number of the inference requests, the number of the instances of the inference model and the maximum batch size of each instance comprises:
judging whether the inference request quantity corresponding to each inference model is larger than a preset maximum inference request quantity threshold value:
if so, setting the batch size of each instance of the inference model as the maximum batch size, and loading the inference model to the instance to perform inference processing on the inference request;
if not, setting the batch size of each instance of the inference model to a preset batch size, and loading the inference model to the instances so as to perform inference processing on the inference request.
7. The method of claim 1, wherein the loading the inference model to the instances according to the number of the inference requests, the number of the instances of the inference model and the maximum batch size of each instance comprises:
judging whether the inference request quantity corresponding to each inference model is smaller than a preset minimum inference request quantity threshold value:
if so, setting the batch size of each instance of the inference model as a preset minimum batch size, and loading the inference model to the instance to perform inference processing on the inference request;
if not, setting the batch size of each instance of the inference model to a preset batch size, and loading the inference model to the instances so as to perform inference processing on the inference request.
8. The method of claim 1, wherein the loading the inference model to the instances according to the number of the inference requests, the number of the instances of the inference model and the maximum batch size of each instance comprises:
comparing the inference request quantity corresponding to each inference model with a preset maximum inference request quantity threshold value and a preset minimum inference request quantity threshold value:
if the inference request quantity corresponding to the inference model is greater than the preset maximum inference request quantity threshold value, setting the batch size of each instance of the inference model to the maximum batch size, and loading the inference model to the instances so as to perform inference processing on the inference request;
if the inference request quantity corresponding to the inference model is less than or equal to the preset maximum inference request quantity threshold value and greater than or equal to the preset minimum inference request quantity threshold value, setting the batch size of each instance of the inference model to a preset batch size, and loading the inference model to the instances so as to perform inference processing on the inference request;
and if the inference request quantity corresponding to the inference model is smaller than the preset minimum inference request quantity threshold value, setting the batch size of each instance of the inference model to a preset minimum batch size, and loading the inference model to the instances so as to perform inference processing on the inference request.
9. The method according to any one of claims 6-8, wherein the preset batch size of the instance is one half of the maximum batch size of the instance.
10. The method according to claim 7 or 8, wherein the preset minimum batch size of the instance takes the value 1.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-10 when executing the program.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-10.
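For orientation only, the following is a minimal, illustrative Python sketch (not part of the claims, and not the patent's implementation) of the evaluation-and-configuration loop described in claims 1, 2 and 5: the delay-requirement satisfaction rate of the current inference requests is compared with a preset threshold, instance counts and maximum batch sizes are reduced for models whose requests meet their maximum allowable delay or increased for models whose requests miss it, and the evaluation is repeated at a set time interval. All class names, field names, thresholds and scaling steps below are hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class ModelConfig:
    """Hypothetical per-model configuration: instance count and per-instance maximum batch size."""
    num_instances: int
    max_batch_size: int


@dataclass
class RequestStat:
    """Observed latency of one inference request versus the maximum allowable delay it carried."""
    model_name: str
    latency_ms: float
    max_allowed_delay_ms: float


def evaluate_and_configure(stats, configs, satisfaction_threshold=0.95):
    """One evaluation round: compare the overall delay-requirement satisfaction rate with a
    preset threshold and adjust instance counts and/or maximum batch sizes per model."""
    if not stats:
        return configs
    satisfied = [s for s in stats if s.latency_ms <= s.max_allowed_delay_ms]
    rate = len(satisfied) / len(stats)
    if rate >= satisfaction_threshold:
        # Delay requirements are met overall: reclaim resources from models whose
        # requests already satisfy their maximum allowable delay.
        for name in {s.model_name for s in satisfied}:
            cfg = configs[name]
            cfg.num_instances = max(1, cfg.num_instances - 1)
            cfg.max_batch_size = max(1, cfg.max_batch_size // 2)
    else:
        # Delay requirements are not met: grant more instances / larger batches to the
        # models whose requests missed their maximum allowable delay.
        for name in {s.model_name for s in stats if s.latency_ms > s.max_allowed_delay_ms}:
            cfg = configs[name]
            cfg.num_instances += 1
            cfg.max_batch_size *= 2
    return configs


def evaluation_loop(collect_stats, configs, interval_s=10.0):
    """Repeat the evaluation at a set time interval, in the spirit of claim 5."""
    while True:
        evaluate_and_configure(collect_stats(), configs)
        time.sleep(interval_s)
```

In practice the scaling policy (how many instances to add or remove, and how far to adjust batch sizes) would be tied to the measured computing resource occupancy rather than the fixed steps used in this sketch.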
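Similarly, the resource check of claim 3 can be sketched as follows, under the assumption that the per-instance resource cost of each loaded model is known. The ResourcePool and RunningModel types and the cost accounting are illustrative only: a new model is started directly when idle resources cover its demand; otherwise the difference is computed and instances are reclaimed from models whose requests already meet their maximum allowable delay.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunningModel:
    """Hypothetical bookkeeping for a loaded inference model."""
    num_instances: int
    per_instance_cost: float   # resources (e.g. GPU memory units) consumed by one instance
    meets_delay: bool          # whether its requests currently satisfy the maximum allowable delay


@dataclass
class ResourcePool:
    """Hypothetical view of the framework's compute resources, in arbitrary units."""
    total: float
    used: float

    @property
    def idle(self) -> float:
        return self.total - self.used


def try_load_new_model(pool: ResourcePool, new_model_demand: float,
                       running: Dict[str, RunningModel]) -> bool:
    """Start a new model only if idle resources cover its demand; otherwise shrink models
    that already meet their delay requirement by (at least) the computed difference."""
    if pool.idle >= new_model_demand:
        return True  # enough idle resources: the new inference model can be started directly
    shortfall = new_model_demand - pool.idle  # the difference computed in claim 3
    for model in running.values():
        if shortfall <= 0:
            break
        if model.meets_delay and model.num_instances > 1:
            model.num_instances -= 1              # reduce the instance count of a "healthy" model
            pool.used -= model.per_instance_cost  # bookkeeping: its resources become idle
            shortfall -= model.per_instance_cost
    return pool.idle >= new_model_demand
```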
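Finally, the threshold-based batch-size selection of claims 6-10 reduces to a short three-way comparison. In the sketch below the request-count thresholds are hypothetical values, while the preset batch size of one half of the maximum (claim 9) and the preset minimum batch size of 1 (claim 10) follow the claims directly.

```python
from dataclasses import dataclass


@dataclass
class BatchPolicy:
    """Hypothetical batch-size policy for one inference model (threshold values are illustrative)."""
    max_batch_size: int          # maximum batch size configured for each instance
    min_batch_size: int = 1      # preset minimum batch size (claim 10)
    max_request_threshold: int = 64
    min_request_threshold: int = 8

    @property
    def preset_batch_size(self) -> int:
        # Preset batch size is one half of the instance's maximum batch size (claim 9).
        return max(1, self.max_batch_size // 2)


def select_batch_size(num_pending_requests: int, policy: BatchPolicy) -> int:
    """Pick the batch size each instance should use, per the three-way comparison of claim 8."""
    if num_pending_requests > policy.max_request_threshold:
        return policy.max_batch_size      # heavy load: run instances at the maximum batch size
    if num_pending_requests < policy.min_request_threshold:
        return policy.min_batch_size      # light load: fall back to the minimum batch size
    return policy.preset_batch_size       # in between: use the preset (half-of-max) batch size


if __name__ == "__main__":
    policy = BatchPolicy(max_batch_size=32)
    for pending in (4, 20, 100):
        print(pending, "->", select_batch_size(pending, policy))
```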
CN202310002237.2A 2023-01-03 2023-01-03 Inference method, computer equipment and medium for artificial intelligence inference framework Pending CN115952866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310002237.2A CN115952866A (en) 2023-01-03 2023-01-03 Inference method, computer equipment and medium for artificial intelligence inference framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310002237.2A CN115952866A (en) 2023-01-03 2023-01-03 Inference method, computer equipment and medium for artificial intelligence inference framework

Publications (1)

Publication Number Publication Date
CN115952866A true CN115952866A (en) 2023-04-11

Family

ID=87287446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310002237.2A Pending CN115952866A (en) 2023-01-03 2023-01-03 Inference method, computer equipment and medium for artificial intelligence inference framework

Country Status (1)

Country Link
CN (1) CN115952866A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116594846A (en) * 2023-07-14 2023-08-15 支付宝(杭州)信息技术有限公司 Inference service monitoring method and device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116594846A (en) * 2023-07-14 2023-08-15 支付宝(杭州)信息技术有限公司 Inference service monitoring method and device

Similar Documents

Publication Publication Date Title
CN115952866A (en) Inference method, computer equipment and medium for artificial intelligence inference framework
CN112905326B (en) Task processing method and device
WO2019184822A1 (en) Multi-media file processing method and device, storage medium and electronic device
CN111294647A (en) Video processing method, device and equipment and storage medium
CN111527501A (en) Chip adaptation determining method and related product
WO2020038127A1 (en) Decoding method and apparatus, electronic device, and storage medium
WO2023201947A1 (en) Methods, systems, and storage media for task dispatch
CN116980569A (en) Security monitoring system and method based on cloud computing
CN114679607A (en) Video frame rate control method and device, electronic equipment and storage medium
CN111970539B (en) Data coding method based on deep learning and cloud computing service and big data platform
CN111970565A (en) Video data processing method and device, electronic equipment and storage medium
CN109388501B (en) Communication matching method, device, equipment and medium based on face recognition request
CN109561346A (en) A kind of distributed analytic method and system of video
US11635997B2 (en) Dataflow optimization apparatus and method for low-power operation of multicore systems
CN112672211A (en) Negative feedback code stream decoding method under intelligent monitoring scene
CN112598112B (en) Resource scheduling method based on graph neural network
KR101932130B1 (en) Apparatus and method for improving quality of experience of remote display
CN113886030A (en) Resource scheduling method, electronic device and storage medium
CN115617421B (en) Intelligent process scheduling method and device, readable storage medium and embedded equipment
CN113507692B (en) UART communication-based Beidou short message acquisition method, device, equipment and medium
CN116680086B (en) Scheduling management system based on offline rendering engine
CN117785471A (en) Picture processing method, device, equipment and storage medium
US20230421779A1 (en) Decoding processing method and apparatus, computer device, and storage medium
CN117726588A (en) Image analysis method, device, electronic equipment and storage medium
CN116980616A (en) Mode decision scheduling method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination