CN110837419B - Reasoning engine system and method based on elastic batch processing and electronic equipment


Info

Publication number
CN110837419B
Authority
CN
China
Prior art keywords
engine
batch
sub
reasoning
data
Prior art date
Legal status
Active
Application number
CN201911088741.9A
Other languages
Chinese (zh)
Other versions
CN110837419A (en)
Inventor
陈全
过敏意
崔炜皞
沈耀
姚斌
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201911088741.9A priority Critical patent/CN110837419B/en
Publication of CN110837419A publication Critical patent/CN110837419A/en
Application granted granted Critical
Publication of CN110837419B publication Critical patent/CN110837419B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F 9/00
    • G06F 2209/50 Indexing scheme relating to G06F 9/50
    • G06F 2209/5017 Task decomposition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an inference engine system and method based on elastic batch processing, and an electronic device. The elastic batch-processing-based inference engine method comprises the following steps: acquiring request data to be inferred that is input by a user; obtaining the maximum parallel batch size and the number of pending inference requests; and organizing the pending request data into batches of suitable size according to the maximum parallel batch size and the number of pending requests, waking up the sub-engine in a deep neural network inference engine module that corresponds to the batch, and processing the pending inference requests with that sub-engine. The invention minimizes the response latency and maximizes the throughput of the engine system without adding hardware devices such as additional graphics processors.

Description

Reasoning engine system and method based on elastic batch processing and electronic equipment
Technical Field
The invention relates to the technical field of processors, in particular to GPUs, and specifically to an inference engine system and method based on elastic batch processing and an electronic device.
Background
With the large-scale deployment of computationally intensive applications such as speech recognition, machine translation and personal assistants, mainstream private data centers and public cloud platforms have begun to use coprocessors such as GPUs in large numbers to cope with the insufficient computing power of traditional CPUs. GPUs were originally dedicated processors designed for graphics computation; because their parallelism is unmatched by conventional CPUs, more and more non-graphics applications have been migrated to GPU platforms to meet rapidly growing computing demands. Research has shown, however, that non-graphics applications often lack enough parallelism to fully utilize the hardware resources of a GPU, so hardware resources are wasted. Moreover, as GPU architectures and fabrication processes advance, more and more streaming multiprocessors (Streaming Multiprocessor, SM) are integrated into a single GPU, making the resource-waste problem even more prominent.
GPUs are now widely used to provide deep-learning-based online services, and how to improve their performance, including achieving lower response latency and higher system throughput, is a technical problem that those skilled in the art need to solve.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide an inference engine system, method and electronic device based on elastic batch processing, for improving the performance of GPU-based inference engines.
To achieve the above and other related objects, an embodiment of the present invention provides an elastic batch-based inference engine method, comprising: acquiring request data to be inferred that is input by a user; obtaining the maximum parallel batch size and the number of pending inference requests; and organizing the pending request data into batches of suitable size according to the maximum parallel batch size and the number of pending requests, waking up the sub-engine in a deep neural network inference engine module that corresponds to the batch, and processing the pending inference requests with that sub-engine.
In an embodiment of the present application, the reasoning engine method based on elastic batch processing further includes: constructing a deep neural network reasoning model; configuring sub-engines in the deep neural network reasoning engine module according to the deep neural network reasoning model; the deep neural network reasoning engine module comprises a plurality of sub-engines for processing different batch data sizes, and each sub-engine shares parameters of the deep neural network reasoning model.
In an embodiment of the present application, the elastic batch-based inference engine method further includes: storing the request data to be inferred input by the user into a video memory buffer pool formed by a continuous video memory address space; the continuous video memory address space is cut into video memory blocks of the same size, and each video memory block is used for storing the data of one request to be inferred.
In one embodiment of the present application, the maximum number of parallel batches is obtained based on the longest response time for processing the inference request.
In an embodiment of the present application, the organizing the to-be-processed reasoning request data into batch data with a suitable batch size as required, and waking up a sub-engine corresponding to the batch data in the deep neural network reasoning engine module includes: detecting whether an idle sub-engine exists in an engine queue of a deep neural network reasoning engine module, and if so, selecting the number of schedulable requests to be reasoning from the video memory buffer pool; organizing the to-be-processed reasoning request data into batch processing data with proper batch processing size according to the idle sub-engine, the maximum parallel batch processing quantity and the to-be-reasoning request quantity; scheduling a sub-engine corresponding to the size of the batch data from the idle sub-engines.
In an embodiment of the present application, if a single sub-engine cannot process all of the pending inference requests in one pass, sub-engines corresponding to the remaining pending requests are further obtained from the idle sub-engines, and the remaining requests continue to be processed.
The embodiment of the invention also provides an reasoning engine system based on elastic batch processing, which comprises: the video memory buffer pool module is used for acquiring and storing to-be-inferred request data input by a user; the reasoning engine module is used for configuring a plurality of sub-engines in the deep neural network reasoning engine module; the elastic batch processing scheduling module is used for acquiring the maximum parallel batch processing quantity and the quantity of the to-be-inferred requests, organizing the to-be-inferred request data into batch processing data with proper batch processing size according to the maximum parallel batch processing quantity and the quantity of the to-be-inferred requests, and waking up sub-engines corresponding to the batch processing data in the deep neural network inference engine module, wherein the sub-engines process the to-be-inferred requests.
In one embodiment of the present application, the maximum number of parallel batches is obtained based on the longest response time for processing the inference request.
In an embodiment of the present application, the flexible batch scheduling module includes: the idle engine management unit is used for detecting whether idle sub-engines exist in an engine queue of the deep neural network reasoning engine module, and if so, the number of schedulable to-be-reasoning requests is selected from the video memory buffer pool; the batch processing data processing unit is used for organizing the to-be-processed reasoning request data into batch processing data with proper batch processing size according to the idle sub-engine, the maximum parallel batch processing quantity and the to-be-reasoning request quantity; and the engine scheduling unit is used for scheduling the sub-engine corresponding to the size of the batch processing data from the idle sub-engines.
Embodiments of the present invention also provide an electronic device comprising a processor and a memory, the memory storing program instructions that are executed by the processor to implement the flexible batch-based inference engine method as described above.
As described above, the reasoning engine system, the method and the electronic device based on the elastic batch processing have the following beneficial effects:
1. The invention establishes a system comprising a video memory buffer pool module, a multi-granularity inference engine module and an elastic batch scheduling module, and realizes a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing, minimizing the response latency and maximizing the throughput of the whole engine system without adding hardware devices such as additional graphics processors.
2. The results of the invention support the practical deployment of emerging deep neural network technology, guide the construction of commercially meaningful neural network inference task scheduling systems based on elastic batching, and simplify the optimization of neural network inference scheduling services for users.
Drawings
FIG. 1 is a schematic diagram of the overall flow of the inference engine method based on elastic batch processing of the present invention.
FIG. 2 is a flow chart of a scheduling sub-engine in the flexible batch-based inference engine method of the present invention.
Fig. 3 and 4 are schematic diagrams illustrating embodiments of the flexible batch-based reasoning engine method of the present invention.
Fig. 5 is a block diagram showing the overall schematic of the reasoning engine system based on elastic batch processing of the present invention.
FIG. 6 is a schematic block diagram of an elastic batch scheduling module in the elastic batch based reasoning engine system of the present invention.
Fig. 7 is a schematic structural diagram of an electronic terminal according to an embodiment of the present application.
Description of element reference numerals
100. Reasoning engine system based on elastic batch processing
110. Video memory buffer pool module
120. Reasoning engine module
130. Elastic batch processing scheduling module
131. Idle engine management unit
132. Batch processing data processing unit
133. Engine scheduling unit
1101. Processor
1102. Memory device
S110 to S130 steps
S131 to S134 steps
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure herein, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details of this description may be modified or changed in various ways without departing from the spirit and scope of the present invention. It should be noted that, in the absence of conflict, the following embodiments and the features in the embodiments may be combined with one another.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
The embodiment aims to provide an reasoning engine system, a reasoning engine method and electronic equipment based on elastic batch processing, which are used for improving the performance of a GPU processor engine.
This embodiment aims to design and implement a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing. It provides three modules: a video memory buffer pool on the graphics processor side, a multi-granularity deep neural network inference engine, and an elastic batch scheduler. These respectively decouple computation from data transfer, execute batch tasks of different sizes in parallel, and elastically organize batches of different sizes, so that application developers can conveniently deploy deep neural networks and obtain better performance, namely lower response latency and higher system throughput, with few parameter configurations.
The principle and implementation of the elastic batch-based inference engine system, method and electronic device of the present embodiment will be described in detail below, so that those skilled in the art may understand the elastic batch-based inference engine system, method and electronic device of the present invention without creative effort.
As shown in fig. 1, the embodiment provides an inference engine method based on elastic batch processing, which includes the following steps:
step S110, obtaining to-be-inferred request data input by a user;
step S120, obtaining the maximum parallel batch processing quantity and the quantity of the to-be-inferred requests;
and step S130, organizing the data of the to-be-processed reasoning request into batch processing data with proper batch processing size according to the maximum parallel batch processing quantity and the quantity of the to-be-reasoning request, waking up a sub-engine corresponding to the batch processing data in a deep neural network reasoning engine module, and processing the to-be-processed reasoning request by the sub-engine.
The following describes the steps S110 to S130 of the inference engine method based on elastic batch processing in this embodiment in detail.
Step S110, obtaining the to-be-inferred request data input by the user.
In this embodiment, the elastic batch-based inference engine method further includes: storing the request data to be inferred input by the user into a video memory buffer pool formed by a continuous video memory address space; the continuous video memory address space is cut into video memory blocks of the same size, and each video memory block is used for storing the data of one request to be inferred.
Specifically, in this embodiment, the video memory buffer pool consists mainly of a segment of contiguous video memory address space allocated on the graphics processor side; this contiguous space is divided into video memory blocks of equal size, each storing the input data of one pending inference request (i.e. the request data to be inferred). Pending inference requests are cached in the graphics processor's video memory, and the input data of several different inference requests can be organized into batch input data.
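As an illustration of this buffer-pool design, the sketch below models a pool of equally sized slots carved out of one contiguous allocation, one slot per pending request. It is a hypothetical Python approximation: the class and method names are the editor's, and an ordinary bytearray stands in for the GPU video memory that a real implementation would allocate with CUDA.

class RequestBufferPool:
    """Sketch of a buffer pool: one contiguous region split into equal slots,
    each slot holding the input data of a single pending inference request."""

    def __init__(self, slot_bytes: int, num_slots: int):
        self.slot_bytes = slot_bytes
        self.num_slots = num_slots
        # Contiguous backing store; stands in for a region allocated in GPU video memory.
        self.buffer = bytearray(slot_bytes * num_slots)
        self.free_slots = list(range(num_slots))   # indices of unused slots
        self.pending = []                          # slot indices awaiting scheduling

    def put_request(self, data: bytes) -> int:
        """Cache one request's input data into the next free slot."""
        if not self.free_slots or len(data) > self.slot_bytes:
            raise RuntimeError("pool full or request too large")
        slot = self.free_slots.pop(0)
        offset = slot * self.slot_bytes
        self.buffer[offset:offset + len(data)] = data
        self.pending.append(slot)
        return slot

    def take_contiguous(self, max_count: int) -> list[int]:
        """Select up to max_count pending slots in address order, so that a batch
        reads from nearby locations in the pool."""
        selected = sorted(self.pending)[:max_count]
        for s in selected:
            self.pending.remove(s)
        return selected

    def release(self, slots: list[int]) -> None:
        """Return slots to the free list once their requests have been processed."""
        self.free_slots.extend(slots)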
Step S120, the maximum parallel batch processing number and the number of requests to be inferred are obtained.
In this embodiment, the maximum number of parallel batches is obtained according to the longest response time for processing the reasoning request.
To meet the quality-of-service requirement of the overall elastic batching system, the application developer needs to set the longest acceptable response time for inference request service, from which the acceptable maximum parallel batch size is derived; in other words, the user defines, according to their own needs, the longest acceptable processing delay of a deep neural network inference request as the quality-of-service requirement. In the subsequent scheduling process, this longest-response-time parameter determines an upper limit on the sum of the batch sizes of all working sub-engines, and the whole system is scheduled against that upper limit.
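One plausible way to derive the maximum parallel batch size from the developer-specified longest response time is sketched below, assuming an offline-profiled latency table; the function name and the example latency numbers are illustrative assumptions, not values taken from the patent.

def max_parallel_batch_size(latency_slo_ms: float,
                            profiled_latency_ms: dict[int, float]) -> int:
    """Return the largest total batch size whose measured processing latency
    still fits within the user-specified response-time bound (the QoS target).

    profiled_latency_ms maps a candidate batch size to the measured inference
    latency at that size; it would be obtained by profiling the model offline.
    """
    feasible = [b for b, t in profiled_latency_ms.items() if t <= latency_slo_ms]
    if not feasible:
        raise ValueError("no batch size satisfies the latency target")
    return max(feasible)

# Example: with a made-up latency profile, a 40 ms target allows a parallel batch size of 16.
profile = {1: 5.0, 2: 7.0, 4: 11.0, 8: 19.0, 16: 35.0, 32: 66.0}
MAX_PARALLEL = max_parallel_batch_size(40.0, profile)   # -> 16

This upper limit then bounds the sum of the batch sizes of all sub-engines that work at the same time, as described above.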
And step S130, organizing the data of the to-be-processed reasoning request into batch processing data with proper batch processing size according to the maximum parallel batch processing quantity and the quantity of the to-be-reasoning request, waking up a sub-engine corresponding to the batch processing data in a deep neural network reasoning engine module, and processing the to-be-processed reasoning request by the sub-engine.
Specifically, as shown in fig. 2, in this embodiment, the organizing the to-be-processed reasoning request data into batch data with a suitable batch size as required, and waking up a sub-engine corresponding to the batch data in the deep neural network reasoning engine module includes:
step S131, detecting whether an idle sub-engine exists in an engine queue of the deep neural network reasoning engine module, if yes, executing step S132, selecting the number of schedulable to-be-inferred requests from the video memory buffer pool, and if not, continuously detecting whether the idle sub-engine exists in the engine queue of the deep neural network reasoning engine module.
Checking the idle sub-engine queue status: checking the state of an idle sub-engine queue in the multi-granularity reasoning engine, and if no idle sub-engine exists, circularly inquiring the idle sub-engine queue until the idle sub-engine queue exists; if there is an idle sub-engine to proceed to the next step, the scheduling is attempted.
Step S133, organizing the to-be-processed reasoning request data into batch processing data with proper batch processing size according to the idle sub-engine, the maximum parallel batch processing quantity and the to-be-reasoning request quantity;
checking the buffer pool state: checking the state of the video memory buffer pool, and returning to continuously detect the state of the idle sub-engine queue if no request exists; if the request to be scheduled is available, selecting a schedulable request from the video memory buffer pool.
Selecting a schedulable request from a video memory buffer pool: in order to ensure the continuity of different request data storage positions, the most schedulable requests selected from the existing storage positions are scheduled from a video memory buffer pool.
Step S134, scheduling a sub-engine corresponding to the size of the batch data from the idle sub-engines.
Scheduling the appropriate sub-engine to process the request: the data is organized into batch processing data of proper size according to idle sub-engines in the multi-granularity reasoning engine, and the sub-engines are called to process the data.
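The step of organizing pending requests into a batch of a size suitable for an idle sub-engine can be pictured with the small helper below. This is an editor's sketch in Python; the function name, the example sub-engine sizes and the exact selection rule (the largest idle engine that fits the pending count and the remaining parallel budget) are assumptions consistent with the description, not the patent's published algorithm.

def choose_dispatch(idle_engine_sizes: list[int],
                    num_pending: int,
                    parallel_budget: int) -> int:
    """Pick the batch size of the idle sub-engine to wake next.

    idle_engine_sizes: batch sizes of currently idle sub-engines, e.g. [1, 2, 4, 8].
    num_pending:       number of requests waiting in the buffer pool.
    parallel_budget:   remaining headroom under the maximum parallel batch size
                       (the QoS-derived limit minus the sizes of busy engines).
    Returns 0 if nothing can be dispatched right now.
    """
    limit = min(num_pending, parallel_budget)
    candidates = [s for s in idle_engine_sizes if s <= limit]
    return max(candidates) if candidates else 0

# Example: 11 pending requests, budget 16, idle engines of size 1/2/4/8 -> wake the 8-engine;
# the remaining 3 requests wait for the next idle engine (e.g. the 2- and 1-engines).
print(choose_dispatch([1, 2, 4, 8], 11, 16))   # 8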
In this embodiment, if a single sub-engine cannot process all of the pending inference requests in one pass, sub-engines corresponding to the remaining pending requests are further obtained from the idle sub-engines, and the remaining requests continue to be processed.
That is, while there are schedulable requests and idle sub-engines: the selected schedulable requests may not all be processed at one time by the single sub-engine that was scheduled. If there are still schedulable requests and idle sub-engines, jump to step S134 and continue scheduling; otherwise, jump to step S131 and repeat the scheduling process.
In this embodiment, the reasoning engine method based on elastic batch processing further includes:
constructing a deep neural network reasoning model; and configuring sub-engines in the deep neural network reasoning engine module according to the deep neural network reasoning model.
The deep neural network reasoning engine module comprises a plurality of sub-engines for processing different batch data sizes, and each sub-engine shares parameters of the deep neural network reasoning model.
In this embodiment, the deep neural network inference engine module is a multi-granularity inference engine composed of several deep neural network inference sub-engines that can process different batch sizes. Each sub-engine in the multi-granularity inference engine shares the parameters, such as the weights, of the same deep neural network model. The batch size of each sub-engine and the number of sub-engines can be configured and tuned according to the intent of the application developer; when the application developer does not specify these parameters, the deep neural network inference engine module is given a preferred configuration by the elastic batch scheduling module. The multi-granularity inference engine can execute several sub-engines in parallel on the graphics processor at the same time; the sub-engines influence one another and jointly determine the processing time of the inference requests.
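To make the multi-granularity engine concrete, the following sketch declares several sub-engines with different fixed batch sizes that all reference one shared set of model weights. It is a minimal illustration assuming plain Python objects; the class names, the placeholder run method and the example batch sizes are the editor's, not the patent's.

from dataclasses import dataclass, field

@dataclass
class SubEngine:
    batch_size: int               # fixed batch size this sub-engine is built for
    weights: dict                 # shared reference to the model parameters
    busy: bool = False

    def run(self, batch: list) -> list:
        # Placeholder for an actual DNN forward pass executed on the GPU.
        assert len(batch) <= self.batch_size
        return [f"result-for-{item}" for item in batch]

@dataclass
class MultiGranularityEngine:
    weights: dict
    sub_engines: list = field(default_factory=list)

    @classmethod
    def build(cls, weights: dict, batch_sizes: list[int]) -> "MultiGranularityEngine":
        eng = cls(weights)
        # Every sub-engine holds the same weights object, so the model is stored once.
        eng.sub_engines = [SubEngine(b, weights) for b in sorted(batch_sizes)]
        return eng

    def idle(self) -> list[SubEngine]:
        return [e for e in self.sub_engines if not e.busy]

# Example configuration: sub-engines for batch sizes 1, 2, 4 and 8 sharing one weight set.
shared_weights = {"layer0": "...", "layer1": "..."}   # stand-in for real tensors
engine = MultiGranularityEngine.build(shared_weights, [1, 2, 4, 8])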
As shown in fig. 5, the present embodiment further provides an inference engine system 100 based on elastic batch processing, where the elastic batch-processing-based inference engine system 100 includes: the video memory buffer pool module 110, the inference engine module 120, and the elastic batch scheduling module 130.
That is, the reasoning engine system 100 based on elastic batch processing in this embodiment includes a video memory buffer pool at the graphics processor end, a multi-granularity deep neural network reasoning engine, and an elastic batch processing scheduler. The core of the embodiment is a software system for designing an elastic batch processing deep neural network reasoning engine.
In this embodiment, the video memory buffer pool module 110 is configured to obtain and store the to-be-inferred request data input by the user.
Specifically, the memory buffer pool module 110 is mainly composed of a segment of continuous memory address space allocated at the graphics processing end, where the segment of continuous address space is divided into equally sized memory blocks for storing the input data of each reasoning request to be executed. The module caches the reasoning request to be processed into the video memory of the graphic processor end, and can organize a plurality of different input data of the reasoning request into batch processing input data.
The inference engine module 120 is configured to configure a plurality of sub-engines in the deep neural network inference engine module 120.
The inference engine module 120 is a multi-granularity inference engine composed of several deep neural network inference sub-engines that can handle different batch sizes. Each sub-engine shares the parameters, such as the weights, of the same deep neural network model; the batch size of each sub-engine and the number of sub-engines can be configured and tuned according to the intent of the application developer, and when the developer does not specify these parameters, the module is given a preferred configuration by the elastic batch scheduling module. The multi-granularity inference engine can execute several sub-engines in parallel on the graphics processor at the same time; the sub-engines influence one another and jointly determine the processing time of the inference requests.
In this embodiment, the elastic batch scheduling module 130 is configured to obtain a maximum number of parallel batches and a number of to-be-inferred requests, organize the to-be-inferred request data into batch data with a suitable batch size according to the maximum number of parallel batches and the number of to-be-inferred requests, and wake up sub-engines corresponding to the batch data in the deep neural network inference engine module 120, where the sub-engines process the to-be-inferred requests.
In this embodiment, the maximum number of parallel batches is obtained according to the longest response time for processing the reasoning request.
Specifically, in this embodiment, as shown in fig. 6, the flexible batch scheduling module 130 includes: idle engine management unit 131, batch data processing unit 132, and engine scheduling unit 133.
The idle engine management unit 131 is configured to detect whether an idle sub-engine exists in an engine queue of the deep neural network reasoning engine module 120, and if so, select the number of schedulable to-be-reasoning requests from the video memory buffer pool.
The batch data processing unit 132 is configured to organize the to-be-processed reasoning request data into batch data with a suitable batch size according to the idle sub-engine, the maximum parallel batch number and the to-be-reasoning request number.
The engine scheduling unit 133 is configured to schedule a sub-engine corresponding to the size of the batch data from the idle sub-engines.
The elastic batch scheduling module 130 coordinates the load between the video memory buffer pool module 110 and the inference engine module 120. To meet the quality-of-service requirement of the overall elastic batching system, the application developer needs to set the longest acceptable response time for inference request service, from which the system derives the acceptable maximum parallel batch size. According to this maximum parallel batch size and the number of pending inference requests currently in the video memory buffer pool, the elastic batch scheduling module 130 organizes the pending request data cached in the buffer pool into batches of suitable size as needed. Once the data is prepared, the elastic batch scheduling module 130 wakes up the sub-engine of the deep neural network inference engine module 120 that corresponds to the batch size, and that sub-engine is responsible for processing the inference requests of the batch. The scheduling algorithm used by the elastic batch scheduling module 130 is shown in Algorithm 1 (Table 1) below.
Table 1. Elastic batch scheduling algorithm
(The algorithm is reproduced only as an image in the original publication; its steps correspond to the scheduling steps S131 to S134 above and to steps 5) through 9) of the execution flow below.)
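Because the algorithm itself appears only as an image in the original, the sketch below reconstructs the elastic batch scheduling loop from steps S131 to S134 above and steps 5) through 9) of the execution flow below. It is the editor's reading of that description, not the published pseudocode, and it reuses the hypothetical helpers sketched earlier (RequestBufferPool, MultiGranularityEngine, choose_dispatch).

import time

def elastic_batch_scheduler(pool, engine, max_parallel):
    """Reconstruction of the elastic batch scheduling loop (editor's sketch).

    pool:         RequestBufferPool holding pending request data
    engine:       MultiGranularityEngine with sub-engines of several batch sizes
    max_parallel: QoS-derived upper bound on the summed batch sizes of busy engines
    """
    while True:
        idle = engine.idle()
        if not idle:                       # step 5): wait for an idle sub-engine
            time.sleep(0.001)
            continue
        if not pool.pending:               # step 6): wait for requests in the buffer pool
            time.sleep(0.001)
            continue

        busy_total = sum(e.batch_size for e in engine.sub_engines if e.busy)
        budget = max_parallel - busy_total

        # Steps 7) to 9): keep dispatching while requests, idle engines and budget remain.
        while pool.pending and idle and budget > 0:
            size = choose_dispatch([e.batch_size for e in idle],
                                   len(pool.pending), budget)
            if size == 0:
                break
            sub = next(e for e in idle if e.batch_size == size)
            slots = pool.take_contiguous(size)     # step 7): pick schedulable requests
            sub.busy = True                        # step 8): wake the matching sub-engine
            batch = [bytes(pool.buffer[s * pool.slot_bytes:(s + 1) * pool.slot_bytes])
                     for s in slots]
            sub.run(batch)                         # in practice, an asynchronous GPU launch
            sub.busy = False
            pool.release(slots)
            idle.remove(sub)
            budget -= size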
As shown in figs. 3 and 4, to further explain the principle of this embodiment, the overall execution flow of this example is described in detail below.
The flow is as follows:
1) The user builds an inference model: this belongs to the multi-granularity inference engine module 120. The user builds an inference model according to their own inference requirements.
2) The user specifies the quality-of-service requirement: this belongs to the elastic batch scheduling module 130. The user defines, according to their own needs, the longest acceptable processing delay of a deep neural network inference request as the quality-of-service requirement.
3) Setting the parameters required by the inference system: this belongs to the elastic batch scheduling module 130. The elastic batch scheduling module 130 calculates the maximum batch size that satisfies the user-defined quality-of-service requirement; in the subsequent scheduling process, the elastic batch scheduler uses this parameter as the upper limit on the sum of the batch sizes of all working sub-engines and schedules the whole inference system accordingly. The elastic batch scheduling module 130 also sets the number of sub-engines in the multi-granularity inference engine, the batch size of each sub-engine, and the size of the buffer pool in the video memory buffer pool module 110.
4) Receiving an inference request: this belongs to the video memory buffer pool module 110. The inference request is received and cached in the buffer pool on the graphics processor side.
5) Checking the idle sub-engine queue state: this belongs to the elastic batch scheduling module 130. Check the state of the idle sub-engine queue in the multi-granularity inference engine; if there is no idle sub-engine, poll the queue until one appears; if there is an idle sub-engine, proceed to the next step and attempt scheduling.
6) Checking the buffer pool state: this belongs to the elastic batch scheduling module 130. Check the state of the video memory buffer pool; if there is no request, jump back to step 5) and check the idle sub-engine queue state; if there are requests to be scheduled, proceed to the next step.
7) Selecting schedulable requests from the video memory buffer pool: this belongs to the elastic batch scheduling module 130. To keep the storage locations of different requests' data contiguous, the elastic batch scheduling module 130 selects as many schedulable requests as possible from contiguous existing storage locations in the video memory buffer pool.
8) Scheduling a suitable sub-engine to process the requests: this belongs to the elastic batch scheduling module 130 and the multi-granularity inference engine. The elastic batch scheduling module 130 organizes the inference request data into a batch of suitable size according to the idle sub-engines in the multi-granularity inference engine and invokes one of those sub-engines to process the data.
9) While there are schedulable requests and idle sub-engines: this belongs to the elastic batch scheduling module 130. The schedulable requests selected in step 7) may not all be processed in one pass by the single sub-engine scheduled in step 8). If there are still schedulable requests and idle sub-engines, jump to step 8) and continue scheduling; otherwise, jump to step 5) and repeat the scheduling process.
Embodiments of the present invention also provide an electronic device, which is, for example but not limited to, a medical detection device or an image processing device. As shown in fig. 7, the electronic device includes a processor 1101 and a memory 1102; the memory 1102 is connected to the processor 1101 through a system bus, and they communicate with each other. The memory 1102 is used to store a computer program, and the processor 1101 is used to run the computer program so that the electronic device executes the elastic batch-based inference engine method. The elastic batch-based inference engine method has been described in detail above and is not repeated here.
The elastic batch-based inference engine method can be applied to various types of electronic devices. The electronic device is, for example, a controller, such as an ARM (Advanced RISC Machines) controller, an FPGA (Field Programmable Gate Array) controller, an SoC (System on Chip) controller, a DSP (Digital Signal Processing) controller, or an MCU (Microcontroller Unit) controller. The electronic device may also be, for example, a computer comprising a memory, a memory controller, one or more processing units (CPUs), peripheral interfaces, RF circuitry, audio circuitry, speakers, a microphone, an input/output (I/O) subsystem, a display screen, other output or control devices, and external ports; such computers include, but are not limited to, personal computers such as desktop computers, notebook computers, tablet computers, smart phones, smart televisions, and personal digital assistants (Personal Digital Assistant, PDA). In other embodiments, the electronic device may also be a server, where the server may be deployed on one or more physical servers according to factors such as function and load, or may be formed from a distributed or centralized server cluster, which is not limited in this embodiment.
In a practical implementation, the electronic device is a device running the Android operating system, the iOS operating system, or another operating system such as Palm OS, Symbian, BlackBerry OS, or Windows Phone.
In an exemplary embodiment, the electronic device may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, cameras or other electronic elements for performing the above-described flexible batch-based inference engine method.
It should be noted that the system bus mentioned above may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus. The system bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is drawn in fig. 7, but this does not mean that there is only one bus or one type of bus. The communication interface is used to enable communication between the database access system and other devices (e.g., clients, read-write libraries, and read-only libraries). The memory may comprise random access memory (Random Access Memory, RAM) and may also comprise non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor 1101 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
In addition, the embodiment also provides a computer readable storage medium, on which a computer program is stored, the computer program implementing the reasoning engine method based on elastic batch processing when being executed by a processor. The above description of the reasoning engine method based on elastic batch processing has been described in detail, and will not be repeated here.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by computer program related hardware. The aforementioned computer program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
In summary, the present invention establishes a system comprising three parts, the video memory buffer pool module 110, the multi-granularity inference engine module 120 and the elastic batch scheduling module 130, and realizes a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing, so that the response latency is minimized and the throughput of the whole engine system is maximized without adding hardware devices such as additional graphics processors. The results of the invention support the practical deployment of emerging deep neural network technology, guide the construction of commercially meaningful neural network inference task scheduling systems based on elastic batching, and simplify the optimization of neural network inference scheduling services for users. The invention therefore effectively overcomes various shortcomings of the prior art and has high value for industrial application.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations that can be accomplished by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present invention.

Claims (8)

1. An elastic batch-based reasoning engine method, characterized in that the elastic batch-based reasoning engine method comprises the following steps:
acquiring to-be-inferred request data input by a user;
obtaining the maximum parallel batch processing quantity and the quantity of requests to be inferred;
organizing the data of the request to be inferred into batch data with proper batch size according to the maximum parallel batch number and the number of the request to be inferred, waking up a sub-engine corresponding to the batch data in a deep neural network inference engine module, and processing the request to be inferred by the sub-engine;
further comprises:
storing the request data to be inferred input by the user into a video memory buffer pool formed by a continuous video memory address space;
the continuous video memory address space is cut into video memory blocks with the same size, and each video memory block is used for storing the data of one request to be inferred;
the reasoning engine method based on elastic batch processing further comprises the following steps:
constructing a deep neural network reasoning model;
configuring sub-engines in the deep neural network reasoning engine module according to the deep neural network reasoning model;
the deep neural network reasoning engine module comprises a plurality of sub-engines for processing different batch data sizes, and each sub-engine shares parameters of the deep neural network reasoning model.
2. The flexible batch-based reasoning engine method of claim 1, wherein the maximum number of parallel batches is obtained based on a longest response time for processing a reasoning request.
3. The flexible batch-based reasoning engine method of claim 1, wherein organizing the request data to be reasoning into batch data of a suitable batch size as required and waking up sub-engines of a deep neural network reasoning engine module corresponding to the batch data comprises:
detecting whether an idle sub-engine exists in an engine queue of a deep neural network reasoning engine module, and if so, selecting the number of schedulable requests to be reasoning from the video memory buffer pool;
organizing the data of the request to be inferred into batch data with proper batch size according to the idle sub-engine, the maximum parallel batch number and the number of the request to be inferred;
scheduling a sub-engine corresponding to the size of the batch data from the idle sub-engines.
4. The method according to claim 3, wherein if a single sub-engine cannot process all of the requests to be inferred in one pass, sub-engines corresponding to the remaining requests to be inferred are further acquired from the idle sub-engines, and the remaining requests to be inferred continue to be processed.
5. An elastic batch-based reasoning engine system, comprising:
the video memory buffer pool module is used for acquiring and storing request data to be inferred that is input by a user, and is further used for storing the request data to be inferred input by the user into a video memory buffer pool formed by a continuous video memory address space, wherein the continuous video memory address space is cut into video memory blocks with the same size, and each video memory block is used for storing the data of one request to be inferred;
the reasoning engine module is used for configuring a plurality of sub-engines in the deep neural network reasoning engine module; also used for: constructing a deep neural network reasoning model; configuring sub-engines in the deep neural network reasoning engine module according to the deep neural network reasoning model; the deep neural network reasoning engine module comprises a plurality of sub-engines for processing different batch data sizes, and each sub-engine shares parameters of the deep neural network reasoning model;
the elastic batch processing scheduling module is used for acquiring the maximum parallel batch processing quantity and the quantity of the to-be-inferred requests, organizing the to-be-inferred request data into batch processing data with proper batch processing size according to the maximum parallel batch processing quantity and the quantity of the to-be-inferred requests, and waking up sub-engines corresponding to the batch processing data in the deep neural network inference engine module, wherein the sub-engines process the to-be-inferred requests.
6. An elastic batch based reasoning engine system as recited in claim 5, wherein the maximum number of parallel batches is obtained based on a longest response time for processing a reasoning request.
7. An elastic batch based inference engine system as defined in claim 5, wherein said elastic batch scheduling module comprises:
the idle engine management unit is used for detecting whether idle sub-engines exist in an engine queue of the deep neural network reasoning engine module, and if so, the number of schedulable to-be-reasoning requests is selected from the video memory buffer pool;
the batch processing data processing unit is used for organizing the data to be inferred to batch processing data with proper batch processing size according to the idle sub-engine, the maximum parallel batch processing quantity and the quantity of the requests to be inferred;
and the engine scheduling unit is used for scheduling the sub-engine corresponding to the size of the batch processing data from the idle sub-engines.
8. An electronic device, comprising a processor and a memory, the memory storing program instructions that are executed by the processor to implement the elastic batch-based inference engine method of any one of claims 1 to 4.
CN201911088741.9A 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment Active CN110837419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911088741.9A CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911088741.9A CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Publications (2)

Publication Number Publication Date
CN110837419A CN110837419A (en) 2020-02-25
CN110837419B true CN110837419B (en) 2023-05-19

Family

ID=69574710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911088741.9A Active CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Country Status (1)

Country Link
CN (1) CN110837419B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342808B (en) * 2021-05-26 2022-11-08 电子科技大学 Knowledge graph inference engine architecture system based on electromechanical equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102104626A (en) * 2009-12-22 2011-06-22 英特尔公司 Systems and methods for energy efficient load balancing at server clusters
CN107273195A (en) * 2017-05-24 2017-10-20 上海艾融软件股份有限公司 A kind of batch processing method of big data, device and computer system
CN109919315A (en) * 2019-03-13 2019-06-21 科大讯飞股份有限公司 A kind of forward inference method, apparatus, equipment and the storage medium of neural network
CN110019071A (en) * 2017-11-15 2019-07-16 北大方正集团有限公司 Data processing method and device


Also Published As

Publication number Publication date
CN110837419A (en) 2020-02-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant