CN110837419A - Inference engine system and method based on elastic batch processing and electronic equipment - Google Patents

Inference engine system and method based on elastic batch processing and electronic equipment

Info

Publication number
CN110837419A
Authority
CN
China
Prior art keywords
engine
batch processing
sub
inference
batch
Prior art date
Legal status
Granted
Application number
CN201911088741.9A
Other languages
Chinese (zh)
Other versions
CN110837419B (en)
Inventor
陈全
过敏意
崔炜皞
沈耀
姚斌
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201911088741.9A priority Critical patent/CN110837419B/en
Publication of CN110837419A publication Critical patent/CN110837419A/en
Application granted granted Critical
Publication of CN110837419B publication Critical patent/CN110837419B/en
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5017Task decomposition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an inference engine system and method based on elastic batch processing, and an electronic device. The inference engine method based on elastic batch processing comprises the following steps: acquiring inference request data input by a user; acquiring the maximum parallel batch size and the number of pending inference requests; organizing the pending inference request data into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, waking the sub-engine in the deep neural network inference engine module corresponding to the batch size, and having the sub-engine process the pending inference requests. The invention minimizes the response latency and maximizes the throughput of the engine system without adding hardware devices such as graphics processors.

Description

Inference engine system and method based on elastic batch processing and electronic equipment
Technical Field
The invention relates to the technical field of processors, in particular to the field of GPUs (graphics processing units), and specifically to an inference engine system and method based on elastic batch processing and an electronic device.
Background
With the deployment of large numbers of compute-intensive applications such as speech recognition, machine translation and personal assistants, mainstream private data centers and public cloud platforms have begun to rely heavily on coprocessors such as GPUs to compensate for the insufficient computing power of traditional CPUs. GPUs were originally dedicated processors designed for graphics computation; because their parallelism far exceeds that of conventional CPUs, more and more non-graphics applications are migrating to GPU platforms to meet their rapidly growing computational demands. However, studies have shown that non-graphics applications often lack sufficient parallelism to fully utilize the hardware resources of a GPU, so that hardware resources are wasted. Moreover, as GPU architectures and fabrication processes advance, more and more streaming multiprocessors (SMs) are integrated into a single GPU, making the resource-waste problem even more prominent.
GPUs have been widely used to provide deep-learning-based online services, and how to improve GPU performance, including achieving lower response latency and higher system throughput, has become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above shortcomings of the prior art, it is an object of the present invention to provide an inference engine system and method based on elastic batch processing and an electronic device for improving the performance of a GPU engine.
To achieve the above and other related objects, an embodiment of the present invention provides an inference engine method based on elastic batch processing, including: acquiring inference request data input by a user; acquiring the maximum parallel batch size and the number of pending inference requests; and organizing the pending inference request data as needed into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, waking the sub-engine in the deep neural network inference engine module corresponding to the batch size, and having that sub-engine process the pending inference requests.
In an embodiment of the present application, the inference engine method based on elastic batch processing further includes: constructing a deep neural network inference model; and configuring the sub-engines in the deep neural network inference engine module according to the deep neural network inference model; wherein the deep neural network inference engine module comprises a plurality of sub-engines that process different batch sizes, and the sub-engines share the parameters of the deep neural network inference model.
In an embodiment of the present application, the inference engine method based on elastic batch processing further includes: storing the inference request data input by the user into a video memory buffer pool consisting of a contiguous video memory address space, wherein the contiguous video memory address space is divided into video memory blocks of equal size and each video memory block stores the data of one pending inference request.
In one embodiment of the present application, the maximum parallel batch size is obtained according to the longest acceptable response time for processing an inference request.
In an embodiment of the present application, organizing the pending inference request data as needed into batches of appropriate size and waking the sub-engine in the deep neural network inference engine module corresponding to the batch size includes: detecting whether an idle sub-engine exists in the engine queue of the deep neural network inference engine module, and if so, selecting the number of schedulable pending inference requests from the video memory buffer pool; organizing the pending inference request data into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size and the number of pending inference requests; and scheduling, from among the idle sub-engines, the sub-engine corresponding to the batch size.
In an embodiment of the present application, if a single sub-engine cannot process all pending inference requests at once, sub-engines corresponding to the remaining pending requests are further obtained from the idle sub-engines and continue to process the remaining requests.
An embodiment of the present invention further provides an inference engine system based on elastic batch processing, including: a video memory buffer pool module, used for acquiring and storing the inference request data input by a user; an inference engine module, used for configuring a plurality of sub-engines in the deep neural network inference engine module; and an elastic batch scheduling module, used for acquiring the maximum parallel batch size and the number of pending inference requests, organizing the pending inference request data as needed into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, and waking the sub-engine in the deep neural network inference engine module corresponding to the batch size, so that the sub-engine processes the pending inference requests.
In one embodiment of the present application, the maximum parallel batch size is obtained according to the longest acceptable response time for processing an inference request.
In an embodiment of the present application, the elastic batch scheduling module includes: an idle engine management unit, used for detecting whether an idle sub-engine exists in the engine queue of the deep neural network inference engine module and, if so, selecting the number of schedulable pending inference requests from the video memory buffer pool; a batch data processing unit, used for organizing the pending inference request data as needed into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size and the number of pending inference requests; and an engine scheduling unit, used for scheduling, from among the idle sub-engines, the sub-engine corresponding to the batch size.
Embodiments of the present invention also provide an electronic device, which includes a processor and a memory, where the memory stores program instructions, and the processor executes the program instructions to implement the inference engine method based on elastic batch processing as described above.
As described above, the inference engine system and method based on elastic batch processing and the electronic device according to the present invention have the following advantages:
1. The invention establishes a system comprising a video memory buffer pool module, a multi-granularity inference engine module and an elastic batch scheduling module, realizes a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing, and minimizes the response latency and maximizes the throughput of the whole engine system without adding hardware devices such as graphics processors.
2. The results of the invention can support the deployment of emerging deep neural network technology; they can be used to build a commercially meaningful neural network inference task scheduling system based on the elastic batch processing technique, and they simplify the optimization of the neural network inference scheduling service for users.
Drawings
FIG. 1 is a schematic flow chart of the inference engine method based on elastic batch processing according to the present invention.
FIG. 2 is a schematic flow chart of scheduling a sub-engine in the inference engine method based on elastic batch processing according to the present invention.
FIG. 3 and FIG. 4 are schematic diagrams illustrating an embodiment of the inference engine method based on elastic batch processing according to the present invention.
FIG. 5 is a block diagram illustrating the overall structure of the inference engine system based on elastic batch processing according to the present invention.
FIG. 6 is a block diagram of the elastic batch scheduling module in the inference engine system based on elastic batch processing according to the present invention.
Fig. 7 is a schematic structural diagram of an electronic terminal according to an embodiment of the present application.
Description of the element reference numerals
100 inference engine system based on elastic batch processing
110 video memory buffer pool module
120 inference engine module
130 elastic batch scheduling module
131 idle engine management unit
132 batch data processing unit
133 engine scheduling unit
1101 processor
1102 memory
Steps S110 to S130
Steps S131 to S134
Detailed Description
The embodiments of the present invention are described below with reference to specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure of this specification. The invention may also be implemented or applied through other different embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit and scope of the invention. It should be noted that the features of the following embodiments and examples may be combined with one another where no conflict arises.
It should be noted that the drawings provided in the following embodiments merely illustrate the basic idea of the invention in a schematic way; the drawings show only the components related to the invention rather than the number, shape and size of the components in an actual implementation, and in practice the type, quantity and proportion of the components may vary and the component layout may be more complex.
This embodiment aims to provide an inference engine system and method based on elastic batch processing and an electronic device for improving the performance of a GPU engine.
This embodiment designs and implements a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing. It provides three modules: a video memory buffer pool on the graphics processor side, a multi-granularity deep neural network inference engine, and an elastic batch scheduler. These respectively decouple computation from data transfer, execute batch tasks of different sizes in parallel, and elastically organize batches of different sizes, so that application developers can deploy the system easily and obtain better performance, including lower response latency and higher system throughput, with little parameter configuration.
The principle and implementation of the inference engine system and method based on elastic batch processing and the electronic device of the present invention are described in detail below, so that those skilled in the art can understand them without creative effort.
As shown in fig. 1, the present embodiment provides an inference engine method based on elastic batch processing, which includes the following steps:
Step S110, acquiring the inference request data input by a user;
Step S120, acquiring the maximum parallel batch size and the number of pending inference requests;
Step S130, organizing the pending inference request data into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, waking the sub-engine in the deep neural network inference engine module corresponding to the batch size, and having the sub-engine process the pending inference requests.
The following describes steps S110 to S130 of the inference engine method based on elastic batch processing according to this embodiment in detail.
Step S110, acquiring the inference request data input by the user.
In this embodiment, the inference engine method based on elastic batch processing further includes: storing the inference request data input by the user into a video memory buffer pool consisting of a contiguous video memory address space, wherein the contiguous address space is divided into video memory blocks of equal size and each video memory block stores the data of one pending inference request.
Specifically, in this embodiment, the video memory buffer pool mainly consists of a segment of contiguous video memory address space allocated on the graphics processor side; this contiguous address space is divided into video memory blocks of equal size, each storing the input data of one pending inference request. Pending inference requests are cached in the video memory of the graphics processor, and the input data of several different pending requests can further be organized into batched input data.
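For illustration only, the following Python sketch models the buffer pool layout just described: a contiguous region divided into equal-size slots, one slot per pending request. The class name GpuBufferPool, the slot bookkeeping, and the use of a host bytearray in place of real device memory are assumptions of this sketch and do not come from the patent.

```python
# Minimal sketch of a fixed-slot video memory buffer pool. A host bytearray
# stands in for the contiguous GPU address space; a real implementation
# would make one device allocation and hand out device pointers per slot.

class GpuBufferPool:
    def __init__(self, slot_bytes: int, num_slots: int):
        self.slot_bytes = slot_bytes
        self.base = bytearray(slot_bytes * num_slots)   # contiguous region
        self.free_slots = list(range(num_slots))        # indices of free blocks
        self.pending = []                               # slots holding queued requests

    def put_request(self, payload: bytes) -> int:
        """Copy one inference request's input data into a free slot."""
        if not self.free_slots or len(payload) > self.slot_bytes:
            raise RuntimeError("pool full or request larger than one slot")
        slot = self.free_slots.pop()
        offset = slot * self.slot_bytes
        self.base[offset:offset + len(payload)] = payload
        self.pending.append(slot)
        return slot

    def take_contiguous(self, max_count: int) -> list[int]:
        """Dequeue up to max_count queued slots to form one batch.

        For simplicity this takes the oldest queued slots; the description
        above additionally prefers slots whose addresses are contiguous.
        """
        batch, self.pending = self.pending[:max_count], self.pending[max_count:]
        return batch

    def release(self, slots: list[int]) -> None:
        """Return slots to the free list once a sub-engine has finished."""
        self.free_slots.extend(slots)
```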
Step S120, acquiring the maximum parallel batch size and the number of pending inference requests.
In this embodiment, the maximum parallel batch size is obtained from the longest acceptable response time for processing an inference request.
To meet the quality-of-service requirement of the whole elastic batch processing system, the application developer sets the longest acceptable response time for the inference request service, from which the largest acceptable parallel batch size is derived; that is, the user specifies, according to his own needs, the longest acceptable processing delay of a deep neural network inference request as the quality-of-service requirement. During subsequent scheduling, this maximum parallel batch size is used as the upper limit on the sum of the batch sizes of all working sub-engines, and the whole system is scheduled against this limit.
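As an illustration, the snippet below shows one simple way such a maximum parallel batch size could be derived: profile the engine latency at candidate batch sizes offline and take the largest size whose latency still fits the user's bound. The helper name max_parallel_batch and the latency figures are assumptions of this sketch, not values from the patent.

```python
# Sketch: derive the maximum parallel batch size from the QoS latency bound.
# The profiled latency table contains made-up placeholder numbers.

def max_parallel_batch(profiled_latency_ms: dict[int, float],
                       qos_latency_ms: float) -> int:
    feasible = [b for b, t in profiled_latency_ms.items() if t <= qos_latency_ms]
    if not feasible:
        raise ValueError("QoS bound tighter than the smallest batch latency")
    return max(feasible)

# Example with hypothetical profiling data:
profile = {1: 4.0, 2: 5.5, 4: 8.0, 8: 13.0, 16: 24.0, 32: 46.0}
print(max_parallel_batch(profile, qos_latency_ms=25.0))  # -> 16
```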
Step S130, organizing the pending inference request data into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, waking the sub-engine in the deep neural network inference engine module corresponding to the batch size, and having the sub-engine process the pending inference requests.
Specifically, as shown in fig. 2, in this embodiment, organizing the pending inference request data as needed into batches of appropriate size and waking the sub-engine in the deep neural network inference engine module corresponding to the batch size includes the following steps.
Step S131, detecting whether an idle sub-engine exists in the engine queue of the deep neural network inference engine module; if so, executing step S132 to select the number of schedulable pending inference requests from the video memory buffer pool; if not, continuing to detect whether an idle sub-engine exists in the engine queue of the deep neural network inference engine module.
Checking the idle sub-engine queue status: the status of the idle sub-engine queue in the multi-granularity inference engine is checked; if there is no idle sub-engine, the queue is polled until one becomes available; if there is an idle sub-engine, the method proceeds to the next step and attempts scheduling.
Step S133, organizing the pending inference request data into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size and the number of pending inference requests.
Checking the buffer pool status: the status of the video memory buffer pool is checked; if there is no pending request, the method returns to detecting the idle sub-engine queue status; if there are requests to be scheduled, schedulable requests are selected from the video memory buffer pool.
Selecting schedulable requests from the video memory buffer pool: to guarantee that the data of different requests is stored contiguously, the largest set of requests that can be scheduled from the existing storage locations is selected from the video memory buffer pool.
Step S134, scheduling, from among the idle sub-engines, the sub-engine corresponding to the batch size.
Scheduling an appropriate sub-engine to process the requests: the inference request data is organized into a batch of appropriate size according to the idle sub-engines in the multi-granularity inference engine, and one of those sub-engines is invoked to process the data.
In this embodiment, if a single sub-engine cannot process all pending inference requests at once, sub-engines corresponding to the remaining requests are further obtained from the idle sub-engines and continue to process the remaining requests.
Schedulable requests and idle sub-engines remain: the selected schedulable requests may not all be processed by the single sub-engine scheduled in one pass. If schedulable requests remain and idle sub-engines remain, the method jumps to step S134 to continue scheduling; otherwise, it jumps to step S131 and repeats the scheduling process. A sketch of this batch-planning and dispatch step is given below.
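The following Python sketch shows one way the batch-planning step (steps S133 and S134) could assign queued requests to idle sub-engines under the maximum parallel batch size. The function name plan_dispatch, the preference for the largest sub-engine that fits the remaining requests, and the budget accounting are assumptions made for this sketch; the patent does not prescribe a particular packing rule.

```python
# Hedged sketch of the batch-planning step: assign queued requests to idle
# sub-engines without letting the summed batch size of working sub-engines
# exceed the QoS-derived maximum parallel batch size.

def plan_dispatch(idle_engine_sizes: list[int],
                  pending_requests: int,
                  parallel_budget: int) -> list[tuple[int, int]]:
    """Return (engine_batch_size, requests_assigned) pairs."""
    plan = []
    idle = sorted(idle_engine_sizes)                    # ascending batch sizes
    while pending_requests > 0 and idle:
        usable = [s for s in idle if s <= parallel_budget]
        if not usable:
            break                                       # QoS cap would be exceeded
        fitting = [s for s in usable if s <= pending_requests]
        # prefer the largest engine not bigger than the remaining requests,
        # otherwise the smallest usable engine
        size = max(fitting) if fitting else min(usable)
        idle.remove(size)
        take = min(size, pending_requests)
        plan.append((size, take))
        pending_requests -= take
        parallel_budget -= size
    return plan

# Example: idle sub-engines of batch size 1, 4 and 8, nine queued requests,
# parallel budget of 12 -> the batch-8 engine takes 8 requests and the
# batch-1 engine takes the remaining one.
print(plan_dispatch([1, 4, 8], pending_requests=9, parallel_budget=12))
# [(8, 8), (1, 1)]
```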
In this embodiment, the inference engine method based on elastic batch processing further includes:
constructing a deep neural network inference model; and configuring the sub-engines in the deep neural network inference engine module according to the deep neural network inference model.
The deep neural network inference engine module comprises a plurality of sub-engines that process different batch sizes, and the sub-engines share the parameters of the deep neural network inference model.
In this embodiment, the deep neural network inference engine module is a multi-granularity inference engine composed of several deep neural network inference sub-engines that can process different batch sizes. The sub-engines in the multi-granularity inference engine share the parameters, such as the weights, of the same deep neural network model; the batch sizes and the number of sub-engines can be configured and tuned by the application developer, and when the developer does not specify them, they are set by the elastic batch scheduling module by default. The multi-granularity inference engine can execute several sub-engines on the graphics processor in parallel at the same time; the sub-engines influence one another and jointly determine the processing time of an inference request.
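The Python sketch below illustrates this structure: several sub-engines, each fixed to a different batch size, referencing a single shared parameter set. The class names, the placeholder run method, and the default power-of-two batch sizes are illustrative assumptions; a real sub-engine would wrap a framework-specific GPU execution context bound to the shared weights.

```python
# Sketch of a multi-granularity inference engine: one shared weight set,
# several sub-engines, each configured for a different fixed batch size.

class SubEngine:
    def __init__(self, batch_size: int, shared_weights: dict):
        self.batch_size = batch_size
        self.weights = shared_weights          # shared reference, not a copy
        self.busy = False

    def run(self, batch: list) -> list:
        """Placeholder inference: a real sub-engine would launch the DNN on
        the GPU for up to batch_size inputs."""
        assert len(batch) <= self.batch_size
        self.busy = True
        try:
            return [("output_for", item) for item in batch]
        finally:
            self.busy = False


class MultiGranularityEngine:
    def __init__(self, shared_weights: dict, batch_sizes=(1, 2, 4, 8)):
        # One sub-engine per configured batch size; the sizes and their
        # number can be chosen by the developer or by the scheduler.
        self.sub_engines = [SubEngine(b, shared_weights) for b in batch_sizes]

    def idle_engines(self) -> list:
        return [e for e in self.sub_engines if not e.busy]
```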
As shown in fig. 3, this embodiment further provides an inference engine system 100 based on elastic batch processing, which includes a video memory buffer pool module 110, an inference engine module 120 and an elastic batch scheduling module 130.
That is, the inference engine system 100 based on elastic batch processing in this embodiment includes a video memory buffer pool on the graphics processor side, a multi-granularity deep neural network inference engine, and an elastic batch scheduler. The core of this embodiment is the design of the software system of an elastic batch processing deep neural network inference engine.
In this embodiment, the video memory buffer pool module 110 is configured to acquire and store the inference request data input by a user.
Specifically, the video memory buffer pool module 110 mainly consists of a segment of contiguous video memory address space allocated on the graphics processor side; this contiguous address space is divided into video memory blocks of equal size for storing the input data of each pending inference request. The module caches pending inference requests in the video memory of the graphics processor and can also organize the input data of several different inference requests into batched input data.
The inference engine module 120 is used to configure a plurality of sub-engines in the deep neural network inference engine module 120.
The inference engine module 120 is a multi-granularity inference engine composed of several deep neural network inference sub-engines that can process different batch sizes. The sub-engines in the multi-granularity inference engine share the parameters, such as the weights, of the same deep neural network model; the batch sizes and the number of sub-engines can be configured and tuned by the application developer, and when the developer does not specify them, they are set by the elastic batch scheduling module by default. The multi-granularity inference engine can execute several sub-engines on the graphics processor in parallel at the same time; the sub-engines influence one another and jointly determine the processing time of an inference request.
In this embodiment, the elastic batch scheduling module 130 is configured to acquire the maximum parallel batch size and the number of pending inference requests, organize the pending inference request data as needed into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, and wake the sub-engine in the deep neural network inference engine module 120 corresponding to the batch size, so that the sub-engine processes the pending inference requests.
In this embodiment, the maximum parallel batch size is obtained from the longest acceptable response time for processing an inference request.
Specifically, in this embodiment, as shown in fig. 6, the elastic batch scheduling module 130 includes an idle engine management unit 131, a batch data processing unit 132, and an engine scheduling unit 133.
The idle engine management unit 131 is configured to detect whether an idle sub-engine exists in the engine queue of the deep neural network inference engine module 120 and, if so, to select the number of schedulable pending inference requests from the video memory buffer pool.
The batch data processing unit 132 is configured to organize the pending inference request data as needed into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size, and the number of pending inference requests.
The engine scheduling unit 133 is configured to schedule, from among the idle sub-engines, the sub-engine corresponding to the batch size.
The elastic batch scheduling module 130 is responsible for coordinating the video memory buffer pool module 110 and the inference engine module 120. To meet the quality-of-service requirement of the whole elastic batch processing system, the application developer sets the longest acceptable response time for the inference request service, from which the system derives the largest acceptable parallel batch size. According to this maximum parallel batch size and the number of pending inference requests currently in the video memory buffer pool, the elastic batch scheduling module 130 organizes the pending inference request data cached in the pool into batches of appropriate size as needed. Once the data is prepared, the elastic batch scheduling module 130 wakes the sub-engine of the deep neural network inference engine module 120 corresponding to the batch size, and this sub-engine is responsible for processing the inference requests of the batch. The scheduling algorithm used by the elastic batch scheduling module 130 is given as Algorithm 1 in Table 1 below.
TABLE 1 Elastic batch scheduling algorithm
(Algorithm 1 is reproduced only as an image, Figure BDA0002266228460000071, in the original publication.)
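Because Algorithm 1 survives only as an image, the loop below is a hedged reconstruction assembled from the surrounding prose (check idle sub-engines, check the buffer pool, select schedulable requests, dispatch, repeat); it is one plausible reading, not the patented algorithm. It reuses the hypothetical GpuBufferPool, MultiGranularityEngine and plan_dispatch helpers sketched earlier; none of these names appear in the patent.

```python
# Hedged reconstruction of the elastic batch scheduling loop.

import time

def scheduling_loop(pool, engine, max_parallel_batch, poll_interval_s=0.001):
    while True:
        idle = engine.idle_engines()
        if not idle:                                  # no idle sub-engine: poll
            time.sleep(poll_interval_s)
            continue
        if not pool.pending:                          # no queued request: poll
            time.sleep(poll_interval_s)
            continue
        # remaining budget = maximum parallel batch size minus the summed
        # batch sizes of the sub-engines that are currently working
        budget = max_parallel_batch - sum(
            e.batch_size for e in engine.sub_engines if e.busy)
        plan = plan_dispatch([e.batch_size for e in idle],
                             len(pool.pending), budget)
        available = list(idle)
        for size, count in plan:
            sub = next(e for e in available if e.batch_size == size)
            available.remove(sub)
            slots = pool.take_contiguous(count)
            sub.run(slots)                            # synchronous stand-in for an
            pool.release(slots)                       # asynchronous GPU launch
        # leftover schedulable requests wait for the next pass of the loop
```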
As shown in fig. 5 and fig. 6, to further illustrate the implementation principle, the overall execution flow of this example is described in detail below.
The process comprises the following steps:
1) The user constructs an inference model: this belongs to the multi-granularity inference engine module 120. The user constructs an inference model according to his own inference requests.
2) The user specifies the quality-of-service requirement: this belongs to the elastic batch scheduling module 130. The user specifies, according to his own needs, the longest acceptable processing delay of a deep neural network inference request as the quality-of-service requirement.
3) Setting the parameters required by the inference system: this belongs to the elastic batch scheduling module 130. The module calculates the maximum batch size that still meets the user-defined quality-of-service requirement; during subsequent scheduling, the elastic batch scheduler uses this parameter as the upper limit on the sum of the batch sizes of all working sub-engines and schedules the whole inference system against this limit. At the same time, the elastic batch scheduling module 130 sets the number of sub-engines in the multi-granularity inference engine, the batch size of each sub-engine, and the size of the buffer pool in the video memory buffer pool module 110 (a configuration sketch follows this list).
4) Receiving inference requests: this belongs to the video memory buffer pool module 110. Inference requests are received and cached in the buffer pool on the graphics processor side.
5) Checking the idle sub-engine queue status: this belongs to the elastic batch scheduling module 130. The status of the idle sub-engine queue in the multi-granularity inference engine is checked; if there is no idle sub-engine, the queue is polled until one becomes available; if there is an idle sub-engine, the flow proceeds to the next step and attempts scheduling.
6) Checking the buffer pool status: this belongs to the elastic batch scheduling module 130. The status of the video memory buffer pool is checked; if there is no pending request, the flow jumps back to 5) to check the idle sub-engine queue status; if there are requests to be scheduled, the flow proceeds to the next step.
7) Selecting schedulable requests from the video memory buffer pool: this belongs to the elastic batch scheduling module 130. To guarantee that the data of different requests is stored contiguously, the elastic batch scheduling module 130 selects from the video memory buffer pool the largest set of requests that can be scheduled from the existing storage locations.
8) Scheduling an appropriate sub-engine to process the requests: this belongs to the elastic batch scheduling module 130 and the multi-granularity inference engine. The elastic batch scheduling module 130 organizes the inference request data into a batch of appropriate size according to the idle sub-engines in the multi-granularity inference engine and invokes one of those sub-engines to process the data.
9) Schedulable requests and idle sub-engines remain: this belongs to the elastic batch scheduling module 130. The schedulable requests selected in 7) may not all be processed by the single sub-engine scheduled in 8). If schedulable requests remain and idle sub-engines remain, the flow jumps to 8) to continue scheduling; otherwise, it jumps to 5) and the scheduling process repeats.
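The configuration step 3) can be illustrated with the short sketch below, which derives a parameter set from the QoS-based maximum parallel batch size. Power-of-two sub-engine batch sizes and a buffer pool sized at a small multiple of that maximum are illustrative defaults assumed here; the patent leaves these choices to the developer or to the elastic batch scheduling module.

```python
# Sketch of step 3): deriving inference-system parameters from the
# QoS-derived maximum parallel batch size b_max.

def derive_config(b_max: int, pool_factor: int = 4) -> dict:
    sizes = []
    b = 1
    while b <= b_max:
        sizes.append(b)
        b *= 2
    return {
        "sub_engine_batch_sizes": sizes,       # one sub-engine per size
        "max_parallel_batch": b_max,           # cap on summed working batch sizes
        "buffer_pool_slots": pool_factor * b_max,
    }

print(derive_config(16))
# {'sub_engine_batch_sizes': [1, 2, 4, 8, 16], 'max_parallel_batch': 16,
#  'buffer_pool_slots': 64}
```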
Embodiments of the present invention also provide an electronic device, such as, but not limited to, a medical examination device or an image processing device. As shown in fig. 7, the electronic device includes a processor 1101 and a memory 1102; the memory 1102 is connected to the processor 1101 through a system bus and is used for storing a computer program, and the processor 1101 is used for running the computer program so that the electronic device executes the inference engine method based on elastic batch processing. The inference engine method based on elastic batch processing has been described in detail above and is not repeated here.
The inference engine method based on elastic batch processing can be applied to many types of electronic devices. The electronic device is, for example, a controller, such as an ARM (Advanced RISC Machines) controller, an FPGA (Field Programmable Gate Array) controller, an SoC (System on Chip) controller, a DSP (Digital Signal Processing) controller, or an MCU (Micro Controller Unit) controller. The electronic device may also be, for example, a computer that includes components such as memory, a memory controller, one or more processing units (CPUs), a peripheral interface, RF circuitry, audio circuitry, speakers, a microphone, an input/output (I/O) subsystem, a display screen, other output or control devices, and external ports; such computers include, but are not limited to, personal computers such as desktop computers, notebook computers, tablet computers, smart phones, smart televisions, and personal digital assistants (PDAs). In other embodiments, the electronic device may also be a server; the server may be arranged on one or more physical servers according to factors such as function and load, or may be formed by a distributed or centralized server cluster, which is not limited in this embodiment.
In an actual implementation, the electronic device is, for example, a device running an Android or iOS operating system, or an operating system such as Palm OS, Symbian, BlackBerry OS, or Windows Phone.
In an exemplary embodiment, the electronic device may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, micro-controllers, microprocessors, cameras, or other electronic components for performing the inference engine method based on elastic batch processing described above.
It should be noted that the above-mentioned system bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus. The communication interface is used for realizing communication between the database access system and other devices (such as a client, a read-write library and a read-only library). The Memory may include a Random Access Memory (RAM), and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The Processor 1101 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
Furthermore, this embodiment also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the inference engine method based on elastic batch processing. The inference engine method based on elastic batch processing has been described in detail above and is not repeated here.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
In summary, the present invention establishes a system comprising three parts, the video memory buffer pool module 110, the multi-granularity inference engine module 120, and the elastic batch scheduling module 130, thereby realizing a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing that minimizes the response latency and maximizes the throughput of the whole engine system without adding hardware devices such as graphics processors. The results of the invention can support the deployment of emerging deep neural network technology; they can be used to build a commercially meaningful neural network inference task scheduling system based on the elastic batch processing technique, and they simplify the optimization of the neural network inference scheduling service for users. The invention therefore effectively overcomes various defects in the prior art and has high industrial value.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. An inference engine method based on elastic batch processing, characterized in that the inference engine method based on elastic batch processing comprises:
acquiring inference request data input by a user;
acquiring the maximum parallel batch size and the number of pending inference requests;
organizing the pending inference request data as needed into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, waking a sub-engine in a deep neural network inference engine module corresponding to the batch size, and having the sub-engine process the pending inference requests.
2. The inference engine method based on elastic batch processing according to claim 1, further comprising:
constructing a deep neural network inference model;
configuring sub-engines in the deep neural network inference engine module according to the deep neural network inference model;
wherein the deep neural network inference engine module comprises a plurality of sub-engines that process different batch sizes, and the sub-engines share the parameters of the deep neural network inference model.
3. The inference engine method based on elastic batch processing according to claim 1, further comprising:
storing the inference request data input by the user into a video memory buffer pool consisting of a contiguous video memory address space;
wherein the contiguous video memory address space is divided into video memory blocks of equal size, and each video memory block stores the data of one pending inference request.
4. The inference engine method based on elastic batch processing according to claim 1, wherein the maximum parallel batch size is obtained according to the longest acceptable response time for processing an inference request.
5. The inference engine method based on elastic batch processing according to claim 3, wherein organizing the pending inference request data as needed into batches of appropriate size and waking the sub-engine in the deep neural network inference engine module corresponding to the batch size comprises:
detecting whether an idle sub-engine exists in an engine queue of the deep neural network inference engine module, and if so, selecting the number of schedulable pending inference requests from the video memory buffer pool;
organizing the pending inference request data into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size and the number of pending inference requests;
and scheduling, from among the idle sub-engines, the sub-engine corresponding to the batch size.
6. The inference engine method based on elastic batch processing according to claim 5, wherein if a single sub-engine cannot process all the pending inference requests at once, sub-engines corresponding to the remaining pending inference requests are further obtained from the idle sub-engines and continue to process the remaining requests.
7. An inference engine system based on elastic batch processing, characterized in that the inference engine system based on elastic batch processing comprises:
a video memory buffer pool module, used for acquiring and storing the inference request data input by a user;
an inference engine module, used for configuring a plurality of sub-engines in a deep neural network inference engine module;
and an elastic batch scheduling module, used for acquiring the maximum parallel batch size and the number of pending inference requests, organizing the pending inference request data as needed into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, and waking a sub-engine in the deep neural network inference engine module corresponding to the batch size, so that the sub-engine processes the pending inference requests.
8. The inference engine system based on elastic batch processing according to claim 7, wherein the maximum parallel batch size is obtained according to the longest acceptable response time for processing an inference request.
9. The inference engine system based on elastic batch processing according to claim 7, wherein the elastic batch scheduling module comprises:
an idle engine management unit, used for detecting whether an idle sub-engine exists in an engine queue of the deep neural network inference engine module and, if so, selecting the number of schedulable pending inference requests from the video memory buffer pool;
a batch data processing unit, used for organizing the pending inference request data as needed into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size and the number of pending inference requests;
and an engine scheduling unit, used for scheduling, from among the idle sub-engines, the sub-engine corresponding to the batch size.
10. An electronic device comprising a processor and a memory, the memory storing program instructions, the processor executing the program instructions to implement the inference engine method based on elastic batch processing of any one of claims 1 to 6.
CN201911088741.9A 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment Active CN110837419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911088741.9A CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911088741.9A CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Publications (2)

Publication Number Publication Date
CN110837419A true CN110837419A (en) 2020-02-25
CN110837419B CN110837419B (en) 2023-05-19

Family

ID=69574710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911088741.9A Active CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Country Status (1)

Country Link
CN (1) CN110837419B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342808A (en) * 2021-05-26 2021-09-03 电子科技大学 Knowledge graph inference engine architecture system based on electromechanical equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102104626A (en) * 2009-12-22 2011-06-22 英特尔公司 Systems and methods for energy efficient load balancing at server clusters
CN107273195A (en) * 2017-05-24 2017-10-20 上海艾融软件股份有限公司 A kind of batch processing method of big data, device and computer system
CN109919315A (en) * 2019-03-13 2019-06-21 科大讯飞股份有限公司 A kind of forward inference method, apparatus, equipment and the storage medium of neural network
CN110019071A (en) * 2017-11-15 2019-07-16 北大方正集团有限公司 Data processing method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102104626A (en) * 2009-12-22 2011-06-22 英特尔公司 Systems and methods for energy efficient load balancing at server clusters
CN107273195A (en) * 2017-05-24 2017-10-20 上海艾融软件股份有限公司 A kind of batch processing method of big data, device and computer system
CN110019071A (en) * 2017-11-15 2019-07-16 北大方正集团有限公司 Data processing method and device
CN109919315A (en) * 2019-03-13 2019-06-21 科大讯飞股份有限公司 A kind of forward inference method, apparatus, equipment and the storage medium of neural network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342808A (en) * 2021-05-26 2021-09-03 电子科技大学 Knowledge graph inference engine architecture system based on electromechanical equipment
CN113342808B (en) * 2021-05-26 2022-11-08 电子科技大学 Knowledge graph inference engine architecture system based on electromechanical equipment

Also Published As

Publication number Publication date
CN110837419B (en) 2023-05-19

Similar Documents

Publication Publication Date Title
KR102197874B1 (en) System on chip including multi-core processor and thread scheduling method thereof
CN104794194B (en) A kind of distributed heterogeneous concurrent computational system towards large scale multimedia retrieval
US20140040532A1 (en) Stacked memory device with helper processor
CN109165728B (en) Basic computing unit and computing method of convolutional neural network
CN113157410A (en) Thread pool adjusting method and device, storage medium and electronic equipment
CN113590508B (en) Dynamic reconfigurable memory address mapping method and device
US10037225B2 (en) Method and system for scheduling computing
CN115686805A (en) GPU resource sharing method and device, and GPU resource sharing scheduling method and device
US11635904B2 (en) Matrix storage method, matrix access method, apparatus and electronic device
CN110908797B (en) Call request data processing method, device, equipment, storage medium and system
CN110837419B (en) Reasoning engine system and method based on elastic batch processing and electronic equipment
CN110738317A (en) FPGA-based deformable convolution network operation method, device and system
US11954518B2 (en) User-defined metered priority queues
US20120151145A1 (en) Data Driven Micro-Scheduling of the Individual Processing Elements of a Wide Vector SIMD Processing Unit
CN112084023A (en) Data parallel processing method, electronic equipment and computer readable storage medium
EP4035016A1 (en) Processor and interrupt controller therein
US20160147532A1 (en) Method for handling interrupts
CN112130977B (en) Task scheduling method, device, equipment and medium
US20210200584A1 (en) Multi-processor system, multi-core processing device, and method of operating the same
CN112766475A (en) Processing unit and artificial intelligence processor
US10073723B2 (en) Dynamic range-based messaging
CN117057411B (en) Large language model training method, device, equipment and storage medium
US11899551B1 (en) On-chip software-based activity monitor to configure throttling at a hardware-based activity monitor
CN113032154B (en) Scheduling method and device for virtual CPU, electronic equipment and storage medium
CN116360941A (en) Multi-core DSP-oriented parallel computing resource organization scheduling method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant