CN110837419A - Inference engine system and method based on elastic batch processing and electronic equipment - Google Patents

Inference engine system and method based on elastic batch processing and electronic equipment

Info

Publication number
CN110837419A
Authority
CN
China
Prior art keywords
engine
batch processing
sub
inference
batch
Prior art date
Legal status
Granted
Application number
CN201911088741.9A
Other languages
Chinese (zh)
Other versions
CN110837419B (en)
Inventor
陈全
过敏意
崔炜皞
沈耀
姚斌
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201911088741.9A priority Critical patent/CN110837419B/en
Publication of CN110837419A publication Critical patent/CN110837419A/en
Application granted granted Critical
Publication of CN110837419B publication Critical patent/CN110837419B/en
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5017Task decomposition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an inference engine system and method based on elastic batch processing, and an electronic device. The inference engine method based on elastic batch processing comprises the following steps: acquiring inference request data input by a user; acquiring the maximum parallel batch size and the number of pending inference requests; organizing the pending inference request data into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, waking the sub-engine in the deep neural network inference engine module corresponding to the batch size, and having the sub-engine process the pending inference requests. The invention minimizes the response latency and maximizes the throughput of the engine system without adding hardware devices such as graphics processors.

Description

Inference engine system and method based on elastic batch processing and electronic equipment
Technical Field
The invention relates to the technical field of processors, in particular to the field of GPUs (graphics processing units), and specifically to an inference engine system and method based on elastic batch processing and an electronic device.
Background
With the deployment of large numbers of compute-intensive applications such as speech recognition, machine translation and personal assistants, mainstream private data centers and public cloud platforms have begun to rely heavily on coprocessors such as GPUs to compensate for the insufficient computing power of traditional CPUs. GPUs were originally dedicated processors designed for graphics computation; because their parallelism far exceeds that of conventional CPUs, more and more non-graphics applications are migrating to GPU platforms to meet their rapidly growing computational demands. However, studies have shown that non-graphics applications often lack sufficient parallelism to fully utilize the hardware resources of a GPU, so that hardware resources are wasted. Moreover, as GPU architectures and fabrication processes advance, more and more streaming multiprocessors (SMs) are integrated into a single GPU, making the resource-waste problem even more prominent.
GPUs have been widely used to provide deep-learning-based online services, and how to improve GPU performance, including achieving lower response latency and higher system throughput, has become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above shortcomings of the prior art, it is an object of the present invention to provide an inference engine system and method based on elastic batch processing and an electronic device for improving the performance of a GPU engine.
To achieve the above and other related objects, an embodiment of the present invention provides an inference engine method based on elastic batch processing, including: acquiring inference request data input by a user; acquiring the maximum parallel batch size and the number of pending inference requests; and organizing the pending inference request data as needed into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, waking the sub-engine in the deep neural network inference engine module corresponding to the batch size, and having that sub-engine process the pending inference requests.
In an embodiment of the present application, the inference engine method based on elastic batch processing further includes: constructing a deep neural network inference model; and configuring the sub-engines in the deep neural network inference engine module according to the deep neural network inference model; wherein the deep neural network inference engine module comprises a plurality of sub-engines that process different batch sizes, and the sub-engines share the parameters of the deep neural network inference model.
In an embodiment of the present application, the inference engine method based on elastic batch processing further includes: storing the inference request data input by the user into a video memory buffer pool consisting of a contiguous video memory address space, wherein the contiguous video memory address space is divided into video memory blocks of equal size and each video memory block stores the data of one pending inference request.
In one embodiment of the present application, the maximum parallel batch size is obtained according to the longest acceptable response time for processing an inference request.
In an embodiment of the present application, organizing the pending inference request data as needed into batches of appropriate size and waking the sub-engine in the deep neural network inference engine module corresponding to the batch size includes: detecting whether an idle sub-engine exists in the engine queue of the deep neural network inference engine module, and if so, selecting the number of schedulable pending inference requests from the video memory buffer pool; organizing the pending inference request data into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size and the number of pending inference requests; and scheduling, from among the idle sub-engines, the sub-engine corresponding to the batch size.
In an embodiment of the present application, if a single sub-engine cannot process all pending inference requests at once, sub-engines corresponding to the remaining pending requests are further obtained from the idle sub-engines and continue to process the remaining requests.
An embodiment of the present invention further provides an inference engine system based on elastic batch processing, including: a video memory buffer pool module, used for acquiring and storing the inference request data input by a user; an inference engine module, used for configuring a plurality of sub-engines in the deep neural network inference engine module; and an elastic batch scheduling module, used for acquiring the maximum parallel batch size and the number of pending inference requests, organizing the pending inference request data as needed into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, and waking the sub-engine in the deep neural network inference engine module corresponding to the batch size, so that the sub-engine processes the pending inference requests.
In one embodiment of the present application, the maximum parallel batch size is obtained according to the longest acceptable response time for processing an inference request.
In an embodiment of the present application, the elastic batch scheduling module includes: an idle engine management unit, used for detecting whether an idle sub-engine exists in the engine queue of the deep neural network inference engine module and, if so, selecting the number of schedulable pending inference requests from the video memory buffer pool; a batch data processing unit, used for organizing the pending inference request data as needed into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size and the number of pending inference requests; and an engine scheduling unit, used for scheduling, from among the idle sub-engines, the sub-engine corresponding to the batch size.
Embodiments of the present invention also provide an electronic device, which includes a processor and a memory, where the memory stores program instructions, and the processor executes the program instructions to implement the inference engine method based on elastic batch processing as described above.
As described above, the inference engine system and method based on elastic batch processing and the electronic device according to the present invention have the following advantages:
1. The invention establishes a system comprising a video memory buffer pool module, a multi-granularity inference engine module and an elastic batch scheduling module, realizes a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing, and minimizes the response latency and maximizes the throughput of the whole engine system without adding hardware devices such as graphics processors.
2. The results of the invention can support the deployment of emerging deep neural network technology; they can be used to build a commercially meaningful neural network inference task scheduling system based on the elastic batch processing technique, and they simplify the optimization of the neural network inference scheduling service for users.
Drawings
FIG. 1 is a schematic flow chart of the inference engine method based on elastic batch processing according to the present invention.
FIG. 2 is a schematic flow chart of scheduling a sub-engine in the inference engine method based on elastic batch processing according to the present invention.
FIG. 3 and FIG. 4 are schematic diagrams illustrating an embodiment of the inference engine method based on elastic batch processing according to the present invention.
FIG. 5 is a block diagram illustrating the overall structure of the inference engine system based on elastic batch processing according to the present invention.
FIG. 6 is a block diagram of the elastic batch scheduling module in the inference engine system based on elastic batch processing according to the present invention.
Fig. 7 is a schematic structural diagram of an electronic terminal according to an embodiment of the present application.
Description of the element reference numerals
100 inference engine system based on elastic batch processing
110 video memory buffer pool module
120 inference engine module
130 elastic batch scheduling module
131 idle engine management unit
132 batch data processing unit
133 engine scheduling unit
1101 processor
1102 memory
Steps S110 to S130
Steps S131 to S134
Detailed Description
The embodiments of the present invention are described below with reference to specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure of this specification. The invention may also be implemented or applied through other different embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit and scope of the invention. It should be noted that the features of the following embodiments and examples may be combined with one another where no conflict arises.
It should be noted that the drawings provided in the following embodiments merely illustrate the basic idea of the invention in a schematic way; the drawings show only the components related to the invention rather than the number, shape and size of the components in an actual implementation, and in practice the type, quantity and proportion of the components may vary and the component layout may be more complex.
This embodiment aims to provide an inference engine system and method based on elastic batch processing and an electronic device for improving the performance of a GPU engine.
This embodiment designs and implements a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing. It provides three modules: a video memory buffer pool on the graphics processor side, a multi-granularity deep neural network inference engine, and an elastic batch scheduler. These respectively decouple computation from data transfer, execute batch tasks of different sizes in parallel, and elastically organize batches of different sizes, so that application developers can deploy the system easily and obtain better performance, including lower response latency and higher system throughput, with little parameter configuration.
The principle and implementation of the inference engine system and method based on elastic batch processing and the electronic device of the present invention are described in detail below, so that those skilled in the art can understand them without creative effort.
As shown in fig. 1, the present embodiment provides an inference engine method based on elastic batch processing, which includes the following steps:
Step S110, acquiring the inference request data input by a user;
Step S120, acquiring the maximum parallel batch size and the number of pending inference requests;
Step S130, organizing the pending inference request data into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, waking the sub-engine in the deep neural network inference engine module corresponding to the batch size, and having the sub-engine process the pending inference requests.
The following describes steps S110 to S130 of the inference engine method based on elastic batch processing according to this embodiment in detail.
Step S110, acquiring the inference request data input by the user.
In this embodiment, the inference engine method based on elastic batch processing further includes: storing the inference request data input by the user into a video memory buffer pool consisting of a contiguous video memory address space, wherein the contiguous address space is divided into video memory blocks of equal size and each video memory block stores the data of one pending inference request.
Specifically, in this embodiment, the video memory buffer pool mainly consists of a segment of contiguous video memory address space allocated on the graphics processor side; this contiguous address space is divided into video memory blocks of equal size, each storing the input data of one pending inference request. Pending inference requests are cached in the video memory of the graphics processor, and the input data of several different pending requests can further be organized into batched input data.
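For illustration only, the following Python sketch models the buffer pool layout just described: a contiguous region divided into equal-size slots, one slot per pending request. The class name GpuBufferPool, the slot bookkeeping, and the use of a host bytearray in place of real device memory are assumptions of this sketch and do not come from the patent.

```python
# Minimal sketch of a fixed-slot video memory buffer pool. A host bytearray
# stands in for the contiguous GPU address space; a real implementation
# would make one device allocation and hand out device pointers per slot.

class GpuBufferPool:
    def __init__(self, slot_bytes: int, num_slots: int):
        self.slot_bytes = slot_bytes
        self.base = bytearray(slot_bytes * num_slots)   # contiguous region
        self.free_slots = list(range(num_slots))        # indices of free blocks
        self.pending = []                               # slots holding queued requests

    def put_request(self, payload: bytes) -> int:
        """Copy one inference request's input data into a free slot."""
        if not self.free_slots or len(payload) > self.slot_bytes:
            raise RuntimeError("pool full or request larger than one slot")
        slot = self.free_slots.pop()
        offset = slot * self.slot_bytes
        self.base[offset:offset + len(payload)] = payload
        self.pending.append(slot)
        return slot

    def take_contiguous(self, max_count: int) -> list[int]:
        """Dequeue up to max_count queued slots to form one batch.

        For simplicity this takes the oldest queued slots; the description
        above additionally prefers slots whose addresses are contiguous.
        """
        batch, self.pending = self.pending[:max_count], self.pending[max_count:]
        return batch

    def release(self, slots: list[int]) -> None:
        """Return slots to the free list once a sub-engine has finished."""
        self.free_slots.extend(slots)
```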
Step S120, acquiring the maximum parallel batch size and the number of pending inference requests.
In this embodiment, the maximum parallel batch size is obtained from the longest acceptable response time for processing an inference request.
To meet the quality-of-service requirement of the whole elastic batch processing system, the application developer sets the longest acceptable response time for the inference request service, from which the largest acceptable parallel batch size is derived; that is, the user specifies, according to his own needs, the longest acceptable processing delay of a deep neural network inference request as the quality-of-service requirement. During subsequent scheduling, this maximum parallel batch size is used as the upper limit on the sum of the batch sizes of all working sub-engines, and the whole system is scheduled against this limit.
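As an illustration, the snippet below shows one simple way such a maximum parallel batch size could be derived: profile the engine latency at candidate batch sizes offline and take the largest size whose latency still fits the user's bound. The helper name max_parallel_batch and the latency figures are assumptions of this sketch, not values from the patent.

```python
# Sketch: derive the maximum parallel batch size from the QoS latency bound.
# The profiled latency table contains made-up placeholder numbers.

def max_parallel_batch(profiled_latency_ms: dict[int, float],
                       qos_latency_ms: float) -> int:
    feasible = [b for b, t in profiled_latency_ms.items() if t <= qos_latency_ms]
    if not feasible:
        raise ValueError("QoS bound tighter than the smallest batch latency")
    return max(feasible)

# Example with hypothetical profiling data:
profile = {1: 4.0, 2: 5.5, 4: 8.0, 8: 13.0, 16: 24.0, 32: 46.0}
print(max_parallel_batch(profile, qos_latency_ms=25.0))  # -> 16
```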
Step S130, organizing the pending inference request data into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, waking the sub-engine in the deep neural network inference engine module corresponding to the batch size, and having the sub-engine process the pending inference requests.
Specifically, as shown in fig. 2, in this embodiment, organizing the pending inference request data as needed into batches of appropriate size and waking the sub-engine in the deep neural network inference engine module corresponding to the batch size includes the following steps.
Step S131, detecting whether an idle sub-engine exists in the engine queue of the deep neural network inference engine module; if so, executing step S132 to select the number of schedulable pending inference requests from the video memory buffer pool; if not, continuing to detect whether an idle sub-engine exists in the engine queue of the deep neural network inference engine module.
Checking the idle sub-engine queue status: the status of the idle sub-engine queue in the multi-granularity inference engine is checked; if there is no idle sub-engine, the queue is polled until one becomes available; if there is an idle sub-engine, the method proceeds to the next step and attempts scheduling.
Step S133, organizing the pending inference request data into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size and the number of pending inference requests.
Checking the buffer pool status: the status of the video memory buffer pool is checked; if there is no pending request, the method returns to detecting the idle sub-engine queue status; if there are requests to be scheduled, schedulable requests are selected from the video memory buffer pool.
Selecting schedulable requests from the video memory buffer pool: to guarantee that the data of different requests is stored contiguously, the largest set of requests that can be scheduled from the existing storage locations is selected from the video memory buffer pool.
Step S134, scheduling, from among the idle sub-engines, the sub-engine corresponding to the batch size.
Scheduling an appropriate sub-engine to process the requests: the inference request data is organized into a batch of appropriate size according to the idle sub-engines in the multi-granularity inference engine, and one of those sub-engines is invoked to process the data.
In this embodiment, if a single sub-engine cannot process all pending inference requests at once, sub-engines corresponding to the remaining requests are further obtained from the idle sub-engines and continue to process the remaining requests.
Schedulable requests and idle sub-engines remain: the selected schedulable requests may not all be processed by the single sub-engine scheduled in one pass. If schedulable requests remain and idle sub-engines remain, the method jumps to step S134 to continue scheduling; otherwise, it jumps to step S131 and repeats the scheduling process. A sketch of this batch-planning and dispatch step is given below.
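The following Python sketch shows one way the batch-planning step (steps S133 and S134) could assign queued requests to idle sub-engines under the maximum parallel batch size. The function name plan_dispatch, the preference for the largest sub-engine that fits the remaining requests, and the budget accounting are assumptions made for this sketch; the patent does not prescribe a particular packing rule.

```python
# Hedged sketch of the batch-planning step: assign queued requests to idle
# sub-engines without letting the summed batch size of working sub-engines
# exceed the QoS-derived maximum parallel batch size.

def plan_dispatch(idle_engine_sizes: list[int],
                  pending_requests: int,
                  parallel_budget: int) -> list[tuple[int, int]]:
    """Return (engine_batch_size, requests_assigned) pairs."""
    plan = []
    idle = sorted(idle_engine_sizes)                    # ascending batch sizes
    while pending_requests > 0 and idle:
        usable = [s for s in idle if s <= parallel_budget]
        if not usable:
            break                                       # QoS cap would be exceeded
        fitting = [s for s in usable if s <= pending_requests]
        # prefer the largest engine not bigger than the remaining requests,
        # otherwise the smallest usable engine
        size = max(fitting) if fitting else min(usable)
        idle.remove(size)
        take = min(size, pending_requests)
        plan.append((size, take))
        pending_requests -= take
        parallel_budget -= size
    return plan

# Example: idle sub-engines of batch size 1, 4 and 8, nine queued requests,
# parallel budget of 12 -> the batch-8 engine takes 8 requests and the
# batch-1 engine takes the remaining one.
print(plan_dispatch([1, 4, 8], pending_requests=9, parallel_budget=12))
# [(8, 8), (1, 1)]
```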
In this embodiment, the inference engine method based on elastic batch processing further includes:
constructing a deep neural network inference model; and configuring the sub-engines in the deep neural network inference engine module according to the deep neural network inference model.
The deep neural network inference engine module comprises a plurality of sub-engines that process different batch sizes, and the sub-engines share the parameters of the deep neural network inference model.
In this embodiment, the deep neural network inference engine module is a multi-granularity inference engine composed of several deep neural network inference sub-engines that can process different batch sizes. The sub-engines in the multi-granularity inference engine share the parameters, such as the weights, of the same deep neural network model; the batch sizes and the number of sub-engines can be configured and tuned by the application developer, and when the developer does not specify them, they are set by the elastic batch scheduling module by default. The multi-granularity inference engine can execute several sub-engines on the graphics processor in parallel at the same time; the sub-engines influence one another and jointly determine the processing time of an inference request.
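The Python sketch below illustrates this structure: several sub-engines, each fixed to a different batch size, referencing a single shared parameter set. The class names, the placeholder run method, and the default power-of-two batch sizes are illustrative assumptions; a real sub-engine would wrap a framework-specific GPU execution context bound to the shared weights.

```python
# Sketch of a multi-granularity inference engine: one shared weight set,
# several sub-engines, each configured for a different fixed batch size.

class SubEngine:
    def __init__(self, batch_size: int, shared_weights: dict):
        self.batch_size = batch_size
        self.weights = shared_weights          # shared reference, not a copy
        self.busy = False

    def run(self, batch: list) -> list:
        """Placeholder inference: a real sub-engine would launch the DNN on
        the GPU for up to batch_size inputs."""
        assert len(batch) <= self.batch_size
        self.busy = True
        try:
            return [("output_for", item) for item in batch]
        finally:
            self.busy = False


class MultiGranularityEngine:
    def __init__(self, shared_weights: dict, batch_sizes=(1, 2, 4, 8)):
        # One sub-engine per configured batch size; the sizes and their
        # number can be chosen by the developer or by the scheduler.
        self.sub_engines = [SubEngine(b, shared_weights) for b in batch_sizes]

    def idle_engines(self) -> list:
        return [e for e in self.sub_engines if not e.busy]
```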
As shown in fig. 3, this embodiment further provides an inference engine system 100 based on elastic batch processing, which includes a video memory buffer pool module 110, an inference engine module 120 and an elastic batch scheduling module 130.
That is, the inference engine system 100 based on elastic batch processing in this embodiment includes a video memory buffer pool on the graphics processor side, a multi-granularity deep neural network inference engine, and an elastic batch scheduler. The core of this embodiment is the design of the software system of an elastic batch processing deep neural network inference engine.
In this embodiment, the video memory buffer pool module 110 is configured to acquire and store the inference request data input by a user.
Specifically, the video memory buffer pool module 110 mainly consists of a segment of contiguous video memory address space allocated on the graphics processor side; this contiguous address space is divided into video memory blocks of equal size for storing the input data of each pending inference request. The module caches pending inference requests in the video memory of the graphics processor and can also organize the input data of several different inference requests into batched input data.
The inference engine module 120 is used to configure a plurality of sub-engines in the deep neural network inference engine module 120.
The inference engine module 120 is a multi-granularity inference engine composed of several deep neural network inference sub-engines that can process different batch sizes. The sub-engines in the multi-granularity inference engine share the parameters, such as the weights, of the same deep neural network model; the batch sizes and the number of sub-engines can be configured and tuned by the application developer, and when the developer does not specify them, they are set by the elastic batch scheduling module by default. The multi-granularity inference engine can execute several sub-engines on the graphics processor in parallel at the same time; the sub-engines influence one another and jointly determine the processing time of an inference request.
In this embodiment, the elastic batch scheduling module 130 is configured to acquire the maximum parallel batch size and the number of pending inference requests, organize the pending inference request data as needed into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, and wake the sub-engine in the deep neural network inference engine module 120 corresponding to the batch size, so that the sub-engine processes the pending inference requests.
In this embodiment, the maximum parallel batch size is obtained from the longest acceptable response time for processing an inference request.
Specifically, in this embodiment, as shown in fig. 6, the elastic batch scheduling module 130 includes an idle engine management unit 131, a batch data processing unit 132, and an engine scheduling unit 133.
The idle engine management unit 131 is configured to detect whether an idle sub-engine exists in the engine queue of the deep neural network inference engine module 120 and, if so, to select the number of schedulable pending inference requests from the video memory buffer pool.
The batch data processing unit 132 is configured to organize the pending inference request data as needed into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size, and the number of pending inference requests.
The engine scheduling unit 133 is configured to schedule, from among the idle sub-engines, the sub-engine corresponding to the batch size.
The elastic batch scheduling module 130 is responsible for coordinating the video memory buffer pool module 110 and the inference engine module 120. To meet the quality-of-service requirement of the whole elastic batch processing system, the application developer sets the longest acceptable response time for the inference request service, from which the system derives the largest acceptable parallel batch size. According to this maximum parallel batch size and the number of pending inference requests currently in the video memory buffer pool, the elastic batch scheduling module 130 organizes the pending inference request data cached in the pool into batches of appropriate size as needed. Once the data is prepared, the elastic batch scheduling module 130 wakes the sub-engine of the deep neural network inference engine module 120 corresponding to the batch size, and this sub-engine is responsible for processing the inference requests of the batch. The scheduling algorithm used by the elastic batch scheduling module 130 is given as Algorithm 1 in Table 1 below.
TABLE 1 Elastic batch scheduling algorithm
(Algorithm 1 is reproduced only as an image, Figure BDA0002266228460000071, in the original publication.)
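Because Algorithm 1 survives only as an image, the loop below is a hedged reconstruction assembled from the surrounding prose (check idle sub-engines, check the buffer pool, select schedulable requests, dispatch, repeat); it is one plausible reading, not the patented algorithm. It reuses the hypothetical GpuBufferPool, MultiGranularityEngine and plan_dispatch helpers sketched earlier; none of these names appear in the patent.

```python
# Hedged reconstruction of the elastic batch scheduling loop.

import time

def scheduling_loop(pool, engine, max_parallel_batch, poll_interval_s=0.001):
    while True:
        idle = engine.idle_engines()
        if not idle:                                  # no idle sub-engine: poll
            time.sleep(poll_interval_s)
            continue
        if not pool.pending:                          # no queued request: poll
            time.sleep(poll_interval_s)
            continue
        # remaining budget = maximum parallel batch size minus the summed
        # batch sizes of the sub-engines that are currently working
        budget = max_parallel_batch - sum(
            e.batch_size for e in engine.sub_engines if e.busy)
        plan = plan_dispatch([e.batch_size for e in idle],
                             len(pool.pending), budget)
        available = list(idle)
        for size, count in plan:
            sub = next(e for e in available if e.batch_size == size)
            available.remove(sub)
            slots = pool.take_contiguous(count)
            sub.run(slots)                            # synchronous stand-in for an
            pool.release(slots)                       # asynchronous GPU launch
        # leftover schedulable requests wait for the next pass of the loop
```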
As shown in fig. 5 and fig. 6, to further illustrate the implementation principle, the overall execution flow of this example is described in detail below.
The process comprises the following steps:
1) The user constructs an inference model: this belongs to the multi-granularity inference engine module 120. The user constructs an inference model according to his own inference requests.
2) The user specifies the quality-of-service requirement: this belongs to the elastic batch scheduling module 130. The user specifies, according to his own needs, the longest acceptable processing delay of a deep neural network inference request as the quality-of-service requirement.
3) Setting the parameters required by the inference system: this belongs to the elastic batch scheduling module 130. The module calculates the maximum batch size that still meets the user-defined quality-of-service requirement; during subsequent scheduling, the elastic batch scheduler uses this parameter as the upper limit on the sum of the batch sizes of all working sub-engines and schedules the whole inference system against this limit. At the same time, the elastic batch scheduling module 130 sets the number of sub-engines in the multi-granularity inference engine, the batch size of each sub-engine, and the size of the buffer pool in the video memory buffer pool module 110 (a configuration sketch follows this list).
4) Receiving inference requests: this belongs to the video memory buffer pool module 110. Inference requests are received and cached in the buffer pool on the graphics processor side.
5) Checking the idle sub-engine queue status: this belongs to the elastic batch scheduling module 130. The status of the idle sub-engine queue in the multi-granularity inference engine is checked; if there is no idle sub-engine, the queue is polled until one becomes available; if there is an idle sub-engine, the flow proceeds to the next step and attempts scheduling.
6) Checking the buffer pool status: this belongs to the elastic batch scheduling module 130. The status of the video memory buffer pool is checked; if there is no pending request, the flow jumps back to 5) to check the idle sub-engine queue status; if there are requests to be scheduled, the flow proceeds to the next step.
7) Selecting schedulable requests from the video memory buffer pool: this belongs to the elastic batch scheduling module 130. To guarantee that the data of different requests is stored contiguously, the elastic batch scheduling module 130 selects from the video memory buffer pool the largest set of requests that can be scheduled from the existing storage locations.
8) Scheduling an appropriate sub-engine to process the requests: this belongs to the elastic batch scheduling module 130 and the multi-granularity inference engine. The elastic batch scheduling module 130 organizes the inference request data into a batch of appropriate size according to the idle sub-engines in the multi-granularity inference engine and invokes one of those sub-engines to process the data.
9) Schedulable requests and idle sub-engines remain: this belongs to the elastic batch scheduling module 130. The schedulable requests selected in 7) may not all be processed by the single sub-engine scheduled in 8). If schedulable requests remain and idle sub-engines remain, the flow jumps to 8) to continue scheduling; otherwise, it jumps to 5) and the scheduling process repeats.
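The configuration step 3) can be illustrated with the short sketch below, which derives a parameter set from the QoS-based maximum parallel batch size. Power-of-two sub-engine batch sizes and a buffer pool sized at a small multiple of that maximum are illustrative defaults assumed here; the patent leaves these choices to the developer or to the elastic batch scheduling module.

```python
# Sketch of step 3): deriving inference-system parameters from the
# QoS-derived maximum parallel batch size b_max.

def derive_config(b_max: int, pool_factor: int = 4) -> dict:
    sizes = []
    b = 1
    while b <= b_max:
        sizes.append(b)
        b *= 2
    return {
        "sub_engine_batch_sizes": sizes,       # one sub-engine per size
        "max_parallel_batch": b_max,           # cap on summed working batch sizes
        "buffer_pool_slots": pool_factor * b_max,
    }

print(derive_config(16))
# {'sub_engine_batch_sizes': [1, 2, 4, 8, 16], 'max_parallel_batch': 16,
#  'buffer_pool_slots': 64}
```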
Embodiments of the present invention also provide an electronic device, such as, but not limited to, a medical examination device or an image processing device. As shown in fig. 7, the electronic device includes a processor 1101 and a memory 1102; the memory 1102 is connected to the processor 1101 through a system bus and is used for storing a computer program, and the processor 1101 is used for running the computer program so that the electronic device executes the inference engine method based on elastic batch processing. The inference engine method based on elastic batch processing has been described in detail above and is not repeated here.
The inference engine method based on elastic batch processing can be applied to many types of electronic devices. The electronic device is, for example, a controller, such as an ARM (Advanced RISC Machines) controller, an FPGA (Field Programmable Gate Array) controller, an SoC (System on Chip) controller, a DSP (Digital Signal Processing) controller, or an MCU (Micro Controller Unit) controller. The electronic device may also be, for example, a computer that includes components such as memory, a memory controller, one or more processing units (CPUs), a peripheral interface, RF circuitry, audio circuitry, speakers, a microphone, an input/output (I/O) subsystem, a display screen, other output or control devices, and external ports; such computers include, but are not limited to, personal computers such as desktop computers, notebook computers, tablet computers, smart phones, smart televisions, and personal digital assistants (PDAs). In other embodiments, the electronic device may also be a server; the server may be arranged on one or more physical servers according to factors such as function and load, or may be formed by a distributed or centralized server cluster, which is not limited in this embodiment.
In an actual implementation, the electronic device is, for example, a device running an Android or iOS operating system, or an operating system such as Palm OS, Symbian, BlackBerry OS, or Windows Phone.
In an exemplary embodiment, the electronic device may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, micro-controllers, microprocessors, cameras, or other electronic components for performing the inference engine method based on elastic batch processing described above.
It should be noted that the above-mentioned system bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus. The communication interface is used for realizing communication between the database access system and other devices (such as a client, a read-write library and a read-only library). The Memory may include a Random Access Memory (RAM), and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The Processor 1101 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
Furthermore, this embodiment also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the inference engine method based on elastic batch processing. The inference engine method based on elastic batch processing has been described in detail above and is not repeated here.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
In summary, the present invention establishes a system comprising three parts, the video memory buffer pool module 110, the multi-granularity inference engine module 120, and the elastic batch scheduling module 130, thereby realizing a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing that minimizes the response latency and maximizes the throughput of the whole engine system without adding hardware devices such as graphics processors. The results of the invention can support the deployment of emerging deep neural network technology; they can be used to build a commercially meaningful neural network inference task scheduling system based on the elastic batch processing technique, and they simplify the optimization of the neural network inference scheduling service for users. The invention therefore effectively overcomes various defects in the prior art and has high industrial value.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. An inference engine method based on elastic batch processing, characterized in that the inference engine method based on elastic batch processing comprises:
acquiring inference request data input by a user;
acquiring the maximum parallel batch size and the number of pending inference requests;
organizing the pending inference request data as needed into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, waking a sub-engine in a deep neural network inference engine module corresponding to the batch size, and having the sub-engine process the pending inference requests.
2. The inference engine method based on elastic batch processing according to claim 1, further comprising:
constructing a deep neural network inference model;
configuring sub-engines in the deep neural network inference engine module according to the deep neural network inference model;
wherein the deep neural network inference engine module comprises a plurality of sub-engines that process different batch sizes, and the sub-engines share the parameters of the deep neural network inference model.
3. The inference engine method based on elastic batch processing according to claim 1, further comprising:
storing the inference request data input by the user into a video memory buffer pool consisting of a contiguous video memory address space;
wherein the contiguous video memory address space is divided into video memory blocks of equal size, and each video memory block stores the data of one pending inference request.
4. The inference engine method based on elastic batch processing according to claim 1, wherein the maximum parallel batch size is obtained according to the longest acceptable response time for processing an inference request.
5. The inference engine method based on elastic batch processing according to claim 3, wherein organizing the pending inference request data as needed into batches of appropriate size and waking the sub-engine in the deep neural network inference engine module corresponding to the batch size comprises:
detecting whether an idle sub-engine exists in an engine queue of the deep neural network inference engine module, and if so, selecting the number of schedulable pending inference requests from the video memory buffer pool;
organizing the pending inference request data into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size and the number of pending inference requests;
and scheduling, from among the idle sub-engines, the sub-engine corresponding to the batch size.
6. The inference engine method based on elastic batch processing according to claim 5, wherein if a single sub-engine cannot process all the pending inference requests at once, sub-engines corresponding to the remaining pending inference requests are further obtained from the idle sub-engines and continue to process the remaining requests.
7. An inference engine system based on elastic batch processing, characterized in that the inference engine system based on elastic batch processing comprises:
a video memory buffer pool module, used for acquiring and storing the inference request data input by a user;
an inference engine module, used for configuring a plurality of sub-engines in a deep neural network inference engine module;
and an elastic batch scheduling module, used for acquiring the maximum parallel batch size and the number of pending inference requests, organizing the pending inference request data as needed into batches of appropriate size according to the maximum parallel batch size and the number of pending inference requests, and waking a sub-engine in the deep neural network inference engine module corresponding to the batch size, so that the sub-engine processes the pending inference requests.
8. The inference engine system based on elastic batch processing according to claim 7, wherein the maximum parallel batch size is obtained according to the longest acceptable response time for processing an inference request.
9. The inference engine system based on elastic batch processing according to claim 7, wherein the elastic batch scheduling module comprises:
an idle engine management unit, used for detecting whether an idle sub-engine exists in an engine queue of the deep neural network inference engine module and, if so, selecting the number of schedulable pending inference requests from the video memory buffer pool;
a batch data processing unit, used for organizing the pending inference request data as needed into batches of appropriate size according to the idle sub-engines, the maximum parallel batch size and the number of pending inference requests;
and an engine scheduling unit, used for scheduling, from among the idle sub-engines, the sub-engine corresponding to the batch size.
10. An electronic device comprising a processor and a memory, the memory storing program instructions, the processor executing the program instructions to implement the inference engine method based on elastic batch processing of any one of claims 1 to 6.
CN201911088741.9A 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment Active CN110837419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911088741.9A CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911088741.9A CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Publications (2)

Publication Number Publication Date
CN110837419A true CN110837419A (en) 2020-02-25
CN110837419B CN110837419B (en) 2023-05-19

Family

ID=69574710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911088741.9A Active CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Country Status (1)

Country Link
CN (1) CN110837419B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342808A (en) * 2021-05-26 2021-09-03 电子科技大学 Knowledge graph inference engine architecture system based on electromechanical equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102104626A (en) * 2009-12-22 2011-06-22 英特尔公司 Systems and methods for energy efficient load balancing at server clusters
CN107273195A (en) * 2017-05-24 2017-10-20 上海艾融软件股份有限公司 A kind of batch processing method of big data, device and computer system
CN109919315A (en) * 2019-03-13 2019-06-21 科大讯飞股份有限公司 A kind of forward inference method, apparatus, equipment and the storage medium of neural network
CN110019071A (en) * 2017-11-15 2019-07-16 北大方正集团有限公司 Data processing method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102104626A (en) * 2009-12-22 2011-06-22 英特尔公司 Systems and methods for energy efficient load balancing at server clusters
CN107273195A (en) * 2017-05-24 2017-10-20 上海艾融软件股份有限公司 A kind of batch processing method of big data, device and computer system
CN110019071A (en) * 2017-11-15 2019-07-16 北大方正集团有限公司 Data processing method and device
CN109919315A (en) * 2019-03-13 2019-06-21 科大讯飞股份有限公司 A kind of forward inference method, apparatus, equipment and the storage medium of neural network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342808A (en) * 2021-05-26 2021-09-03 电子科技大学 Knowledge graph inference engine architecture system based on electromechanical equipment
CN113342808B (en) * 2021-05-26 2022-11-08 电子科技大学 Knowledge graph inference engine architecture system based on electromechanical equipment

Also Published As

Publication number Publication date
CN110837419B (en) 2023-05-19

Similar Documents

Publication Publication Date Title
KR102197874B1 (en) System on chip including multi-core processor and thread scheduling method thereof
CN104794194B (en) A kind of distributed heterogeneous concurrent computational system towards large scale multimedia retrieval
US20140040532A1 (en) Stacked memory device with helper processor
CN109165728B (en) Basic computing unit and computing method of convolutional neural network
CN113157410A (en) Thread pool adjusting method and device, storage medium and electronic equipment
CN113590508B (en) Dynamic reconfigurable memory address mapping method and device
US10037225B2 (en) Method and system for scheduling computing
CN115686805A (en) GPU resource sharing method and device, and GPU resource sharing scheduling method and device
US11635904B2 (en) Matrix storage method, matrix access method, apparatus and electronic device
CN110908797B (en) Call request data processing method, device, equipment, storage medium and system
CN110837419B (en) Reasoning engine system and method based on elastic batch processing and electronic equipment
CN110738317A (en) FPGA-based deformable convolution network operation method, device and system
US11954518B2 (en) User-defined metered priority queues
US20120151145A1 (en) Data Driven Micro-Scheduling of the Individual Processing Elements of a Wide Vector SIMD Processing Unit
CN112084023A (en) Data parallel processing method, electronic equipment and computer readable storage medium
EP4035016A1 (en) Processor and interrupt controller therein
US20160147532A1 (en) Method for handling interrupts
CN112130977B (en) Task scheduling method, device, equipment and medium
US20210200584A1 (en) Multi-processor system, multi-core processing device, and method of operating the same
CN112766475A (en) Processing unit and artificial intelligence processor
US10073723B2 (en) Dynamic range-based messaging
CN117057411B (en) Large language model training method, device, equipment and storage medium
US11899551B1 (en) On-chip software-based activity monitor to configure throttling at a hardware-based activity monitor
CN113032154B (en) Scheduling method and device for virtual CPU, electronic equipment and storage medium
CN116360941A (en) Multi-core DSP-oriented parallel computing resource organization scheduling method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant