CN110837419B - Reasoning engine system and method based on elastic batch processing and electronic equipment


Info

Publication number
CN110837419B
Authority
CN
China
Prior art keywords
engine
batch
sub
reasoning
data
Prior art date
Legal status
Active
Application number
CN201911088741.9A
Other languages
Chinese (zh)
Other versions
CN110837419A (en)
Inventor
陈全
过敏意
崔炜皞
沈耀
姚斌
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201911088741.9A priority Critical patent/CN110837419B/en
Publication of CN110837419A publication Critical patent/CN110837419A/en
Application granted granted Critical
Publication of CN110837419B publication Critical patent/CN110837419B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F 9/00
    • G06F 2209/50 Indexing scheme relating to G06F 9/50
    • G06F 2209/5017 Task decomposition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an inference engine system and method based on elastic batch processing, and an electronic device. The elastic batch-processing-based inference engine method comprises the following steps: acquiring request data to be inferred that is input by a user; obtaining the maximum parallel batch size and the number of pending inference requests; and organizing the pending request data into batches of suitable size according to the maximum parallel batch size and the number of pending requests, waking up the sub-engine in a deep neural network inference engine module that corresponds to the batch, and processing the pending inference requests with that sub-engine. The invention minimizes the response latency and maximizes the throughput of the engine system without adding hardware devices such as additional graphics processors.

Description

Reasoning engine system and method based on elastic batch processing and electronic equipment
Technical Field
The invention relates to the technical field of processors, in particular to GPUs, and specifically to an inference engine system and method based on elastic batch processing and an electronic device.
Background
With the large-scale deployment of computationally intensive applications such as speech recognition, machine translation and personal assistants, mainstream private data centers and public cloud platforms have begun to use coprocessors such as GPUs in large numbers to cope with the insufficient computing power of traditional CPUs. GPUs were originally dedicated processors designed for graphics computation; because their parallelism is unmatched by conventional CPUs, more and more non-graphics applications have been migrated to GPU platforms to meet rapidly growing computing demands. Research has shown, however, that non-graphics applications often lack enough parallelism to fully utilize the hardware resources of a GPU, so hardware resources are wasted. Moreover, as GPU architectures and fabrication processes advance, more and more streaming multiprocessors (Streaming Multiprocessor, SM) are integrated into a single GPU, making the resource-waste problem even more prominent.
GPUs are now widely used to provide deep-learning-based online services, and how to improve their performance, including achieving lower response latency and higher system throughput, is a technical problem that those skilled in the art need to solve.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide an inference engine system, method and electronic device based on elastic batch processing, for improving the performance of GPU-based inference engines.
To achieve the above and other related objects, an embodiment of the present invention provides an elastic batch-based inference engine method, comprising: acquiring request data to be inferred that is input by a user; obtaining the maximum parallel batch size and the number of pending inference requests; and organizing the pending request data into batches of suitable size according to the maximum parallel batch size and the number of pending requests, waking up the sub-engine in a deep neural network inference engine module that corresponds to the batch, and processing the pending inference requests with that sub-engine.
In an embodiment of the present application, the reasoning engine method based on elastic batch processing further includes: constructing a deep neural network reasoning model; configuring sub-engines in the deep neural network reasoning engine module according to the deep neural network reasoning model; the deep neural network reasoning engine module comprises a plurality of sub-engines for processing different batch data sizes, and each sub-engine shares parameters of the deep neural network reasoning model.
In an embodiment of the present application, the elastic batch-based inference engine method further includes: storing the request data to be inferred input by the user into a video memory buffer pool formed by a continuous video memory address space; the continuous video memory address space is cut into video memory blocks of the same size, and each video memory block is used for storing the data of one request to be inferred.
In one embodiment of the present application, the maximum number of parallel batches is obtained based on the longest response time for processing the inference request.
In an embodiment of the present application, the organizing the to-be-processed reasoning request data into batch data with a suitable batch size as required, and waking up a sub-engine corresponding to the batch data in the deep neural network reasoning engine module includes: detecting whether an idle sub-engine exists in an engine queue of a deep neural network reasoning engine module, and if so, selecting the number of schedulable requests to be reasoning from the video memory buffer pool; organizing the to-be-processed reasoning request data into batch processing data with proper batch processing size according to the idle sub-engine, the maximum parallel batch processing quantity and the to-be-reasoning request quantity; scheduling a sub-engine corresponding to the size of the batch data from the idle sub-engines.
In an embodiment of the present application, if a single sub-engine cannot process all of the pending inference requests in one pass, sub-engines corresponding to the remaining pending requests are further obtained from the idle sub-engines, and the remaining requests continue to be processed.
The embodiment of the invention also provides an reasoning engine system based on elastic batch processing, which comprises: the video memory buffer pool module is used for acquiring and storing to-be-inferred request data input by a user; the reasoning engine module is used for configuring a plurality of sub-engines in the deep neural network reasoning engine module; the elastic batch processing scheduling module is used for acquiring the maximum parallel batch processing quantity and the quantity of the to-be-inferred requests, organizing the to-be-inferred request data into batch processing data with proper batch processing size according to the maximum parallel batch processing quantity and the quantity of the to-be-inferred requests, and waking up sub-engines corresponding to the batch processing data in the deep neural network inference engine module, wherein the sub-engines process the to-be-inferred requests.
In one embodiment of the present application, the maximum number of parallel batches is obtained based on the longest response time for processing the inference request.
In an embodiment of the present application, the flexible batch scheduling module includes: the idle engine management unit is used for detecting whether idle sub-engines exist in an engine queue of the deep neural network reasoning engine module, and if so, the number of schedulable to-be-reasoning requests is selected from the video memory buffer pool; the batch processing data processing unit is used for organizing the to-be-processed reasoning request data into batch processing data with proper batch processing size according to the idle sub-engine, the maximum parallel batch processing quantity and the to-be-reasoning request quantity; and the engine scheduling unit is used for scheduling the sub-engine corresponding to the size of the batch processing data from the idle sub-engines.
Embodiments of the present invention also provide an electronic device comprising a processor and a memory, the memory storing program instructions that are executed by the processor to implement the flexible batch-based inference engine method as described above.
As described above, the reasoning engine system, the method and the electronic device based on the elastic batch processing have the following beneficial effects:
1. The invention establishes a system comprising a video memory buffer pool module, a multi-granularity inference engine module and an elastic batch scheduling module, and realizes a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing, minimizing the response latency and maximizing the throughput of the whole engine system without adding hardware devices such as additional graphics processors.
2. The results of the invention support the practical deployment of emerging deep neural network technology, guide the construction of commercially meaningful neural network inference task scheduling systems based on elastic batching, and simplify the optimization of neural network inference scheduling services for users.
Drawings
FIG. 1 is a schematic diagram of the overall flow of the inference engine method based on elastic batch processing of the present invention.
FIG. 2 is a flow chart of a scheduling sub-engine in the flexible batch-based inference engine method of the present invention.
Fig. 3 and 4 are schematic diagrams illustrating embodiments of the flexible batch-based reasoning engine method of the present invention.
Fig. 5 is a block diagram showing the overall schematic of the reasoning engine system based on elastic batch processing of the present invention.
FIG. 6 is a schematic block diagram of an elastic batch scheduling module in the elastic batch based reasoning engine system of the present invention.
Fig. 7 is a schematic structural diagram of an electronic terminal according to an embodiment of the present application.
Description of element reference numerals
100. Reasoning engine system based on elastic batch processing
110. Video memory buffer pool module
120. Reasoning engine module
130. Elastic batch processing scheduling module
131. Idle engine management unit
132. Batch processing data processing unit
133. Engine scheduling unit
1101. Processor
1102. Memory device
S110 to S130 steps
S131 to S134 steps
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure herein, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details of this description may be modified or changed in various ways without departing from the spirit and scope of the present invention. It should be noted that, in the absence of conflict, the following embodiments and the features in the embodiments may be combined with one another.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
The embodiment aims to provide an reasoning engine system, a reasoning engine method and electronic equipment based on elastic batch processing, which are used for improving the performance of a GPU processor engine.
This embodiment aims to design and implement a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing. It provides three modules: a video memory buffer pool on the graphics processor side, a multi-granularity deep neural network inference engine, and an elastic batch scheduler. These respectively decouple computation from data transfer, execute batch tasks of different sizes in parallel, and elastically organize batches of different sizes, so that application developers can conveniently deploy deep neural networks and obtain better performance, namely lower response latency and higher system throughput, with few parameter configurations.
The principle and implementation of the elastic batch-based inference engine system, method and electronic device of the present embodiment will be described in detail below, so that those skilled in the art may understand the elastic batch-based inference engine system, method and electronic device of the present invention without creative effort.
As shown in fig. 1, the embodiment provides an inference engine method based on elastic batch processing, which includes the following steps:
step S110, obtaining to-be-inferred request data input by a user;
step S120, obtaining the maximum parallel batch processing quantity and the quantity of the to-be-inferred requests;
and step S130, organizing the data of the to-be-processed reasoning request into batch processing data with proper batch processing size according to the maximum parallel batch processing quantity and the quantity of the to-be-reasoning request, waking up a sub-engine corresponding to the batch processing data in a deep neural network reasoning engine module, and processing the to-be-processed reasoning request by the sub-engine.
The following describes the steps S110 to S130 of the inference engine method based on elastic batch processing in this embodiment in detail.
Step S110, obtaining the to-be-inferred request data input by the user.
In this embodiment, the elastic batch-based inference engine method further includes: storing the request data to be inferred input by the user into a video memory buffer pool formed by a continuous video memory address space; the continuous video memory address space is cut into video memory blocks of the same size, and each video memory block is used for storing the data of one request to be inferred.
Specifically, in this embodiment, the video memory buffer pool consists mainly of a segment of contiguous video memory address space allocated on the graphics processor side; this contiguous space is divided into video memory blocks of equal size, each storing the input data of one pending inference request (i.e. the request data to be inferred). Pending inference requests are cached in the graphics processor's video memory, and the input data of several different inference requests can be organized into batch input data.
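As an illustration of this buffer-pool design, the sketch below models a pool of equally sized slots carved out of one contiguous allocation, one slot per pending request. It is a hypothetical Python approximation: the class and method names are the editor's, and an ordinary bytearray stands in for the GPU video memory that a real implementation would allocate with CUDA.

class RequestBufferPool:
    """Sketch of a buffer pool: one contiguous region split into equal slots,
    each slot holding the input data of a single pending inference request."""

    def __init__(self, slot_bytes: int, num_slots: int):
        self.slot_bytes = slot_bytes
        self.num_slots = num_slots
        # Contiguous backing store; stands in for a region allocated in GPU video memory.
        self.buffer = bytearray(slot_bytes * num_slots)
        self.free_slots = list(range(num_slots))   # indices of unused slots
        self.pending = []                          # slot indices awaiting scheduling

    def put_request(self, data: bytes) -> int:
        """Cache one request's input data into the next free slot."""
        if not self.free_slots or len(data) > self.slot_bytes:
            raise RuntimeError("pool full or request too large")
        slot = self.free_slots.pop(0)
        offset = slot * self.slot_bytes
        self.buffer[offset:offset + len(data)] = data
        self.pending.append(slot)
        return slot

    def take_contiguous(self, max_count: int) -> list[int]:
        """Select up to max_count pending slots in address order, so that a batch
        reads from nearby locations in the pool."""
        selected = sorted(self.pending)[:max_count]
        for s in selected:
            self.pending.remove(s)
        return selected

    def release(self, slots: list[int]) -> None:
        """Return slots to the free list once their requests have been processed."""
        self.free_slots.extend(slots)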
Step S120, the maximum parallel batch processing number and the number of requests to be inferred are obtained.
In this embodiment, the maximum number of parallel batches is obtained according to the longest response time for processing the reasoning request.
To meet the quality-of-service requirement of the overall elastic batching system, the application developer needs to set the longest acceptable response time for inference request service, from which the acceptable maximum parallel batch size is derived; in other words, the user defines, according to their own needs, the longest acceptable processing delay of a deep neural network inference request as the quality-of-service requirement. In the subsequent scheduling process, this longest-response-time parameter determines an upper limit on the sum of the batch sizes of all working sub-engines, and the whole system is scheduled against that upper limit.
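One plausible way to derive the maximum parallel batch size from the developer-specified longest response time is sketched below, assuming an offline-profiled latency table; the function name and the example latency numbers are illustrative assumptions, not values taken from the patent.

def max_parallel_batch_size(latency_slo_ms: float,
                            profiled_latency_ms: dict[int, float]) -> int:
    """Return the largest total batch size whose measured processing latency
    still fits within the user-specified response-time bound (the QoS target).

    profiled_latency_ms maps a candidate batch size to the measured inference
    latency at that size; it would be obtained by profiling the model offline.
    """
    feasible = [b for b, t in profiled_latency_ms.items() if t <= latency_slo_ms]
    if not feasible:
        raise ValueError("no batch size satisfies the latency target")
    return max(feasible)

# Example: with a made-up latency profile, a 40 ms target allows a parallel batch size of 16.
profile = {1: 5.0, 2: 7.0, 4: 11.0, 8: 19.0, 16: 35.0, 32: 66.0}
MAX_PARALLEL = max_parallel_batch_size(40.0, profile)   # -> 16

This upper limit then bounds the sum of the batch sizes of all sub-engines that work at the same time, as described above.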
And step S130, organizing the data of the to-be-processed reasoning request into batch processing data with proper batch processing size according to the maximum parallel batch processing quantity and the quantity of the to-be-reasoning request, waking up a sub-engine corresponding to the batch processing data in a deep neural network reasoning engine module, and processing the to-be-processed reasoning request by the sub-engine.
Specifically, as shown in fig. 2, in this embodiment, the organizing the to-be-processed reasoning request data into batch data with a suitable batch size as required, and waking up a sub-engine corresponding to the batch data in the deep neural network reasoning engine module includes:
step S131, detecting whether an idle sub-engine exists in an engine queue of the deep neural network reasoning engine module, if yes, executing step S132, selecting the number of schedulable to-be-inferred requests from the video memory buffer pool, and if not, continuously detecting whether the idle sub-engine exists in the engine queue of the deep neural network reasoning engine module.
Checking the idle sub-engine queue status: checking the state of an idle sub-engine queue in the multi-granularity reasoning engine, and if no idle sub-engine exists, circularly inquiring the idle sub-engine queue until the idle sub-engine queue exists; if there is an idle sub-engine to proceed to the next step, the scheduling is attempted.
Step S133, organizing the to-be-processed reasoning request data into batch processing data with proper batch processing size according to the idle sub-engine, the maximum parallel batch processing quantity and the to-be-reasoning request quantity;
checking the buffer pool state: checking the state of the video memory buffer pool, and returning to continuously detect the state of the idle sub-engine queue if no request exists; if the request to be scheduled is available, selecting a schedulable request from the video memory buffer pool.
Selecting a schedulable request from a video memory buffer pool: in order to ensure the continuity of different request data storage positions, the most schedulable requests selected from the existing storage positions are scheduled from a video memory buffer pool.
Step S134, scheduling a sub-engine corresponding to the size of the batch data from the idle sub-engines.
Scheduling the appropriate sub-engine to process the request: the data is organized into batch processing data of proper size according to idle sub-engines in the multi-granularity reasoning engine, and the sub-engines are called to process the data.
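The step of organizing pending requests into a batch of a size suitable for an idle sub-engine can be pictured with the small helper below. This is an editor's sketch in Python; the function name, the example sub-engine sizes and the exact selection rule (the largest idle engine that fits the pending count and the remaining parallel budget) are assumptions consistent with the description, not the patent's published algorithm.

def choose_dispatch(idle_engine_sizes: list[int],
                    num_pending: int,
                    parallel_budget: int) -> int:
    """Pick the batch size of the idle sub-engine to wake next.

    idle_engine_sizes: batch sizes of currently idle sub-engines, e.g. [1, 2, 4, 8].
    num_pending:       number of requests waiting in the buffer pool.
    parallel_budget:   remaining headroom under the maximum parallel batch size
                       (the QoS-derived limit minus the sizes of busy engines).
    Returns 0 if nothing can be dispatched right now.
    """
    limit = min(num_pending, parallel_budget)
    candidates = [s for s in idle_engine_sizes if s <= limit]
    return max(candidates) if candidates else 0

# Example: 11 pending requests, budget 16, idle engines of size 1/2/4/8 -> wake the 8-engine;
# the remaining 3 requests wait for the next idle engine (e.g. the 2- and 1-engines).
print(choose_dispatch([1, 2, 4, 8], 11, 16))   # 8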
In this embodiment, if a single sub-engine cannot process all of the pending inference requests in one pass, sub-engines corresponding to the remaining pending requests are further obtained from the idle sub-engines, and the remaining requests continue to be processed.
That is, while there are schedulable requests and idle sub-engines: the selected schedulable requests may not all be processed at one time by the single sub-engine that was scheduled. If there are still schedulable requests and idle sub-engines, jump to step S134 and continue scheduling; otherwise, jump to step S131 and repeat the scheduling process.
In this embodiment, the reasoning engine method based on elastic batch processing further includes:
constructing a deep neural network reasoning model; and configuring sub-engines in the deep neural network reasoning engine module according to the deep neural network reasoning model.
The deep neural network reasoning engine module comprises a plurality of sub-engines for processing different batch data sizes, and each sub-engine shares parameters of the deep neural network reasoning model.
In this embodiment, the deep neural network inference engine module is a multi-granularity inference engine composed of several deep neural network inference sub-engines that can process different batch sizes. Each sub-engine in the multi-granularity inference engine shares the parameters, such as the weights, of the same deep neural network model. The batch size of each sub-engine and the number of sub-engines can be configured and tuned according to the intent of the application developer; when the application developer does not specify these parameters, the deep neural network inference engine module is given a preferred configuration by the elastic batch scheduling module. The multi-granularity inference engine can execute several sub-engines in parallel on the graphics processor at the same time; the sub-engines influence one another and jointly determine the processing time of the inference requests.
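To make the multi-granularity engine concrete, the following sketch declares several sub-engines with different fixed batch sizes that all reference one shared set of model weights. It is a minimal illustration assuming plain Python objects; the class names, the placeholder run method and the example batch sizes are the editor's, not the patent's.

from dataclasses import dataclass, field

@dataclass
class SubEngine:
    batch_size: int               # fixed batch size this sub-engine is built for
    weights: dict                 # shared reference to the model parameters
    busy: bool = False

    def run(self, batch: list) -> list:
        # Placeholder for an actual DNN forward pass executed on the GPU.
        assert len(batch) <= self.batch_size
        return [f"result-for-{item}" for item in batch]

@dataclass
class MultiGranularityEngine:
    weights: dict
    sub_engines: list = field(default_factory=list)

    @classmethod
    def build(cls, weights: dict, batch_sizes: list[int]) -> "MultiGranularityEngine":
        eng = cls(weights)
        # Every sub-engine holds the same weights object, so the model is stored once.
        eng.sub_engines = [SubEngine(b, weights) for b in sorted(batch_sizes)]
        return eng

    def idle(self) -> list[SubEngine]:
        return [e for e in self.sub_engines if not e.busy]

# Example configuration: sub-engines for batch sizes 1, 2, 4 and 8 sharing one weight set.
shared_weights = {"layer0": "...", "layer1": "..."}   # stand-in for real tensors
engine = MultiGranularityEngine.build(shared_weights, [1, 2, 4, 8])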
As shown in fig. 5, the present embodiment further provides an inference engine system 100 based on elastic batch processing, where the elastic batch-processing-based inference engine system 100 includes: the video memory buffer pool module 110, the inference engine module 120, and the elastic batch scheduling module 130.
That is, the reasoning engine system 100 based on elastic batch processing in this embodiment includes a video memory buffer pool at the graphics processor end, a multi-granularity deep neural network reasoning engine, and an elastic batch processing scheduler. The core of the embodiment is a software system for designing an elastic batch processing deep neural network reasoning engine.
In this embodiment, the video memory buffer pool module 110 is configured to obtain and store the to-be-inferred request data input by the user.
Specifically, the memory buffer pool module 110 is mainly composed of a segment of continuous memory address space allocated at the graphics processing end, where the segment of continuous address space is divided into equally sized memory blocks for storing the input data of each reasoning request to be executed. The module caches the reasoning request to be processed into the video memory of the graphic processor end, and can organize a plurality of different input data of the reasoning request into batch processing input data.
The inference engine module 120 is configured to configure a plurality of sub-engines in the deep neural network inference engine module 120.
The inference engine module 120 is a multi-granularity inference engine composed of several deep neural network inference sub-engines that can handle different batch sizes. Each sub-engine shares the parameters, such as the weights, of the same deep neural network model; the batch size of each sub-engine and the number of sub-engines can be configured and tuned according to the intent of the application developer, and when the developer does not specify these parameters, the module is given a preferred configuration by the elastic batch scheduling module. The multi-granularity inference engine can execute several sub-engines in parallel on the graphics processor at the same time; the sub-engines influence one another and jointly determine the processing time of the inference requests.
In this embodiment, the elastic batch scheduling module 130 is configured to obtain a maximum number of parallel batches and a number of to-be-inferred requests, organize the to-be-inferred request data into batch data with a suitable batch size according to the maximum number of parallel batches and the number of to-be-inferred requests, and wake up sub-engines corresponding to the batch data in the deep neural network inference engine module 120, where the sub-engines process the to-be-inferred requests.
In this embodiment, the maximum number of parallel batches is obtained according to the longest response time for processing the reasoning request.
Specifically, in this embodiment, as shown in fig. 6, the flexible batch scheduling module 130 includes: idle engine management unit 131, batch data processing unit 132, and engine scheduling unit 133.
The idle engine management unit 131 is configured to detect whether an idle sub-engine exists in an engine queue of the deep neural network reasoning engine module 120, and if so, select the number of schedulable to-be-reasoning requests from the video memory buffer pool.
The batch data processing unit 132 is configured to organize the to-be-processed reasoning request data into batch data with a suitable batch size according to the idle sub-engine, the maximum parallel batch number and the to-be-reasoning request number.
The engine scheduling unit 133 is configured to schedule a sub-engine corresponding to the size of the batch data from the idle sub-engines.
The elastic batch scheduling module 130 coordinates the load between the video memory buffer pool module 110 and the inference engine module 120. To meet the quality-of-service requirement of the overall elastic batching system, the application developer needs to set the longest acceptable response time for inference request service, from which the system derives the acceptable maximum parallel batch size. According to this maximum parallel batch size and the number of pending inference requests currently in the video memory buffer pool, the elastic batch scheduling module 130 organizes the pending request data cached in the buffer pool into batches of suitable size as needed. Once the data is prepared, the elastic batch scheduling module 130 wakes up the sub-engine of the deep neural network inference engine module 120 that corresponds to the batch size, and that sub-engine is responsible for processing the inference requests of the batch. The scheduling algorithm used by the elastic batch scheduling module 130 is shown in Algorithm 1 (Table 1) below.
Table 1. Elastic batch scheduling algorithm
(The algorithm is reproduced only as an image in the original publication; its steps correspond to the scheduling steps S131 to S134 above and to steps 5) through 9) of the execution flow below.)
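Because the algorithm itself appears only as an image in the original, the sketch below reconstructs the elastic batch scheduling loop from steps S131 to S134 above and steps 5) through 9) of the execution flow below. It is the editor's reading of that description, not the published pseudocode, and it reuses the hypothetical helpers sketched earlier (RequestBufferPool, MultiGranularityEngine, choose_dispatch).

import time

def elastic_batch_scheduler(pool, engine, max_parallel):
    """Reconstruction of the elastic batch scheduling loop (editor's sketch).

    pool:         RequestBufferPool holding pending request data
    engine:       MultiGranularityEngine with sub-engines of several batch sizes
    max_parallel: QoS-derived upper bound on the summed batch sizes of busy engines
    """
    while True:
        idle = engine.idle()
        if not idle:                       # step 5): wait for an idle sub-engine
            time.sleep(0.001)
            continue
        if not pool.pending:               # step 6): wait for requests in the buffer pool
            time.sleep(0.001)
            continue

        busy_total = sum(e.batch_size for e in engine.sub_engines if e.busy)
        budget = max_parallel - busy_total

        # Steps 7) to 9): keep dispatching while requests, idle engines and budget remain.
        while pool.pending and idle and budget > 0:
            size = choose_dispatch([e.batch_size for e in idle],
                                   len(pool.pending), budget)
            if size == 0:
                break
            sub = next(e for e in idle if e.batch_size == size)
            slots = pool.take_contiguous(size)     # step 7): pick schedulable requests
            sub.busy = True                        # step 8): wake the matching sub-engine
            batch = [bytes(pool.buffer[s * pool.slot_bytes:(s + 1) * pool.slot_bytes])
                     for s in slots]
            sub.run(batch)                         # in practice, an asynchronous GPU launch
            sub.busy = False
            pool.release(slots)
            idle.remove(sub)
            budget -= size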
As shown in figs. 3 and 4, to further explain the principle of this embodiment, the overall execution flow of this example is described in detail below.
The flow is as follows:
1) The user builds an inference model: this belongs to the multi-granularity inference engine module 120. The user builds an inference model according to their own inference requirements.
2) The user specifies the quality-of-service requirement: this belongs to the elastic batch scheduling module 130. The user defines, according to their own needs, the longest acceptable processing delay of a deep neural network inference request as the quality-of-service requirement.
3) Setting the parameters required by the inference system: this belongs to the elastic batch scheduling module 130. The elastic batch scheduling module 130 calculates the maximum batch size that satisfies the user-defined quality-of-service requirement; in the subsequent scheduling process, the elastic batch scheduler uses this parameter as the upper limit on the sum of the batch sizes of all working sub-engines and schedules the whole inference system accordingly. The elastic batch scheduling module 130 also sets the number of sub-engines in the multi-granularity inference engine, the batch size of each sub-engine, and the size of the buffer pool in the video memory buffer pool module 110.
4) Receiving an inference request: this belongs to the video memory buffer pool module 110. The inference request is received and cached in the buffer pool on the graphics processor side.
5) Checking the idle sub-engine queue state: this belongs to the elastic batch scheduling module 130. Check the state of the idle sub-engine queue in the multi-granularity inference engine; if there is no idle sub-engine, poll the queue until one appears; if there is an idle sub-engine, proceed to the next step and attempt scheduling.
6) Checking the buffer pool state: this belongs to the elastic batch scheduling module 130. Check the state of the video memory buffer pool; if there is no request, jump back to step 5) and check the idle sub-engine queue state; if there are requests to be scheduled, proceed to the next step.
7) Selecting schedulable requests from the video memory buffer pool: this belongs to the elastic batch scheduling module 130. To keep the storage locations of different requests' data contiguous, the elastic batch scheduling module 130 selects as many schedulable requests as possible from contiguous existing storage locations in the video memory buffer pool.
8) Scheduling a suitable sub-engine to process the requests: this belongs to the elastic batch scheduling module 130 and the multi-granularity inference engine. The elastic batch scheduling module 130 organizes the inference request data into a batch of suitable size according to the idle sub-engines in the multi-granularity inference engine and invokes one of those sub-engines to process the data.
9) While there are schedulable requests and idle sub-engines: this belongs to the elastic batch scheduling module 130. The schedulable requests selected in step 7) may not all be processed in one pass by the single sub-engine scheduled in step 8). If there are still schedulable requests and idle sub-engines, jump to step 8) and continue scheduling; otherwise, jump to step 5) and repeat the scheduling process.
Embodiments of the present invention also provide an electronic device, which is, for example but not limited to, a medical detection device or an image processing device. As shown in fig. 7, the electronic device includes a processor 1101 and a memory 1102; the memory 1102 is connected to the processor 1101 through a system bus, and they communicate with each other. The memory 1102 is used to store a computer program, and the processor 1101 is used to run the computer program so that the electronic device executes the elastic batch-based inference engine method. The elastic batch-based inference engine method has been described in detail above and is not repeated here.
The elastic batch-based inference engine method can be applied to various types of electronic devices. The electronic device is, for example, a controller, such as an ARM (Advanced RISC Machines) controller, an FPGA (Field Programmable Gate Array) controller, an SoC (System on Chip) controller, a DSP (Digital Signal Processing) controller, or an MCU (Microcontroller Unit) controller. The electronic device may also be, for example, a computer comprising a memory, a memory controller, one or more processing units (CPUs), peripheral interfaces, RF circuitry, audio circuitry, speakers, a microphone, an input/output (I/O) subsystem, a display screen, other output or control devices, and external ports; such computers include, but are not limited to, personal computers such as desktop computers, notebook computers, tablet computers, smart phones, smart televisions, and personal digital assistants (Personal Digital Assistant, PDA). In other embodiments, the electronic device may also be a server, where the server may be deployed on one or more physical servers according to factors such as function and load, or may be formed from a distributed or centralized server cluster, which is not limited in this embodiment.
In a practical implementation, the electronic device is a device running the Android operating system, the iOS operating system, or another operating system such as Palm OS, Symbian, BlackBerry OS, or Windows Phone.
In an exemplary embodiment, the electronic device may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, cameras or other electronic elements for performing the above-described flexible batch-based inference engine method.
It should be noted that the system bus mentioned above may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus. The system bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is drawn in fig. 7, but this does not mean that there is only one bus or one type of bus. The communication interface is used to enable communication between the database access system and other devices (e.g., clients, read-write libraries, and read-only libraries). The memory may comprise random access memory (Random Access Memory, RAM) and may also comprise non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor 1101 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
In addition, the embodiment also provides a computer readable storage medium, on which a computer program is stored, the computer program implementing the reasoning engine method based on elastic batch processing when being executed by a processor. The above description of the reasoning engine method based on elastic batch processing has been described in detail, and will not be repeated here.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by computer program related hardware. The aforementioned computer program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
In summary, the present invention establishes a system comprising three parts, the video memory buffer pool module 110, the multi-granularity inference engine module 120 and the elastic batch scheduling module 130, and realizes a low-latency, high-throughput deep neural network inference engine system based on elastic batch processing, so that the response latency is minimized and the throughput of the whole engine system is maximized without adding hardware devices such as additional graphics processors. The results of the invention support the practical deployment of emerging deep neural network technology, guide the construction of commercially meaningful neural network inference task scheduling systems based on elastic batching, and simplify the optimization of neural network inference scheduling services for users. The invention therefore effectively overcomes various shortcomings of the prior art and has high value for industrial application.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations that can be accomplished by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present invention.

Claims (8)

1. An elastic batch-based reasoning engine method, characterized in that the elastic batch-based reasoning engine method comprises the following steps:
acquiring to-be-inferred request data input by a user;
obtaining the maximum parallel batch processing quantity and the quantity of requests to be inferred;
organizing the data of the request to be inferred into batch data with proper batch size according to the maximum parallel batch number and the number of the request to be inferred, waking up a sub-engine corresponding to the batch data in a deep neural network inference engine module, and processing the request to be inferred by the sub-engine;
further comprises:
storing the request data to be inferred input by the user into a video memory buffer pool formed by a continuous video memory address space;
the continuous video memory address space is cut into video memory blocks with the same size, and each video memory block is used for storing the data of one request to be inferred;
the reasoning engine method based on elastic batch processing further comprises the following steps:
constructing a deep neural network reasoning model;
configuring sub-engines in the deep neural network reasoning engine module according to the deep neural network reasoning model;
the deep neural network reasoning engine module comprises a plurality of sub-engines for processing different batch data sizes, and each sub-engine shares parameters of the deep neural network reasoning model.
2. The flexible batch-based reasoning engine method of claim 1, wherein the maximum number of parallel batches is obtained based on a longest response time for processing a reasoning request.
3. The flexible batch-based reasoning engine method of claim 1, wherein organizing the request data to be reasoning into batch data of a suitable batch size as required and waking up sub-engines of a deep neural network reasoning engine module corresponding to the batch data comprises:
detecting whether an idle sub-engine exists in an engine queue of a deep neural network reasoning engine module, and if so, selecting the number of schedulable requests to be reasoning from the video memory buffer pool;
organizing the data of the request to be inferred into batch data with proper batch size according to the idle sub-engine, the maximum parallel batch number and the number of the request to be inferred;
scheduling a sub-engine corresponding to the size of the batch data from the idle sub-engines.
4. The method according to claim 3, wherein if a single sub-engine cannot process all of the requests to be inferred in one pass, sub-engines corresponding to the remaining requests to be inferred are further acquired from the idle sub-engines, and the remaining requests to be inferred continue to be processed.
5. An elastic batch-based reasoning engine system, comprising:
the video memory buffer pool module is used for acquiring and storing request data to be inferred that is input by a user, and is further used for storing the request data to be inferred input by the user into a video memory buffer pool formed by a continuous video memory address space, wherein the continuous video memory address space is cut into video memory blocks with the same size, and each video memory block is used for storing the data of one request to be inferred;
the reasoning engine module is used for configuring a plurality of sub-engines in the deep neural network reasoning engine module; also used for: constructing a deep neural network reasoning model; configuring sub-engines in the deep neural network reasoning engine module according to the deep neural network reasoning model; the deep neural network reasoning engine module comprises a plurality of sub-engines for processing different batch data sizes, and each sub-engine shares parameters of the deep neural network reasoning model;
the elastic batch processing scheduling module is used for acquiring the maximum parallel batch processing quantity and the quantity of the to-be-inferred requests, organizing the to-be-inferred request data into batch processing data with proper batch processing size according to the maximum parallel batch processing quantity and the quantity of the to-be-inferred requests, and waking up sub-engines corresponding to the batch processing data in the deep neural network inference engine module, wherein the sub-engines process the to-be-inferred requests.
6. An elastic batch based reasoning engine system as recited in claim 5, wherein the maximum number of parallel batches is obtained based on a longest response time for processing a reasoning request.
7. An elastic batch based inference engine system as defined in claim 5, wherein said elastic batch scheduling module comprises:
the idle engine management unit is used for detecting whether idle sub-engines exist in an engine queue of the deep neural network reasoning engine module, and if so, the number of schedulable to-be-reasoning requests is selected from the video memory buffer pool;
the batch processing data processing unit is used for organizing the data to be inferred to batch processing data with proper batch processing size according to the idle sub-engine, the maximum parallel batch processing quantity and the quantity of the requests to be inferred;
and the engine scheduling unit is used for scheduling the sub-engine corresponding to the size of the batch processing data from the idle sub-engines.
8. An electronic device, comprising a processor and a memory, the memory storing program instructions that are executed by the processor to implement the elastic batch-based inference engine method of any one of claims 1 to 4.
CN201911088741.9A 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment Active CN110837419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911088741.9A CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911088741.9A CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Publications (2)

Publication Number Publication Date
CN110837419A CN110837419A (en) 2020-02-25
CN110837419B true CN110837419B (en) 2023-05-19

Family

ID=69574710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911088741.9A Active CN110837419B (en) 2019-11-08 2019-11-08 Reasoning engine system and method based on elastic batch processing and electronic equipment

Country Status (1)

Country Link
CN (1) CN110837419B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342808B (en) * 2021-05-26 2022-11-08 电子科技大学 Knowledge graph inference engine architecture system based on electromechanical equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102104626A (en) * 2009-12-22 2011-06-22 英特尔公司 Systems and methods for energy efficient load balancing at server clusters
CN107273195A (en) * 2017-05-24 2017-10-20 上海艾融软件股份有限公司 A kind of batch processing method of big data, device and computer system
CN109919315A (en) * 2019-03-13 2019-06-21 科大讯飞股份有限公司 A kind of forward inference method, apparatus, equipment and the storage medium of neural network
CN110019071A (en) * 2017-11-15 2019-07-16 北大方正集团有限公司 Data processing method and device


Also Published As

Publication number Publication date
CN110837419A (en) 2020-02-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant