CN113487033A - Inference method and device with graphics processor as execution core - Google Patents

Inference method and device with graphics processor as execution core

Info

Publication number
CN113487033A
Authority
CN
China
Prior art keywords
inference
network interface
interface controller
graphics processor
command
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110874284.7A
Other languages
Chinese (zh)
Other versions
CN113487033B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Biren Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Biren Intelligent Technology Co Ltd filed Critical Shanghai Biren Intelligent Technology Co Ltd
Priority to CN202110874284.7A priority Critical patent/CN113487033B/en
Publication of CN113487033A publication Critical patent/CN113487033A/en
Application granted granted Critical
Publication of CN113487033B publication Critical patent/CN113487033B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multi Processors (AREA)
  • Controls And Circuits For Display Device (AREA)

Abstract

The invention relates to an inference method and device using a graphics processor as an execution core, wherein the inference method comprises the following steps: the graphics processor calls and executes the code of a model from a non-volatile storage space according to an inference command provided by a network interface controller, so as to perform parallel computation on raw data provided by the network interface controller to generate computation results; integrates the computation results to generate an inference result; and replies the inference result to the network interface controller. Because the network interface controller is integrated with the graphics processor as the core to complete the inference operation, no central processor needs to be involved.

Description

Inference method and device with graphics processor as execution core
Technical Field
The present invention relates to graphics processors, and more particularly, to an inference method and apparatus using a graphics processor as an execution core.
Background
Traditional cloud computing needs to transmit data to a central data center for processing and then transmit the computation results back to the user equipment, so the data center requires ever greater computing capacity and ever higher network bandwidth. To address this problem, Edge Computing has been proposed to reduce the load on the central data center and achieve efficient inference. Edge computing is a network computing architecture in which computation takes place as close as possible to the user equipment that provides the raw data, in order to reduce latency and bandwidth usage. Edge computing can be applied to artificial intelligence inference. To enable such application scenarios, the invention provides an inference method and apparatus using a graphics processor as an execution core.
Disclosure of Invention
In view of this, how to implement artificial intelligence inference applications is an important issue in edge computing.
The invention provides an inference method using a graphics processor as an execution core, which comprises the following steps: the graphics processor calls and executes the code of a model from a non-volatile storage space according to an inference command provided by a network interface controller, so as to perform parallel computation on raw data provided by the network interface controller to generate computation results; integrates the computation results to generate an inference result; and replies the inference result to the network interface controller.
The invention further provides another inference method using a graphics processor as an execution core, which comprises the following steps: the network interface controller receives a data packet from user equipment via a network and parses an inference request and raw data from the data packet; sends an inference command to the graphics processor according to the inference request; provides the raw data to the graphics processor, so that the graphics processor calls and executes the code of a model according to the inference command, performing parallel computation on the raw data to generate computation results and integrating the computation results to generate an inference result; receives the inference result from the graphics processor; and transmits an inference reply containing the inference result to the user equipment via the network.
The invention also provides an inference apparatus, comprising a computation unit and a command processor. The command processor is coupled to the computation unit and is configured to call and execute the code of a model from a non-volatile storage space according to an inference command provided by a network interface controller, so as to perform parallel computation on raw data provided by the network interface controller through the computation unit to generate computation results; integrate the computation results to generate an inference result; and reply the inference result to the network interface controller.
One advantage of the above embodiments is that the network interface controller is integrated with the graphics processor as the core to complete the inference operation, so that no central processor needs to be involved.
Another advantage of the above embodiments is that the graphics processor makes full use of its built-in communication channels to access the non-volatile storage space.
Other advantages of the present invention will be explained in more detail in conjunction with the following description and the accompanying drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application.
FIG. 1 is a diagram of a distributed computing system according to an embodiment of the invention.
Fig. 2 is a block diagram of an edge node according to some embodiments.
FIG. 3 is a diagram of an edge computing system according to an embodiment of the invention.
Fig. 4 is a block diagram of an edge node in accordance with some embodiments of the present invention.
Fig. 5 is a block diagram of an edge node in accordance with further embodiments of the present invention.
FIG. 6 is a flowchart of an inference method using a graphics processor as a core according to an embodiment of the present invention.
Fig. 7 is a block diagram of an edge node in accordance with further embodiments of the present invention.
Wherein the symbols in the drawings are briefly described as follows:
10: distributed computing system; 110: cloud data center; 131, 133: edge nodes; 151 to 156: user equipment; 210: network interface controller; 220: central processing unit; 225, 235: peripheral component interconnect express interfaces; 230: graphics processor; 250: memory; 310: network interface controller; 315: microcontroller unit; 330, 410, 510: graphics processors; 335: built-in interface; 340: non-volatile storage space; 350: processing unit; 360: memory; 370: network interface controller; 380: network; 411, 412: PCIe root ports; 413, 513: command processors; 415: memory; 416, 516: compute units; 430: NVMe/NVRAM; 511: shared memory; 512: ONFI; 530: NAND flash memory device; S610 to S670: method steps; 730: non-volatile memory.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings. In the drawings, the same reference numerals indicate the same or similar components or process flows.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of further features, integers, steps, operations, elements, components, and/or groups thereof.
The use of words such as "first," "second," "third," etc. in this disclosure is intended to modify a component in a claim and is not intended to imply a priority order, precedence relationship, or order between components or steps in a method.
It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is described as being "directly connected" or "directly coupled" to another element, there are no intervening elements present. Other words used to describe the relationship between components may also be interpreted in a similar manner, e.g., "between" versus "directly between," or "adjacent" versus "directly adjacent," etc.
Refer to FIG. 1. The distributed computing system 10 may include a cloud data center 110, edge nodes 131 and 133, and user devices 151 to 156. The cloud data center 110 has strong computing power and may include a server cluster for training various models on large amounts of data and allowing the edge nodes 131 and 133 to download the trained models. These models can be used in application fields such as image classification, object detection, and speech synthesis. For example, the image classification model may be ResNet-50, MobileNet, etc., the object detection model may be a Single Shot MultiBox Detector (SSD), YOLOv3/v5, etc., and the speech synthesis model may be Tacotron 2, etc. Although the invention is described with reference to the above models, it is not limited to them, and those skilled in the art can have the edge nodes 131 and 133 obtain other trained models from the cloud data center 110. Any of the edge nodes 131 and 133 may be implemented using a mainframe, a workstation, an industrial computer, a personal computer, or the like. Each edge node is located as close as possible to the user devices that provide the raw data and can serve multiple user devices; for example, edge node 131 may provide inference services to user devices 151 to 153 according to the trained models, while edge node 133 may provide inference services to user devices 154 to 156. Any of the user devices 151 to 156 may issue an inference request and transmit raw data (e.g., an image, a piece of text, etc.) to the corresponding edge node. The inference request may contain the specified model, how to operate the model, and other necessary information, requesting the edge node to perform a specific operation on the raw data with the specified model to generate an inference result. The edge node then replies the inference result (e.g., a classification result, voice data, etc.) to the user device that issued the inference request.
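For illustration only, the following Python sketch models the kind of information an inference request and an inference reply might carry; the field names (model_name, operation, raw_data, result) are hypothetical and are not fixed by this description.
    from dataclasses import dataclass

    @dataclass
    class InferenceRequest:
        model_name: str      # model specified by the user device, e.g. "ResNet-50"
        operation: str       # how the model is to be operated, e.g. "classify"
        raw_data: bytes      # image, text, or other raw data to be inferred

    @dataclass
    class InferenceReply:
        model_name: str
        result: dict         # e.g. classification labels and scores

    # Example: a user device packages an image for classification.
    request = InferenceRequest("ResNet-50", "classify", b"\x89PNG...")
    print(request.model_name, request.operation)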
In some embodiments of an edge node, referring to FIG. 2, a central processing unit (CPU) 220 serves as the host processor that controls execution of the whole operation, and a graphics processing unit (GPU) 230 serves as an accelerator that performs calculations using the model specified in the inference request to generate computation results. In detail, a non-volatile random access memory (NVRAM) 240 stores a plurality of models acquired from the cloud data center 110. The central processor 220 receives the inference command through a network interface controller (NIC) 210, reads the specified model from the non-volatile random access memory 240 through a peripheral component interconnect express (PCIe) interface 225 according to information carried in the inference command, and offloads a series of instructions to the graphics processor 230. The graphics processor 230 performs calculations according to the offloaded instructions and stores the computation results in the memory 250. The central processor 220 reads the computation results from the memory 250 to generate the inference result, and replies the inference result through the network interface controller 210 to the user device that issued the inference request. However, although the graphics processor 230 is responsible for the bulk of the model computation, the central processor 220 still needs to spend time and resources tracking the progress of the graphics processor 230 and waiting for its computation results, which reduces the utilization of the central processor 220. In addition, the graphics processor 230 is also equipped with a PCIe interface 235 that goes unused, leaving hardware idle.
In view of the above, the present invention provides an embodiment of an edge computing system, referring to FIG. 3, including any one of the user devices 151 to 156 and the corresponding edge node 131 or 133. The user device may be an electronic product such as a personal computer, a laptop PC, a tablet computer, a mobile phone, a digital camera, or a digital video camera, and comprises at least a processing unit 350, a memory 360, and a network interface controller 370 (also referred to as a network card). The processing unit 350, when loading and executing appropriate program code, performs the operations of an artificial intelligence application, and the memory 360 stores the variables, data tables, and management information required by the artificial intelligence application. During operation of the artificial intelligence application, the processing unit 350 may drive the network interface controller 370 to issue a management request to the edge node via the network 380, for example to request the edge node to update a model (such as switching the image classification model from ResNet-50 to MobileNet) or to report the model information currently in use, and may then obtain the requested model information, the execution result of the request, etc., from the edge node. The processing unit 350 may also drive the network interface controller 370 to issue an inference request to the edge node via the network 380, requesting the edge node to operate on the uploaded raw data using the specified model and generate an inference result, which may then be obtained from the edge node. The user device may communicate with the edge node using a specific communication protocol, such as the HyperText Transfer Protocol (HTTP), HyperText Transfer Protocol Secure (HTTPS), Wireless Application Protocol (WAP), and the like.
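For illustration, a minimal client-side sketch in Python is given below, assuming a hypothetical HTTP endpoint and JSON payload; the description does not prescribe the endpoint, the message encoding, or the field names.
    import json
    import urllib.request

    EDGE_NODE_URL = "http://edge-node.example/infer"      # hypothetical endpoint

    def send_inference_request(model_name, raw_data):
        """Package the specified model and raw data into an HTTP POST, one
        possible realization of the inference request of FIG. 3."""
        body = json.dumps({
            "model": model_name,
            "data": raw_data.hex(),                        # raw bytes carried as hex text
        }).encode("utf-8")
        req = urllib.request.Request(EDGE_NODE_URL, data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:          # needs a reachable edge node
            return json.loads(resp.read())

    # Example (requires a running edge node):
    # reply = send_inference_request("MobileNet", b"\x89PNG...")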
The edge node includes at least a network interface controller 310, a graphics processor 330, and a non-volatile storage space 340. Because the graphics processor 330 is well suited to the single instruction multiple data (SIMD) style of parallel operation used in artificial intelligence applications (for example, SIMD instructions, single instruction multiple thread (SIMT) techniques, and the like), the edge node treats the graphics processor 330 directly as the main processor rather than as an accelerator. The graphics processor 330 not only computes on the raw data using the model specified in the inference request to generate computation results, but also controls the execution flow of the entire operation. In some embodiments, no central processor of the edge node is involved in responding to inference requests. In detail, the network interface controller 310 includes a microcontroller unit (MCU) 315 that loads and executes appropriate program code to realize an offload engine. During operation of the offload engine, the microcontroller unit 315 may retrieve data packets received via the network 380 in accordance with a particular communication protocol, parse the information in the data packets to obtain the inference request and the raw data transmitted by the user device, and send an inference command and the raw data to the graphics processor 330, instructing the graphics processor 330 to compute on the raw data using a particular model and parameters to complete the specified inference operation. After the microcontroller unit 315 obtains the inference result from the graphics processor 330, the inference result is carried in an inference reply and the inference reply is transmitted to the user device that issued the inference request.
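As a non-authoritative illustration of the offload engine's control loop, the following Python simulation uses simple in-process stand-ins (FakeNetwork, FakeGPU) for the network and the graphics processor; the class names and message fields are assumptions made only for this sketch.
    class FakeNetwork:
        """Stand-in for network 380: hands out received packets, collects replies."""
        def __init__(self, packets):
            self.packets, self.replies = list(packets), []
        def receive(self):
            return self.packets.pop(0) if self.packets else None
        def send(self, reply):
            self.replies.append(reply)

    class FakeGPU:
        """Stand-in for graphics processor 330: pretends to run the whole inference."""
        def execute(self, command, raw_data):
            return {"model": command["model"], "result": f"processed {len(raw_data)} bytes"}

    def offload_engine(network, gpu):
        """Simplified control loop of microcontroller unit 315: parse each packet,
        hand the inference command and raw data to the GPU, and send the reply."""
        while True:
            packet = network.receive()
            if packet is None:
                break
            request, raw_data = packet["request"], packet["raw_data"]
            command = {"model": request["model"], "op": request.get("op", "infer")}
            result = gpu.execute(command, raw_data)   # the GPU performs the inference
            network.send({"request": request, "inference_result": result})

    net = FakeNetwork([{"request": {"model": "ResNet-50"}, "raw_data": b"\x00" * 1024}])
    offload_engine(net, FakeGPU())
    print(net.replies)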
The non-volatile storage space 340 may be implemented using a non-volatile flash memory device (NVMe), a non-volatile random access memory (NVRAM), NAND flash memory, or the like, and is used to store the code of the firmware or kernel binary that operates a model to complete inference, as well as the plurality of models acquired from the cloud data center 110. The edge node may be provided with a specially designed file system so that the graphics processor 330 can access data in the non-volatile storage space 340 directly through a built-in interface 335, where the built-in interface 335 may be a PCIe interface, an Open NAND Flash Interface (ONFI), or the like. The graphics processor 330 may load and execute the firmware or kernel binary code from the non-volatile storage space 340, which includes the control flow for performing inference operations. Under this control flow, the code of the specified model is loaded and executed from the non-volatile storage space 340 in response to an inference command received from the network interface controller 310, performing various parallel computations to generate the inference result. The graphics processor 330 then replies the inference result to the network interface controller 310. In addition to performing inference operations, the graphics processor 330 is also capable of performing other application tasks, including but not limited to linear and non-linear data transformations, database operations, big data operations, encoding and decoding of audio and video data, modeling operations, image rendering operations, and so on.
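The firmware control flow just described can be pictured with the minimal Python sketch below, in which MODEL_STORE stands in for the non-volatile storage space 340 and the per-chunk lambda stands in for real model code; both are placeholders chosen purely for illustration.
    MODEL_STORE = {
        # Placeholder "model code"; a real entry would be a trained network kernel.
        "ResNet-50": lambda chunk: sum(chunk) % 251,
    }

    def gpu_control_flow(inference_command, raw_data, reply):
        """Simplified firmware control flow of graphics processor 330: fetch the code
        of the specified model, run the (parallelizable) computation over the raw
        data, integrate the partial results, and reply to the network interface
        controller."""
        model = MODEL_STORE[inference_command["model"]]           # load model code
        chunks = [raw_data[i:i + 256] for i in range(0, len(raw_data), 256)]
        partial_results = [model(chunk) for chunk in chunks]      # per-chunk computation
        inference_result = {"score": sum(partial_results)}        # integrate results
        reply(inference_result)                                   # back to the NIC

    gpu_control_flow({"model": "ResNet-50"}, bytes(range(256)) * 4, print)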
FIG. 4 shows a block diagram of an edge node in accordance with some embodiments of the invention. The edge node contains the network interface controller 310 described above, a graphics processor 410, and an NVMe or NVRAM device 430. The NVMe/NVRAM 430 provides the non-volatile storage space and stores the code of the firmware, which includes the control flow for completing inference operations, and the code of the plurality of models obtained from the cloud data center 110.
The graphics processor 410 includes PCIe root ports (RP) 411 and 412, a command processor 413, a memory 415, and multiple compute units (CUs) 416. The command processor 413 contains a root complex compliant with the PCIe specification for connecting the network interface controller 310 and the NVMe/NVRAM 430 (which may be referred to as PCIe devices) through the PCIe root ports 411 and 412, respectively. The command processor 413 loads and executes the firmware code, including the control flow for completing inference operations, from the NVMe/NVRAM 430 through the PCIe root port 412 (which may be referred to as the second PCIe root port), and receives the inference command and the raw data from the network interface controller 310 through the PCIe root port 411 (which may be referred to as the first PCIe root port) during execution of the firmware code. The command processor 413 stores the raw data in the memory 415 and calls and executes the code of the specified model from the NVMe/NVRAM 430 through the PCIe root port 412 according to the inference command. During execution of the model code, the command processor 413 issues a plurality of computation codes and arguments to the CUs 416 to instruct the CUs 416 to complete specific parallel computations, the arguments including the source addresses of the raw data stored in the memory 415 and the destination addresses of the computation results to be stored in the memory 415. Each CU 416 reads raw data from the memory 415 according to the source address, performs the specified computation on the raw data according to the computation code, writes the computation result to the memory 415 according to the destination address, and informs the command processor 413 that the specific computation has been completed. The computations each CU 416 may perform include integer and floating-point addition and multiplication, compare operations, Boolean operations, bit shifts, algebraic functions (e.g., planar interpolation, trigonometric functions, exponential functions, logarithmic functions), and so forth. Under the management of the control flow for the inference operation, the command processor 413 integrates the computation results in the memory 415 to generate the inference result, and replies the inference result to the network interface controller 310 through the PCIe root port 411.
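A minimal sketch of this dispatch pattern is given below, using a Python thread pool to emulate the compute units and a bytearray to emulate memory 415; the buffer layout, chunk size, and placeholder kernel are assumptions made solely for the example.
    from concurrent.futures import ThreadPoolExecutor

    MEMORY = bytearray(4096)             # stands in for on-chip memory 415

    def compute_unit(kernel, src, length, dst):
        """One compute unit 416: read raw data at the source address, run the
        computation code, write the result at the destination address."""
        data = MEMORY[src:src + length]
        MEMORY[dst] = kernel(data) & 0xFF
        return dst                        # "computation completed" notification

    def command_processor(kernel, src_base, chunk, n_units, dst_base):
        """Command processor 413: issue computation code and arguments (source and
        destination addresses) to several compute units in parallel, then
        integrate their results into one inference result."""
        with ThreadPoolExecutor(max_workers=n_units) as units:
            jobs = [units.submit(compute_unit, kernel,
                                 src_base + i * chunk, chunk, dst_base + i)
                    for i in range(n_units)]
            done = [job.result() for job in jobs]     # wait for completion notices
        return sum(MEMORY[d] for d in done)           # integrated inference result

    MEMORY[0:1024] = bytes(range(256)) * 4            # raw data written beforehand
    print(command_processor(lambda d: sum(d) % 251, 0, 256, 4, 2048))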
In other embodiments of the edge node, those skilled in the art may modify the architecture of FIG. 4 to place a non-volatile memory inside the graphics processor 410, allowing the command processor 413 to access data in the non-volatile memory directly without going through the PCIe root port 412. Referring to FIG. 7, the graphics processor 410 includes a non-volatile memory 730 that stores the code of the firmware, including the control flow for completing inference operations, and the code of the plurality of models obtained from the cloud data center 110. The non-volatile memory 730 includes, but is not limited to, NVRAM. The command processor 413 loads and executes the firmware code, including the control flow for completing inference operations, directly from the non-volatile memory 730. For other technical details, refer to the corresponding descriptions of FIG. 4; they are not repeated here for brevity.
FIG. 5 shows a block diagram of an edge node according to further embodiments of the present invention. The edge node includes the network interface controller 310 described above, a graphics processor 510, and a NAND flash device 530. The NAND flash device 530 provides the non-volatile storage space and stores the code of the firmware and the code of the plurality of models obtained from the cloud data center 110. The network interface controller 310 may comprise a direct memory access (DMA) controller for storing inference commands, parameters, and raw data at specified addresses in the shared memory 511, and for reading inference results from a specified address in the shared memory 511.
The graphics processor 510 includes a shared memory 511, an ONFI 512, a command processor 513, and a plurality of compute units 516. The command processor 513 loads and executes the firmware code, including the control flow for completing inference operations, from the NAND flash device 530 through the ONFI 512, and receives the inference command and the raw data from specified addresses in the shared memory 511 during execution of the firmware code. The command processor 513 calls and executes the code of the specified model from the NAND flash device 530 through the ONFI 512 according to the inference command. During execution of the model code, the command processor 513 issues a plurality of computation codes and arguments to the CUs 516 to instruct the CUs 516 to complete specific parallel computations, the arguments including the source addresses of the raw data stored in the shared memory 511 and the destination addresses of the computation results to be stored in the shared memory 511. Each CU 516 reads raw data from the shared memory 511 according to the source address, performs the specified computation on the raw data according to the computation code, writes the computation result to the shared memory 511 according to the destination address, and notifies the command processor 513 that the specific computation has been completed. Each CU 516 may perform the computations described for the CUs 416. Under the management of the control flow for the inference operation, the command processor 513 integrates the computation results in the shared memory 511 to generate the inference result. The network interface controller 310 reads the inference result from the specified address in the shared memory 511.
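The shared-memory exchange between the network interface controller and the graphics processor can be sketched as a simple mailbox, as in the Python example below; the fixed offsets CMD_ADDR, DATA_ADDR, and RESULT_ADDR and the placeholder computation are assumed solely for illustration.
    # Fixed layout of shared memory 511 assumed for illustration only.
    CMD_ADDR, DATA_ADDR, RESULT_ADDR = 0, 64, 1024
    shared_memory = bytearray(2048)

    def nic_write(command: bytes, raw_data: bytes):
        """NIC-side DMA: place the inference command and the raw data at the
        addresses agreed with the graphics processor."""
        shared_memory[CMD_ADDR:CMD_ADDR + len(command)] = command
        shared_memory[DATA_ADDR:DATA_ADDR + len(raw_data)] = raw_data

    def gpu_serve():
        """GPU side: read the command and raw data from shared memory, compute,
        and write the inference result back at RESULT_ADDR."""
        command = bytes(shared_memory[CMD_ADDR:CMD_ADDR + 64]).rstrip(b"\x00")
        raw_data = shared_memory[DATA_ADDR:DATA_ADDR + 256]
        result = bytes([sum(raw_data) % 251])         # placeholder for the model named in command
        shared_memory[RESULT_ADDR:RESULT_ADDR + len(result)] = result

    def nic_read_result() -> bytes:
        """NIC side: fetch the inference result from the agreed address."""
        return bytes(shared_memory[RESULT_ADDR:RESULT_ADDR + 1])

    nic_write(b"infer:ResNet-50", bytes(range(256)))
    gpu_serve()
    print(nic_read_result()[0])                       # integrated result read by the NIC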
FIG. 6 is a flowchart of an inference method using a graphics processor as the execution core, implemented by the graphics processor 330, 410, or 510 in cooperation with the network interface controller 310, according to an embodiment of the invention. The detailed steps are as follows:
step S610: a command handler in the graphics processor 330 initializes a communication channel of the network interface controller 310 and a communication channel of the nonvolatile memory space 340. Referring also to the embodiment shown in FIG. 4, command processor 413 initializes PCIe root ports 411 and 412. Referring also to the embodiment shown in fig. 5, the command handler 513 initializes the ONFI 512 and configures space in the shared memory 511 for use by the network interface controller 310.
Step S620: the command processor in the graphics processor 330 loads and executes the code of the firmware from the non-volatile storage space 340. Referring also to the embodiment shown in FIG. 4, the command processor 413 loads and executes the code of the firmware from the NVMe/NVRAM 430 through the PCIe root port 412. Referring also to the embodiment shown in FIG. 5, the command processor 513 loads and executes the code of the firmware from the NAND flash device 530 through the ONFI 512.
Step S630: the microcontroller unit 315 in the network interface controller 310 loads and executes the code of the offload engine.
Step S640: the microcontroller unit 315, when executing the code of the offload engine, receives a data packet from the network 380, parses the inference request and the raw data from the data packet, and performs the corresponding offload processing. In the offload processing, and referring also to the embodiment shown in FIG. 4, the microcontroller unit 315 sends an inference command to the graphics processor 410 through the PCIe root port 411 upon receiving the inference request. In the offload processing, and referring also to the embodiment shown in FIG. 5, the microcontroller unit 315 stores the inference command at a specified address in the shared memory 511 upon receiving the inference request.
Step S650: the microcontroller unit 315, when executing the code of the offload engine, provides the raw data to the graphics processor 330. Referring also to the embodiment shown in FIG. 4, the microcontroller unit 315 provides the raw data to the graphics processor 410 through the PCIe root port 411. Referring also to the embodiment shown in FIG. 5, the microcontroller unit 315 stores the raw data at a specified address in the shared memory 511.
Step S660: the command processor in the graphics processor 330, when executing the code of the firmware, calls and executes the code of the specified model from the non-volatile storage space 340 according to the inference command provided by the network interface controller 310, performing parallel computation on the raw data provided by the network interface controller 310 to generate computation results, integrating the computation results to generate the inference result, and replying the inference result to the network interface controller 310. Referring also to the embodiment shown in FIG. 4, the command processor 413 calls and executes the code of the specified model from the NVMe/NVRAM 430 through the PCIe root port 412, performs parallel computation on the raw data through the CUs 416, and replies the inference result to the network interface controller 310 through the PCIe root port 411. Referring also to the embodiment shown in FIG. 5, the command processor 513 calls and executes the code of the specified model from the NAND flash device 530 through the ONFI 512, performs parallel computation on the raw data through the CUs 516, and the network interface controller 310 reads the inference result from the specified address in the shared memory 511.
Step S670: the microcontroller unit 315 transmits an inference reply to the user device that issued the inference request.
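Purely to make the ordering of steps S610 to S670 concrete, the following Python sketch runs stub functions in that order and records a trace; the step names and bodies are placeholders and do not reflect the actual hardware operations.
    # Hypothetical stubs mirroring steps S610 to S670; each body only records a
    # trace entry so that the ordering itself can be executed and inspected.
    trace = []

    def s610_init_channels():        trace.append("S610 init NIC and storage channels")
    def s620_load_gpu_firmware():    trace.append("S620 GPU loads firmware")
    def s630_load_offload_engine():  trace.append("S630 MCU loads offload engine")
    def s640_parse_and_command():    trace.append("S640 parse request, send inference command")
    def s650_provide_raw_data():     trace.append("S650 provide raw data to GPU")
    def s660_run_model_and_reply():  trace.append("S660 run model, integrate, reply result")
    def s670_send_inference_reply(): trace.append("S670 send inference reply to user device")

    for step in (s610_init_channels, s620_load_gpu_firmware, s630_load_offload_engine,
                 s640_parse_and_command, s650_provide_raw_data,
                 s660_run_model_and_reply, s670_send_inference_reply):
        step()
    print("\n".join(trace))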
One advantage of the above embodiments is that the network interface controller is integrated with the graphics processor as the core to complete the inference operation, so that no central processor needs to be involved.
Another advantage of the above embodiments is that the graphics processor makes full use of its built-in communication channels to access the non-volatile storage space.
All or part of the steps of the method of the present invention may be implemented by a computer program, such as a program kernel or a driver, and the other types of programs described above may be implemented in the same way. Those skilled in the art can write the methods of the embodiments of the present invention as program code, which is not described further for brevity. A computer program implemented according to the embodiments of the present invention can be stored in a suitable computer-readable storage medium, such as a DVD, a CD-ROM, a USB flash drive, or a hard disk, or can be deployed on a network server accessible via a network (e.g., the Internet or another suitable carrier).
Although the elements described above are included in FIGS. 3 to 5, it is not excluded that further elements may be used to achieve better technical results without departing from the spirit of the present invention. Further, although the flowchart of FIG. 6 is executed in the order specified, a person skilled in the art may modify the order of the steps to achieve the same effect without departing from the spirit of the invention, and therefore the invention is not limited to that order. In addition, a person skilled in the art may integrate several steps into one step, or perform additional steps before, after, or in parallel with these steps, and the present invention should not be limited thereby.
The above description presents only preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any person skilled in the art can make further modifications and variations without departing from the spirit and scope of the present invention; therefore, the scope of protection of the present invention should be determined by the claims of the present application.

Claims (22)

1. An inference method using a graphics processor as an execution core, comprising:
the graphics processor calls and executes code of a model from a non-volatile storage space according to an inference command provided by a network interface controller, so as to perform parallel computation on raw data provided by the network interface controller to generate computation results;
the graphics processor integrates the computation results to generate an inference result; and
the graphics processor replies the inference result to the network interface controller.
2. The inference method of claim 1, wherein the non-volatile storage space comprises at least one of a non-volatile flash memory and a non-volatile random access memory, and the graphics processor calls and executes the code of the model from the non-volatile flash memory or the non-volatile random access memory through a built-in peripheral component interconnect express root port.
3. The inference method of claim 1, wherein the non-volatile storage space comprises a NAND flash device from which the graphics processor invokes and executes code of the model through a built-in open NAND flash interface.
4. The inference method of claim 1, wherein the non-volatile memory space comprises a non-volatile memory disposed inside the graphics processor.
5. The inference method of claim 1, comprising:
the graphics processor initializes a first communication channel corresponding to the network interface controller and a second communication channel corresponding to the non-volatile storage space before invoking and executing code of the model.
6. The inference method of claim 5, wherein the first communication channel is a first peripheral component interconnect express root port, and the second communication channel is a second peripheral component interconnect express root port different from the first peripheral component interconnect express root port.
7. The inference method of claim 5, wherein the first communication channel is a shared memory of the graphics processor, and the second communication channel is an open NAND flash interface.
8. The inference method of claim 1, comprising:
the network interface controller receiving a data packet from a user device via a network and parsing an inference request and the raw data from the data packet;
the network interface controller sends the inference command to the graphics processor according to the inference request;
the network interface controller provides the raw data to the graphics processor;
the network interface controller receiving the inference result from the graphics processor; and
the network interface controller transmits an inference reply containing the inference result to the user device via the network.
9. The inference method of claim 8, wherein no central processor is involved in the execution of the inference method by the graphics processor and the network interface controller.
10. An inference method using a graphics processor as an execution core, comprising:
the network interface controller receives a data packet from a user device via a network and parses an inference request and raw data from the data packet;
the network interface controller sends an inference command to the graphics processor according to the inference request;
the network interface controller provides the raw data to the graphics processor, so that the graphics processor calls and executes code of a model according to the inference command, performing parallel computation on the raw data to generate computation results and integrating the computation results to generate an inference result;
the network interface controller receiving the inference result from the graphics processor; and
the network interface controller transmits an inference reply containing the inference result to the user device via the network.
11. The inference method of claim 10, wherein the network interface controller transmits the inference command and the raw data to the graphics processor, and/or receives the inference result from the graphics processor, through a peripheral component interconnect express root port of the graphics processor.
12. The inference method of claim 10, wherein the network interface controller stores the inference command and the raw data at a specified address in a shared memory of the graphics processor, and/or reads the inference result from a specified address in the shared memory.
13. An inference apparatus, comprising:
a computation unit; and
a command processor, coupled to the computation unit, configured to call and execute code of a model from a non-volatile storage space according to an inference command provided by a network interface controller, so as to perform parallel computation on raw data provided by the network interface controller through the computation unit to generate computation results; integrate the computation results to generate an inference result; and reply the inference result to the network interface controller.
14. The inference apparatus of claim 13, comprising:
a second peripheral component interconnect express root port coupled to the command processor and the non-volatile storage space;
wherein the non-volatile storage space comprises at least one of a non-volatile flash memory and a non-volatile random access memory, and the command processor is configured to call the code of the model from the non-volatile flash memory or the non-volatile random access memory through the second peripheral component interconnect express root port.
15. The inference apparatus of claim 14, comprising:
a first peripheral component interconnect express root port coupled to the command processor and the network interface controller, the command processor being configured to receive the inference command and the raw data from the network interface controller through the first peripheral component interconnect express root port;
wherein the command processor is configured to initialize the first peripheral component interconnect express root port and the second peripheral component interconnect express root port before calling and executing the code of the model.
16. The inference apparatus of claim 15, wherein the computation unit, the command processor, the first peripheral component interconnect express root port, and the second peripheral component interconnect express root port constitute a graphics processor.
17. The inference apparatus of claim 13, comprising:
an open NAND flash interface coupled to the command processor and the non-volatile storage space;
wherein the non-volatile storage space comprises a NAND flash device, and the command processor is configured to call the code of the model from the NAND flash device through the open NAND flash interface.
18. The inference apparatus of claim 17, comprising:
a shared memory coupled to the command processor and the network interface controller;
wherein the command processor is configured to initialize the open NAND flash interface and to configure space in the shared memory for use by the network interface controller before calling and executing the code of the model.
19. The inference apparatus of claim 18, wherein the computation unit, the command processor, the open NAND flash interface, and the shared memory constitute a graphics processor.
20. The inference apparatus of claim 13, comprising:
the network interface controller, configured to receive a data packet from a user device via a network and parse an inference request and the raw data from the data packet; send the inference command to the command processor according to the inference request; provide the raw data to the command processor; receive the inference result from the command processor; and transmit an inference reply containing the inference result to the user device via the network.
21. The inference apparatus of claim 20, wherein the computation unit and the command processor constitute a graphics processor, and no central processor is involved in the inference apparatus.
22. The inference apparatus of claim 13, wherein the non-volatile storage space comprises a non-volatile memory disposed inside the inference apparatus.
CN202110874284.7A 2021-07-30 2021-07-30 Reasoning method and device using graphic processor as execution core Active CN113487033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110874284.7A CN113487033B (en) 2021-07-30 2021-07-30 Reasoning method and device using graphic processor as execution core

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110874284.7A CN113487033B (en) 2021-07-30 2021-07-30 Reasoning method and device using graphic processor as execution core

Publications (2)

Publication Number Publication Date
CN113487033A true CN113487033A (en) 2021-10-08
CN113487033B CN113487033B (en) 2023-05-23

Family

ID=77944870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110874284.7A Active CN113487033B (en) 2021-07-30 2021-07-30 Reasoning method and device using graphic processor as execution core

Country Status (1)

Country Link
CN (1) CN113487033B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110392092A (en) * 2018-04-17 2019-10-29 三星电子株式会社 The network storage equipment being connect with network structure
CN111800443A (en) * 2019-04-08 2020-10-20 阿里巴巴集团控股有限公司 Data processing system and method, device and electronic equipment
CN111800281A (en) * 2019-04-08 2020-10-20 阿里巴巴集团控股有限公司 Network system, management and control method, device and storage medium
CN110532098A (en) * 2019-08-30 2019-12-03 广东星舆科技有限公司 The GPU method and system of service are provided
CN111147603A (en) * 2019-09-30 2020-05-12 华为技术有限公司 Method and device for networking reasoning service
CN112650981A (en) * 2019-10-10 2021-04-13 百度(美国)有限责任公司 Data processing accelerator and computer-implemented method executed by the same
CN111404770A (en) * 2020-02-29 2020-07-10 华为技术有限公司 Network device, data processing method, device, system and readable storage medium
CN111598137A (en) * 2020-04-24 2020-08-28 北京金山云网络技术有限公司 Method and device for providing reasoning service and electronic equipment
CN112671830A (en) * 2020-12-02 2021-04-16 武汉联影医疗科技有限公司 Resource scheduling method, system, device, computer equipment and storage medium
CN113112040A (en) * 2021-04-16 2021-07-13 广东美的厨房电器制造有限公司 Detection method, terminal, server, detection system and readable storage medium

Also Published As

Publication number Publication date
CN113487033B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
US8473715B2 (en) Dynamic accelerator reconfiguration via compiler-inserted initialization message and configuration address and size information
US20210096823A1 (en) Transpose operations using processing element array
US11003429B1 (en) Compile-time scheduling
US11809953B1 (en) Dynamic code loading for multiple executions on a sequential processor
JP6998991B2 (en) Information processing methods and equipment
US20210158131A1 (en) Hierarchical partitioning of operators
CN111194437A (en) Data processing offload using in-memory code execution
US11175919B1 (en) Synchronization of concurrent computation engines
CN110825435B (en) Method and apparatus for processing data
WO2021046102A1 (en) Flexible datapath offload chaining
US11562554B1 (en) Workload reduction for non-maximum suppression operation
US10922146B1 (en) Synchronization of concurrent computation engines
US20230039000A1 (en) Graph-based data multi-operation system
CN113487033B (en) Reasoning method and device using graphic processor as execution core
US11494326B1 (en) Programmable computations in direct memory access engine
WO2023056370A1 (en) Mixing sparsity compression
US11500802B1 (en) Data replication for accelerator
US20220318604A1 (en) Sparse machine learning acceleration
US11983128B1 (en) Multidimensional and multiblock tensorized direct memory access descriptors
US11748253B1 (en) Address generation for page collision prevention in memory regions
US11620120B1 (en) Configuration of secondary processors
US20240103813A1 (en) Compute engine with transpose circuitry
US11789859B1 (en) Address generation for page collision prevention
US11875247B1 (en) Input batching with serial dynamic memory access
US20240111528A1 (en) Programmable compute engine having transpose operations

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Patentee after: Shanghai Bi Ren Technology Co.,Ltd.

Country or region after: China

Address before: 201100 room 1302, 13 / F, building 16, No. 2388, Chenhang highway, Minhang District, Shanghai

Patentee before: Shanghai Bilin Intelligent Technology Co.,Ltd.

Country or region before: China

CP03 Change of name, title or address