TW202119255A - Inference system, inference method, electronic device and computer storage medium - Google Patents

Inference system, inference method, electronic device and computer storage medium

Info

Publication number
TW202119255A
Authority
TW
Taiwan
Prior art keywords
inference
reasoning
computing device
calculation model
model
Prior art date
Application number
TW109128235A
Other languages
Chinese (zh)
Inventor
林立翔
李鵬
游亮
龍欣
Original Assignee
香港商阿里巴巴集團服務有限公司
Priority date
Filing date
Publication date
Application filed by 香港商阿里巴巴集團服務有限公司
Publication of TW202119255A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/045 Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence

Abstract

Provided are an inference system, an inference method, an electronic device and a computer storage medium. The inference system comprises a first computing device and a second computing device that are connected to each other, with the first computing device being provided with an inference client, and the second computing device comprising an inference acceleration resource and an inference server, wherein the inference client is used for acquiring model information of a computing model for inference and data to be inferred, and for respectively sending the model information and said data to the inference server in the second computing device; and the inference server is used for loading and calling, by means of the inference acceleration resource, the computing model indicated by the model information, and performing, by means of the computing model, inference processing on said data, and feeding back the result of the inference processing to the inference client.

Description

Inference system, inference method, electronic device and computer storage medium

The embodiments of the present invention relate to the field of computer technology, and in particular to an inference system, an inference method, an electronic device, and a computer storage medium.

Deep learning generally comprises two parts: training and inference. The training part searches for and solves the optimal parameters of a model, while the inference part deploys the trained model in an online environment for actual use. Taking artificial intelligence as an example, once deployed, inference converts an input into a specific target output through neural-network computation, for example detecting objects in an image or classifying text, and is widely applied in vision, speech, recommendation and similar scenarios.

At present, most inference relies on hardware computing resources equipped with an inference acceleration card such as a GPU (Graphics Processing Unit). For example, in artificial-intelligence inference, one approach connects the GPU to the host computer through a PCIE (Peripheral Component Interconnect Express) slot. The pre-processing, post-processing and other business logic involved in inference are computed by the CPU, while the inference computation itself is sent to the GPU through the PCIE slot, forming a typical heterogeneous computing scenario. For example, the electronic device 100 shown in FIG. 1 is provided with both a CPU 102 and a GPU 104; the GPU 104 can be mounted on the motherboard 108 of the electronic device through a PCIE slot 106 and interacts with the CPU 102 through the motherboard wiring. In one inference pass, the CPU 102 first processes the relevant data or information and then sends the processed data or information to the GPU 104 through the PCIE slot 106; the GPU 104 performs inference processing on the received data or information using a computing model held on the GPU 104, and then returns the inference result to the CPU 102, which performs the corresponding subsequent processing.

However, this approach has the following problem: it requires a heterogeneous computing machine in which the CPU and GPU reside in the same chassis, and the CPU/GPU specifications of such a machine are fixed. This fixed CPU/GPU performance ratio restricts the deployment of applications that involve inference and makes it impossible to satisfy the needs of a wide range of inference scenarios.
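For orientation, the conventional single-machine flow that the background describes can be sketched in a few lines. This is a minimal illustration assuming PyTorch and a CUDA-capable GPU attached over PCIe; it is not part of the patent.

```python
# Conventional local flow: pre/post-processing on the CPU, model execution on a PCIe-attached GPU.
import torch

def local_inference(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    model = model.eval().to("cuda")          # model weights cross the PCIe bus to the GPU
    batch = image.unsqueeze(0).to("cuda")    # the input data crosses the same bus
    with torch.no_grad():
        output = model(batch)                # inference runs on the GPU
    return output.cpu()                      # result returned to the CPU for post-processing
```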

In view of this, the embodiments of the present invention provide an inference solution to solve some or all of the above problems.

According to a first aspect of the embodiments of the present invention, an inference system is provided, comprising a first computing device and a second computing device connected to each other, wherein the first computing device is provided with an inference client, and the second computing device is provided with an inference acceleration resource and an inference server. The inference client is used to obtain model information of a computing model for inference and data to be inferred, and to respectively send the model information and the data to be inferred to the inference server in the second computing device. The inference server is used to load and call, by means of the inference acceleration resource, the computing model indicated by the model information, to perform inference processing on the data to be inferred through the computing model, and to feed the result of the inference processing back to the inference client.

According to a second aspect of the embodiments of the present invention, an inference method is provided, comprising: obtaining model information of a computing model for inference and sending the model information to a target computing device, so as to instruct the target computing device to load the computing model indicated by the model information using the inference acceleration resource provided in the target computing device; obtaining data to be inferred and sending it to the target computing device, so as to instruct the target computing device to call the loaded computing model using the inference acceleration resource and perform inference processing on the data to be inferred through the computing model; and receiving the result of the inference processing fed back by the target computing device.

According to a third aspect of the embodiments of the present invention, another inference method is provided, comprising: obtaining model information of a computing model for inference sent by a source computing device, and loading the computing model indicated by the model information through an inference acceleration resource; obtaining data to be inferred sent by the source computing device, calling the loaded computing model using the inference acceleration resource, and performing inference processing on the data to be inferred through the computing model; and feeding the result of the inference processing back to the source computing device.

According to a fourth aspect of the embodiments of the present invention, an electronic device is provided, comprising a processor, a memory, a communication interface and a communication bus, the processor, the memory and the communication interface communicating with one another through the communication bus; the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the inference method of the second aspect, or the operations corresponding to the inference method of the third aspect.

According to a fifth aspect of the embodiments of the present invention, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, it implements the inference method of the second aspect or the inference method of the third aspect.

According to the inference solution provided by the embodiments of the present invention, inference processing is deployed on different first and second computing devices. The second computing device is provided with an inference acceleration resource and can perform the main inference processing through the computing model, while the first computing device can be responsible for the data processing before and after the inference processing. An inference client is deployed in the first computing device and an inference server is deployed in the second computing device; during inference, the two devices interact through the inference client and the inference server. The inference client first sends the model information of the computing model to the inference server, and the inference server loads the corresponding computing model using the inference acceleration resource; the inference client then sends the data to be inferred to the inference server, and after receiving it the inference server performs inference processing through the loaded computing model. This decouples the computing resources used for inference: inference processing through the computing model and the data processing outside it can be carried out on different computing devices, only one of which needs to be configured with an inference acceleration resource such as a GPU. A single electronic device no longer needs to contain both a CPU and a GPU, which effectively solves the problem that the fixed CPU/GPU specification of existing heterogeneous computing machines restricts the deployment of applications involving inference and prevents a wide range of inference scenarios from being satisfied.

In addition, when a user runs an application involving inference, the inference computation can be transparently forwarded through the inference client and the inference server to a remote device that has inference acceleration resources, and the interaction between the inference client and the inference server is imperceptible to the user. The business logic of the application and the user's habits for running inference therefore remain unchanged, inference is realized at low cost, and the user experience is improved.
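The summary describes a two-step exchange: the client first ships model information, then ships the data to be inferred and receives a result. The sketch below shows one possible shape for these messages; the class and field names (ModelInfo, InferenceRequest, InferenceResult) are illustrative assumptions, not terms defined by the patent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelInfo:
    """Identifies the computing model, e.g. by an ID or a checksum."""
    model_id: str
    checksum: Optional[str] = None   # may double as an integrity check (see the MD5 discussion below)

@dataclass
class InferenceRequest:
    """Data to be inferred, optionally restricted to one processing function of the model."""
    model_id: str
    payload: bytes                   # serialized input, e.g. an encoded image or text vector
    function: Optional[str] = None   # e.g. "COMPUTE" when only part of the model is needed

@dataclass
class InferenceResult:
    model_id: str
    output: bytes                    # serialized inference output returned to the client
```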

In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, the technical solutions are described below clearly and completely in conjunction with the drawings of the embodiments. The described embodiments are obviously only a part, rather than all, of the embodiments of the present invention; based on them, all other embodiments obtained by those of ordinary skill in the art shall fall within the protection scope of the embodiments of the present invention. The specific implementations of the embodiments of the present invention are further described below in conjunction with the drawings.

Embodiment 1

Referring to FIG. 2a, a structural block diagram of an inference system according to Embodiment 1 of the present invention is shown. The inference system of this embodiment includes a first computing device 202 and a second computing device 204 connected to each other, wherein the first computing device 202 is provided with an inference client 2022, and the second computing device 204 is provided with an inference server 2042 and an inference acceleration resource 2044. The inference client 2022 is used to obtain the model information of the computing model for inference and the data to be inferred, and to respectively send the model information and the data to be inferred to the inference server 2042 in the second computing device 204; the inference server 2042 is used to load and call, through the inference acceleration resource 2044, the computing model indicated by the model information, to perform inference processing on the data to be inferred through the computing model, and to feed the result of the inference processing back to the inference client 2022.
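The two-phase exchange just described (model information first, then data) can be summarised from the client's point of view as below. This reuses the illustrative message types sketched earlier, and `channel.request` stands for an assumed transport call rather than an API defined by the patent.

```python
def remote_inference(channel, info: ModelInfo, request: InferenceRequest) -> InferenceResult:
    """Client-side view of the Embodiment 1 exchange: ship the model information first,
    then the data to be inferred, and receive the inference result."""
    channel.request("load_model", info=info)            # server loads the model on its accelerator
    output = channel.request("infer", request=request)  # server runs inference and replies
    return InferenceResult(model_id=info.model_id, output=output)
```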
In a feasible implementation, the inference client 2022 in the first computing device 202 first obtains the model information of the computing model for inference and sends it to the inference server 2042 in the second computing device 204; the inference server 2042 loads the computing model indicated by the model information through the inference acceleration resource 2044; the inference client 2022 then obtains the data to be inferred and sends it to the inference server 2042; the inference server 2042 calls the loaded computing model using the inference acceleration resource 2044, performs inference processing on the data to be inferred through the computing model, and feeds the result of the inference processing back to the inference client 2022.

In the above inference system, because the second computing device 204 has the inference acceleration resource 2044, it can efficiently load the computing model used for inference and perform inference computation on large amounts of data. Moreover, because the second computing device 204 hosting the inference acceleration resource 2044 and the first computing device 202 are set up independently of each other, the inference acceleration resource 2044 (for example a GPU) does not need to follow a fixed specification paired with the processor resource (for example a CPU) of the first computing device 202, so the inference acceleration resource 2044 can be realized more flexibly and in more varied forms, including but not limited to a GPU or an NPU. The first computing device 202 therefore only needs to be configured with a resource for routine data processing, such as a CPU.

The computing model for inference may be any appropriate model set according to business requirements, as long as it is applicable to a deep learning framework (including but not limited to the Tensorflow, Mxnet or PyTorch frameworks). In one feasible manner, the second computing device 204 is pre-configured with a resource pool of computing models: if the required model is in the pool, it can be loaded and used directly; if not, it can be obtained from the first computing device 202. In another feasible manner, the second computing device 204 has no preset resource pool; when inference is needed, the required computing model is obtained from the first computing device 202 and stored locally. After multiple inferences, the different models obtained in this way eventually form a resource pool. The models may come from different first computing devices 202, that is, the second computing device 204 can provide inference services for different first computing devices 202 and obtain different computing models from them.

The model information that the inference client 2022 sends to the inference server 2042 can uniquely identify the computing model, for example identification information such as an ID number, but it is not limited to this.
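The server-side resource pool described above behaves like a cache of models keyed by their model information. A minimal sketch under that assumption follows; the class and method names are illustrative, not defined by the patent.

```python
class ModelPool:
    """Server-side cache of computing models, keyed by model information (e.g. an ID or checksum)."""

    def __init__(self):
        self._models: dict[str, bytes] = {}   # model key -> serialized model

    def has(self, key: str) -> bool:
        return key in self._models

    def add(self, key: str, serialized_model: bytes) -> None:
        # Store a model received from a first computing device for reuse in later inferences.
        self._models[key] = serialized_model

    def get(self, key: str) -> bytes:
        # Raises KeyError if the model still has to be fetched from the client.
        return self._models[key]
```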
In a feasible manner, the model information may also be verification information of the computing model, such as MD5 information. Verification information identifies the computing model on the one hand and allows the model to be checked for integrity on the other, so a single piece of information serves multiple purposes and reduces the cost of information processing. The model information can be obtained when the first computing device 202 loads the model.

Hereinafter, a specific example is used to illustrate the structure of the above inference system, as shown in FIG. 2b. In FIG. 2b, the first computing device 202 is implemented as a terminal device, the first terminal device, in which a CPU is provided for the corresponding business processing, and the second computing device 204 is also implemented as a terminal device, the second terminal device, which is provided with a GPU as the inference acceleration resource. The first computing device 202 is loaded with a deep learning framework and with an inference client set inside that framework; the second computing device 204 is correspondingly provided with an inference server. In this example, the second computing device 204 is also provided with a resource pool of computing models, in which multiple computing models, for example models A, B, C and D, are stored. Those skilled in the art should understand that this example is only illustrative: in practical applications, the first computing device 202 and the second computing device 204 may both be terminal devices, may both be servers, or one may be a server and the other a terminal device, which is not limited by the embodiments of the present invention.

Based on the inference system of FIG. 2b, one inference process is as follows. Take image recognition as an example. When the deep learning framework loads the model, the corresponding model information is obtained; the inference client first sends this model information to the second terminal device, which receives it through its inference server. Assume the model information indicates that the model to be used is model A and that the resource pool of the second terminal device stores models A, B, C and D; the second terminal device then loads model A from the pool directly through the GPU. Next, the second terminal device obtains the data to be inferred, such as the image to be recognized, from the first terminal device through the inference server and inference client, and the second terminal device calls model A on the GPU to perform target-object recognition on the image, for example identifying whether the image contains a person. After recognition, the second terminal device sends the recognition result through the inference server to the inference client of the first terminal device, which hands it over to the CPU for subsequent processing, such as adding AR effects.
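As a concrete illustration of using verification information as model information, the sketch below computes an MD5 digest over a serialized model file. The function name and the choice of file-based hashing are assumptions made for illustration.

```python
import hashlib

def model_fingerprint(model_path: str) -> str:
    """MD5 digest of a serialized model file, usable both to identify and to verify the model."""
    digest = hashlib.md5()
    with open(model_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest()
```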
It should be noted that in the embodiments of the present invention, unless otherwise specified, quantities described with words such as "multiple" mean two or more.

According to the inference system provided by this embodiment, inference processing is deployed on different first and second computing devices. The second computing device is provided with an inference acceleration resource and performs the main inference processing through the computing model, while the first computing device is responsible for the data processing before and after inference. An inference client is deployed in the first computing device and an inference server in the second; during inference the two devices interact through them. The inference client first sends the model information of the computing model to the inference server, which loads the corresponding model using the inference acceleration resource; the inference client then sends the data to be inferred, and on receiving it the inference server performs inference processing through the loaded model. The computing resources used for inference are thus decoupled: inference processing through the computing model and the data processing outside it can run on different computing devices, only one of which needs an inference acceleration resource such as a GPU. A single electronic device no longer needs to contain both a CPU and a GPU, which effectively solves the problem that the fixed CPU/GPU specification of existing heterogeneous computing machines restricts the deployment of applications involving inference and prevents a wide range of inference scenarios from being satisfied. In addition, when a user runs an application involving inference, the inference computation is seamlessly forwarded through the inference client and inference server to a remote device with inference acceleration resources, and this interaction is imperceptible to the user, so the business logic of the application and the user's habits remain unchanged, inference is realized at low cost, and the user experience is improved.

Embodiment 2

This embodiment further optimizes the inference system of Embodiment 1, as shown in FIG. 3a. As in Embodiment 1, the inference system includes a first computing device 202 and a second computing device 204 connected to each other, wherein the first computing device 202 is provided with an inference client 2022, and the second computing device 204 is provided with an inference server 2042 and an inference acceleration resource 2044.
The inference client 2022 in the first computing device 202 is used to obtain the model information of the computing model for inference and send it to the inference server 2042 in the second computing device 204; the inference server 2042 is used to load the computing model indicated by the model information through the inference acceleration resource 2044; the inference client 2022 is further used to obtain the data to be inferred and send it to the inference server 2042; and the inference server 2042 is further used to call the loaded computing model using the inference acceleration resource 2044, perform inference processing on the data to be inferred through the computing model, and feed the result of the inference processing back to the inference client 2022.

In one feasible manner, the first computing device 202 and the second computing device 204 are connected to each other through an elastic network, which includes but is not limited to an ENI (Elastic Network Interface) network. An elastic network has better scalability and flexibility, and connecting the two devices through it gives the inference system the same properties. This is not limiting, however: in practical applications the first computing device 202 and the second computing device 204 may be connected in any suitable manner or over any network, as long as data interaction between the two can be carried out smoothly.

In this embodiment, the inference client 2022 may optionally be implemented as a component embedded inside the deep learning framework in the first computing device 202, or as a callable file that can be invoked by the deep learning framework. A deep learning framework provides a platform on which programmers can conveniently deploy various computing models to realize different inference functions. Implementing the inference client 2022 as a component or callable file suited to a deep learning framework improves its compatibility and applicability on the one hand, and greatly reduces the cost of decoupling the inference computing resources on the other. The inference server 2042 can likewise be implemented as a component or callable file. Based on this structure, the inference system of this embodiment can conveniently exchange the relevant data and information through the inference client 2022 and the inference server 2042, and realize inference processing by remotely calling the inference acceleration resource 2044.

In addition, in this embodiment, the inference client 2022 is further used to send the computing model to the inference server 2042 when it determines that the computing model does not exist in the second computing device 204. Optionally, the model information of the computing model is identification information or verification information of the model; the inference server 2042 is further used to determine, from the identification information or verification information, whether the computing model exists in the second computing device 204 and to return the determination result to the inference client 2022.
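The "send the model only when the server does not have it" behaviour can be sketched as a small client-side handshake. It reuses the illustrative `model_fingerprint` helper above; `channel` and its `request` method stand for an assumed transport and are not an API defined by the patent.

```python
def ensure_model_on_server(channel, model_path: str) -> str:
    """Ask the server whether it already holds the model (identified by its checksum)
    and upload the serialized model only if it does not. Returns the model key."""
    key = model_fingerprint(model_path)               # checksum doubles as the model key
    exists = channel.request("has_model", key=key)    # server checks its resource pool
    if not exists:
        with open(model_path, "rb") as f:
            channel.request("put_model", key=key, blob=f.read())
    return key
```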
This is not limiting, however; other ways of determining whether the second computing device 204 has the computing model are also applicable. For example, the second computing device 204 may broadcast the computing models it holds at regular intervals, or the first computing device 202 may, when needed or at regular intervals, actively send a message inquiring about the computing-model resources in the second computing device 204, and so on.

For example, if no resource pool of computing models is preset in the second computing device 204, or the required model is not in the pool, the inference client 2022 sends the computing model held in the first computing device 202 to the second computing device 204, including but not limited to the structure of the model and the data it contains. When the model information that the inference client 2022 sends to the inference server 2042 is identification information or verification information, the inference server 2042 first determines from this information whether the required computing model exists in the second computing device 204 and returns the determination result to the inference client 2022. If the result indicates that the required model does not exist in the second computing device 204, the first computing device 202 obtains the model locally and sends it to the second computing device 204, which runs it on its inference acceleration resource to perform the inference processing. In this way, it is effectively ensured that the second computing device 204, which has the inference acceleration resource, can complete the inference processing successfully.

In addition, in a feasible manner, the inference client 2022 is further used to obtain an inference request that asks the computing model to perform inference processing on the data to be inferred, to perform semantic analysis on that request, to determine from the analysis result the processing function of the computing model that needs to be called, and to send the information of that processing function to the inference server 2042. When performing inference processing on the data to be inferred through the computing model, the inference server 2042 calls the processing function indicated by the received function information in the loaded computing model.

In some business applications, the business may not need all of the inference functions of a computing model, only part of them. For example, suppose an inference task classifies text content but the current business only needs its computation function to add the corresponding text vectors. In that case, after the inference client 2022 receives an inference request asking the computing model to add text vectors, semantic analysis of the request determines that only the COMPUTE() function of the model needs to be called, and the information of that function is sent to the inference server 2042.
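The patent describes the semantic analysis of an inference request only abstractly, so the following is a deliberately toy sketch of mapping a request onto a single processing function. The request fields, the function names and the mapping table are all illustrative assumptions.

```python
def resolve_processing_function(inference_request: dict) -> dict:
    """Toy semantic analysis: decide which processing function of the computing model
    the request actually needs, and return its information (name plus API interface info)."""
    operation = inference_request.get("operation", "full_inference")
    mapping = {
        "add_text_vectors": {"function": "COMPUTE", "api": "COMPUTE(vec_a, vec_b)"},
        "full_inference":   {"function": "PREDICT", "api": "PREDICT(inputs)"},
    }
    return mapping.get(operation, mapping["full_inference"])
```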
After the inference server 2042 obtains the function information, it can directly call the COMPUTE() function of the computing model to perform the addition of the text vectors. In this way the computing model is used more precisely, its inference efficiency is greatly improved, and the inference burden is reduced. In a feasible manner, the processing-function information can be the API interface information of the processing function: the API interface information both quickly identifies which processing function of the computing model is to be used and directly provides the corresponding interface for use in the subsequent inference processing.

Optionally, the second computing device 204 is provided with one or more types of inference acceleration resources; when multiple types are provided, the different types have different usage priorities, and the inference server 2042 uses them according to a preset load-balancing rule and the priorities of the various types. For example, in addition to a GPU, the second computing device 204 may also be provided with an NPU or other inference acceleration resources. The priorities among multiple inference acceleration resources can be set in any appropriate way, for example according to operating speed or manually, which is not limited by the embodiments of the present invention. Further optionally, the second computing device 204 may also contain a CPU; in that case the GPU can be given the highest usage priority, the NPU the next, and the CPU the lowest. In this way, when the high-priority inference acceleration resources are heavily loaded, lower-priority resources can take over the inference processing, which both ensures that inference can be executed effectively and reduces the cost of inference acceleration resources. Note that the number of resources of a given type may be one or more, to be set by those skilled in the art as needed, and load-balancing rules other than the priority-based one described above may also be set according to actual needs; neither is limited by the embodiments of the present invention.

Hereinafter, a specific example is used to describe the inference system of this embodiment. As shown in FIG. 3b, unlike the traditional architecture in which the CPU and the inference acceleration resource such as a GPU sit in the same electronic device, this example decouples them into two parts: the CPU client machine (the front-end user machine) and the server accelerator pools (the back-end inference accelerator card machine). The front-end user machine is the machine the user operates to run the inference business, the back-end machine performs the inference computation, and the two communicate through an ENI.
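The priority-plus-load-balancing selection just described can be sketched as follows: pick the highest-priority accelerator type whose load is below a threshold, falling back to lower-priority types. The Accelerator fields and the 0.9 utilisation threshold are illustrative assumptions, not values given by the patent.

```python
from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str           # e.g. "GPU", "Ali-NPU", "Other Accelerator", "CPU"
    priority: int       # lower number means higher priority
    utilisation: float  # current load, 0.0 .. 1.0

def select_accelerator(pool: list[Accelerator], threshold: float = 0.9) -> Accelerator:
    """Choose the highest-priority accelerator that still has headroom."""
    for acc in sorted(pool, key=lambda a: a.priority):
        if acc.utilisation < threshold:
            return acc
    # Every type is saturated: fall back to the highest-priority one anyway.
    return min(pool, key=lambda a: a.priority)
```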
The front-end user machine hosts multiple inference frameworks, shown in the figure as "Tensorflow inference code", "pyTorch inference code" and "Mxnet inference code". The back-end inference accelerator card machine hosts a variety of inference accelerator cards, shown in the figure as "GPU", "Ali-NPU" and "Other Accelerator".

To forward the inference business of the front-end user machine to the back-end accelerator cards for execution and return the inference result, so that the user side perceives nothing, this example provides two components that reside on the front-end user machine and the back-end accelerator card machine respectively: the EAI client module (that is, the inference client) and the service daemon (that is, the inference server).

The EAI client module is a component on the front-end user machine whose functions include: a) communicating with the back-end service daemon over the network; b) parsing the semantics of the computing model and of inference requests; c) sending the parsed model semantics and the parsed inference request to the back-end service daemon; and d) receiving the inference result sent by the service daemon and returning it to the deep learning framework. In one implementation, the EAI client module is a plugin module embedded in the functional code of a deep learning framework (such as Tensorflow, pyTorch or Mxnet). When the inference business loads a computing model through the framework, the EAI client module intercepts the loaded model, parses its semantics to generate the model information, such as verification information (which may be MD5 information), and transfers the model and/or its information, together with the subsequent operations, to the back-end service daemon, which performs the actual inference computation.

The service daemon is the resident service component of the back-end inference accelerator card machine. Its functions include: a) receiving the model information and the parsed inference request sent by the EAI client module; b) selecting, based on that information, the most suitable inference accelerator card on the back-end machine; c) dispatching the inference computation to the selected accelerator card; and d) receiving the inference result computed by the accelerator card and returning it to the EAI client module. The GPU, Ali-NPU and Other Accelerator have a defined priority, for example GPU -> Ali-NPU -> Other Accelerator from high to low: in actual use the GPU is used first, the Ali-NPU is used if GPU resources are insufficient, and the Other Accelerator is used if Ali-NPU resources are still insufficient.

It can be seen that, unlike the traditional arrangement that binds the CPU and the GPU inference accelerator card into one machine through a PCIE slot, the elastic remote inference of this example decouples the CPU client machine from the inference accelerator cards (server accelerator pools) through an elastic network card; for users, buying a machine that hosts the CPU and GPU together is no longer necessary.
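One way such a client-side plugin could intercept model loading and redirect execution is sketched below. This is a hedged illustration, not the actual EAI client module: the wrapping approach, the proxy class and the `channel.request` transport are all assumptions, and it reuses the `ensure_model_on_server` helper sketched earlier.

```python
class EAIClientPlugin:
    """Sketch of a client-side plugin that intercepts model loading in a deep learning
    framework and forwards model information plus subsequent operations to a remote daemon."""

    def __init__(self, channel):
        self._channel = channel                                    # transport to the service daemon (assumed)

    def load_model(self, model_path: str) -> "RemoteModelProxy":
        key = ensure_model_on_server(self._channel, model_path)    # upload only if the daemon lacks the model
        return RemoteModelProxy(self._channel, key)

class RemoteModelProxy:
    """Stands in for the locally loaded model; every call is forwarded to the daemon."""

    def __init__(self, channel, key: str):
        self._channel = channel
        self._key = key

    def __call__(self, payload: bytes, function: str = "PREDICT") -> bytes:
        return self._channel.request("infer", key=self._key, function=function, payload=payload)
```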
The inference process based on the inference system shown in Figure 3b is shown in Figure 3c and includes: step ①, when the user starts an inference task through the deep learning framework and loads the calculation model, the EAI client module intercepts the calculation model and analyzes its semantics to obtain the calculation model information; it further obtains the user's inference request and analyzes it to obtain the information of the processing function to be used in the calculation model; step ②, the EAI client module connects to the service daemon through the elastic network and forwards the calculation model information and the processing function information to the back-end service daemon; step ③, the service daemon selects the optimal inference accelerator card based on the calculation model information and the processing function information, and loads the calculation model on that inference accelerator card to perform the inference calculation; step ④, the inference accelerator card returns the result of the inference calculation to the service daemon; step ⑤, the service daemon forwards the result of the inference calculation to the EAI client module through the elastic network; step ⑥, the EAI client module returns the result of the inference calculation to the deep learning framework.

As a result, the user runs the inference business on the front-end user machine, while the EAI client module and the service daemon automatically forward the inference business to the remote back-end inference accelerator card for the inference calculation and return the result to the deep learning framework on the front-end user machine. The user thus obtains transparent, elastic inference and enjoys the inference acceleration service without changing any inference code; moreover, the user does not need to purchase a machine with a GPU, and an ordinary CPU machine achieves the same inference acceleration effect without modifying any code logic.
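For illustration only, the information that travels in step ② and step ⑤ of the flow above might be encoded as follows; the field names and the JSON encoding are assumptions, since the embodiment only specifies which pieces of information are exchanged, not how they are serialized.

import json

# Step ②: calculation-model information plus processing-function information
# and the data to be inferred, forwarded over the elastic network to the daemon.
inference_request = {
    "model_md5": "9e107d9d372bb6826bd81d3542a419d6",  # verification information (illustrative value)
    "function_api": "COMPUTE",                         # API interface of the processing function
    "inputs": [[0.12, 0.48, 0.33]],                    # data to be inferred (e.g. a text vector)
}

# Step ⑤: the inference result that the service daemon sends back.
inference_reply = {
    "status": "ok",
    "outputs": [[0.93]],       # result of the inference calculation
    "accelerator": "GPU",      # which accelerator card served the request
}

wire_bytes = json.dumps(inference_request).encode()   # what actually crosses the network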
In a specific example where the deep learning framework is the Tensorflow framework, the interaction between the front-end user machine and the back-end inference accelerator card machine is shown in Figure 3d. The interaction includes: step 1, the front-end user machine loads the calculation model through the Tensorflow framework; step 2, the EAI client module intercepts the calculation model and verifies it; step 3, the EAI client module establishes a channel with the service daemon and transmits the calculation model; step 4, the service daemon analyzes the calculation model and selects the optimal inference accelerator from the accelerator card pool according to the analysis result; step 5, the selected inference accelerator loads the calculation model; step 6, the user inputs a picture or a piece of text and initiates an inference request; step 7, the EAI client module analyzes the inference request to obtain the information of the processing function to be called; step 8, the EAI client module sends the processing function information and the data to be processed to the service daemon; step 9, the service daemon forwards the processing function information and the data to be processed to the inference accelerator; step 10, the inference accelerator performs the inference calculation through the calculation model and sends the inference result to the service daemon; step 11, the service daemon transmits the inference result to the EAI client module; step 12, the EAI client module receives the inference result and passes it to the Tensorflow framework; step 13, the Tensorflow framework presents the inference result to the user. This realizes the elastic inference process under the Tensorflow framework.

According to the inference system provided in this embodiment, inference processing is deployed across two different computing devices. The second computing device is equipped with inference acceleration resources and performs the main inference processing through the calculation model, while the first computing device is responsible for the data processing before and after the inference processing. An inference client is deployed in the first computing device and an inference server is deployed in the second computing device, and during inference the two devices interact through them: the inference client first sends the model information of the calculation model to the inference server, which loads the corresponding calculation model using the inference acceleration resources; the inference client then sends the data to be inferred to the inference server, which, after receiving it, performs the inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing outside the inference processing can be carried out on different computing devices, only one of which needs to be equipped with inference acceleration resources such as a GPU, so no single electronic device has to host both the CPU and the GPU. This effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of applications involving inference and therefore cannot meet the needs of a wide range of inference scenarios. In addition, for the user, when using an application involving inference, the inference calculation is seamlessly transferred through the inference client and the inference server to the remote device that has the inference acceleration resources, and the interaction between the inference client and the inference server is imperceptible to the user; the business logic of the application and the user's habits for the inference business therefore remain unchanged, inference is realized at low cost, and the user experience is improved.

Embodiment 3 Referring to FIG. 4, it shows a flowchart of an inference method according to Embodiment 3 of the present invention.
The inference method of this embodiment describes the inference method of the present invention from the perspective of the first computing device, and includes the following steps.

Step S302: Obtain model information of a calculation model used for inference, and send the model information to a target computing device to instruct the target computing device to load the calculation model indicated by the model information through the inference acceleration resources provided in the target computing device. The target computing device in this embodiment can be implemented as the second computing device in the foregoing embodiments, and the execution of this step can refer to the relevant description of the inference client in the foregoing embodiments. For example, the model information of the calculation model can be obtained when the calculation model is loaded in the deep learning framework, and then sent to the target computing device; after receiving the model information, the target computing device loads the corresponding calculation model through its inference acceleration resource, such as a GPU.

Step S304: Obtain the data to be inferred, and send the data to be inferred to the target computing device to instruct the target computing device to call the loaded calculation model through the inference acceleration resource and perform inference processing on the data to be inferred through the calculation model. As mentioned above, the data to be inferred is any appropriate data on which an inference calculation is performed using the calculation model. After the target computing device has loaded the calculation model, the data to be inferred can be sent to it; upon receiving the data, the target computing device performs inference processing on it through the calculation model loaded by the inference acceleration resource such as the GPU.

Step S306: Receive the result of the inference processing fed back by the target computing device. After the target computing device has performed the inference processing on the data to be inferred through the calculation model loaded by the GPU, the result of the inference processing is obtained and sent to the execution subject of this embodiment, such as the inference client, which receives it.

In specific implementation, the inference method of this embodiment can be implemented by the inference client of the first computing device in the foregoing embodiments, and the specific implementation of the foregoing process can refer to the operation of the inference client in the foregoing embodiments, which will not be repeated here.
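For illustration only, steps S302 to S306 on the first computing device side can be condensed into a small client class; the channel object and its send()/recv() helpers, as well as the message fields, are assumptions made for this sketch rather than part of the described method.

class InferenceClient:
    def __init__(self, channel):
        self.channel = channel  # connection to the target computing device

    def load_remote_model(self, model_info: dict) -> None:
        # Step S302: send the model information so the target device loads the
        # calculation model through its inference acceleration resource.
        self.channel.send({"type": "LOAD_MODEL", "model_info": model_info})

    def infer(self, data) -> dict:
        # Step S304: send the data to be inferred; the target device calls the
        # loaded calculation model to run inference on it.
        self.channel.send({"type": "INFER", "data": data})
        # Step S306: receive the result of the inference processing.
        return self.channel.recv()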
Through this embodiment, inference processing is deployed across different computing devices: the target computing device is equipped with inference acceleration resources and performs the main inference processing through the calculation model, while the current computing device executing the inference method of this embodiment is responsible for the data processing before and after the inference processing. When performing inference, the current computing device first sends the model information of the calculation model to the target computing device, which loads the corresponding calculation model using its inference acceleration resources; the current computing device then sends the data to be inferred to the target computing device, which, after receiving it, performs the inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing outside the inference processing are carried out on different computing devices, only one of which needs to be equipped with inference acceleration resources such as a GPU, so no single electronic device has to host both the CPU and the GPU. This effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of applications involving inference and therefore cannot meet the needs of a wide range of inference scenarios. In addition, for the user, when using an application involving inference, the inference calculation is seamlessly transferred to the remote target computing device that has the inference acceleration resources, and the interaction between the current computing device and the target computing device is imperceptible to the user; the business logic of the application and the user's habits for the inference business therefore remain unchanged, inference is realized at low cost, and the user experience is improved.

Embodiment 4 The inference method of this embodiment again describes the inference method of the present invention from the perspective of the first computing device. Referring to FIG. 5, there is shown a flowchart of an inference method according to Embodiment 4 of the present invention. The inference method includes the following steps.

Step S402: Obtain model information of a calculation model used for inference, and send the model information to a target computing device. In a feasible manner, the model information of the calculation model is identification information or verification information of the calculation model. In a specific implementation, the first computing device may send the identification information or verification information to the target computing device, such as the second computing device in the foregoing embodiments; the target computing device determines, according to the identification information or verification information, whether the corresponding calculation model exists locally and feeds the judgment result back to the first computing device, so that the first computing device can determine whether the calculation model exists in the target computing device.

Step S404: If it is determined according to the model information that the calculation model does not exist in the target computing device, send the calculation model to the target computing device and instruct the target computing device to load the calculation model through the inference acceleration resources provided in the target computing device.
When the model information of the calculation model is the identification information or verification information of the calculation model, this step can be implemented as: if it is determined through the identification information or the verification information that the calculation model does not exist in the target computing device, send the calculation model to the target computing device, including sending the structure of the calculation model and its data. If the required calculation model does not exist in the target computing device, the calculation model is sent to it; the target computing device obtains and stores the calculation model, and if the same calculation model is used again later, the target computing device can obtain it directly from its local storage. This ensures that the inference processing proceeds smoothly regardless of whether the required calculation model is already available in the target computing device.

Step S406: Obtain the data to be inferred, and send the data to be inferred to the target computing device to instruct the target computing device to call the loaded calculation model through the inference acceleration resource and perform inference processing on the data to be inferred through the calculation model. In a feasible manner, obtaining the data to be inferred and sending it to the target computing device may include: obtaining an inference request that requests the calculation model to perform inference processing on the data to be inferred, and performing semantic analysis on the inference request; determining, according to the semantic analysis result, the processing function in the calculation model to be called; and sending the information of the processing function together with the data to be inferred to the target computing device, so as to instruct the target computing device to perform inference processing on the data to be inferred by calling the processing function indicated by the processing function information in the loaded calculation model. The information of the processing function may optionally be the API interface information of the processing function.

Step S408: Receive the result of the inference processing fed back by the target computing device.

In specific implementation, the inference method of this embodiment can be implemented by the inference client of the first computing device in the foregoing embodiments, and the specific implementation of the foregoing process can refer to the operation of the inference client in the foregoing embodiments, which will not be repeated here.
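For illustration only, steps S402 and S404, in which the verification information is sent first and the calculation model is uploaded only when the target computing device does not hold it, might be sketched as follows; the channel object and the message fields are assumptions made for this sketch.

def ensure_remote_model(channel, model_bytes: bytes, model_md5: str) -> None:
    # Step S402: send only the verification information of the calculation model.
    channel.send({"type": "MODEL_INFO", "md5": model_md5})
    reply = channel.recv()
    # Step S404: the model is absent on the target device, so transfer its
    # structure and data; otherwise nothing more needs to be sent.
    if not reply.get("model_cached", False):
        channel.send({"type": "MODEL_UPLOAD",
                      "md5": model_md5,
                      "payload": model_bytes})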
Through this embodiment, inference processing is deployed across different computing devices: the target computing device is equipped with inference acceleration resources and performs the main inference processing through the calculation model, while the current computing device executing the inference method of this embodiment is responsible for the data processing before and after the inference processing. When performing inference, the current computing device first sends the model information of the calculation model to the target computing device, which loads the corresponding calculation model using its inference acceleration resources; the current computing device then sends the data to be inferred to the target computing device, which, after receiving it, performs the inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing outside the inference processing are carried out on different computing devices, only one of which needs to be equipped with inference acceleration resources such as a GPU, so no single electronic device has to host both the CPU and the GPU. This effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of applications involving inference and therefore cannot meet the needs of a wide range of inference scenarios. In addition, for the user, when using an application involving inference, the inference calculation is seamlessly transferred to the remote target computing device that has the inference acceleration resources, and the interaction between the current computing device and the target computing device is imperceptible to the user; the business logic of the application and the user's habits for the inference business therefore remain unchanged, inference is realized at low cost, and the user experience is improved.

Embodiment 5 Referring to FIG. 6, it shows a flowchart of an inference method according to Embodiment 5 of the present invention. The inference method of this embodiment describes the inference method of the present invention from the perspective of the second computing device, and includes the following steps.

Step S502: Obtain the model information, sent by a source computing device, of a calculation model used for inference, and load the calculation model indicated by the model information through an inference acceleration resource. In this embodiment, the source computing device may be the first computing device in the foregoing embodiments, and the model information includes, but is not limited to, identification information and/or verification information.

Step S504: Obtain the data to be inferred sent by the source computing device, call the loaded calculation model through the inference acceleration resource, and perform inference processing on the data to be inferred through the calculation model. After the calculation model has been loaded through the inference acceleration resource, when the data to be inferred sent by the source computing device is received, the calculation model loaded by the inference acceleration resource is used to perform inference processing on it.

Step S506: Feed back the result of the inference processing to the source computing device.

In specific implementation, the inference method of this embodiment can be implemented by the inference server of the second computing device in the foregoing embodiments, and the specific implementation of the foregoing process can refer to the operation of the inference server in the foregoing embodiments, which will not be repeated here.
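For illustration only, steps S502 to S506 on the second computing device side might be sketched as a minimal handler; the runtime object standing for the inference acceleration resource, the channel object and the message fields are assumptions and do not refer to a real library.

class InferenceServer:
    def __init__(self, channel, runtime):
        self.channel = channel  # connection to the source computing device
        self.runtime = runtime  # wrapper around the inference acceleration resource
        self.model = None

    def serve_once(self) -> None:
        msg = self.channel.recv()
        if msg["type"] == "LOAD_MODEL":
            # Step S502: load the calculation model indicated by the model
            # information through the inference acceleration resource.
            self.model = self.runtime.load(msg["model_info"])
        elif msg["type"] == "INFER":
            # Step S504: call the loaded calculation model on the received data.
            result = self.model.run(msg["data"])
            # Step S506: feed the result of the inference processing back.
            self.channel.send({"status": "ok", "outputs": result})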
Through this embodiment, inference processing is deployed across different computing devices: the current computing device that executes the inference method of this embodiment is equipped with inference acceleration resources and performs the main inference processing through the calculation model, while the source computing device is responsible for the data processing before and after the inference processing. During inference, the source computing device first sends the model information of the calculation model to the current computing device, which loads the corresponding calculation model using its inference acceleration resources; the source computing device then sends the data to be inferred to the current computing device, which, after receiving it, performs the inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing outside the inference processing are carried out on different computing devices, only one of which needs to be equipped with inference acceleration resources such as a GPU, so no single electronic device has to host both the CPU and the GPU. This effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of applications involving inference and therefore cannot meet the needs of a wide range of inference scenarios. In addition, for the user, when using an application involving inference, the inference calculation is seamlessly transferred to the remote device that has the inference acceleration resources, and the interaction between the source computing device and the current computing device is imperceptible to the user; the business logic of the application and the user's habits for the inference business therefore remain unchanged, inference is realized at low cost, and the user experience is improved.

Embodiment 6 The inference method of this embodiment again describes the inference method of the present invention from the perspective of the second computing device. Referring to FIG. 7, there is shown a flowchart of an inference method according to Embodiment 6 of the present invention. The inference method includes the following steps.

Step S602: If it is determined, according to the model information of the calculation model, that the calculation model does not exist locally, request the calculation model from the source computing device, and after obtaining the calculation model from the source computing device, load the calculation model through the inference acceleration resource. In this embodiment, the model information of the calculation model may be the identification information or verification information of the calculation model; this step can then be implemented as: determining, according to the identification information or the verification information, that the calculation model does not exist locally, requesting the calculation model from the source computing device, and after obtaining the calculation model from the source computing device, loading it through the inference acceleration resource. The calculation model sent by the source computing device includes, but is not limited to, the structure of the calculation model and its corresponding data.
In addition, in an optional manner, the inference acceleration resource includes one or more types; when multiple types are included, different types of inference acceleration resources have different usage priorities. Loading the calculation model indicated by the model information through the inference acceleration resource then includes: loading the calculation model indicated by the model information using the inference acceleration resource according to a preset load balancing rule and the priorities of the multiple types of inference acceleration resource. The load balancing rule and the priorities can be set appropriately by those skilled in the art according to actual needs.

Step S604: Obtain the data to be inferred sent by the source computing device, call the loaded calculation model through the inference acceleration resource, and perform inference processing on the data to be inferred through the calculation model. In a feasible manner, this step can be implemented as: obtaining the data to be inferred sent by the source computing device together with the information of the processing function in the calculation model to be called, and performing inference processing on the data to be inferred by calling the processing function indicated by the processing function information in the loaded calculation model. The information of the processing function can be obtained by the source computing device by analyzing the inference request; optionally, it is the API interface information of the processing function.

Step S606: Feed back the result of the inference processing to the source computing device.

In specific implementation, the inference method of this embodiment can be implemented by the inference server of the second computing device in the foregoing embodiments, and the specific implementation of the foregoing process can refer to the operation of the inference server in the foregoing embodiments, which will not be repeated here.
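For illustration only, step S602, in which the calculation model is fetched from the source computing device when it is absent locally and is then loaded through an inference acceleration resource chosen by priority, might be sketched as follows; select_accelerator refers to the earlier selection sketch, and the cache, channel and runtime objects are assumptions made for this sketch.

local_model_cache = {}  # md5 -> handle of an already loaded calculation model


def load_by_info(channel, runtime, model_md5: str):
    if model_md5 not in local_model_cache:
        # The calculation model is absent locally: request it from the source
        # computing device (structure and data), then load it on the
        # accelerator selected by priority and load-balancing rule.
        channel.send({"type": "MODEL_REQUEST", "md5": model_md5})
        model_bytes = channel.recv()["payload"]
        accelerator = select_accelerator(runtime.pool)
        local_model_cache[model_md5] = runtime.load(model_bytes, on=accelerator)
    return local_model_cache[model_md5]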
Through this embodiment, inference processing is deployed across different computing devices: the current computing device that executes the inference method of this embodiment is equipped with inference acceleration resources and performs the main inference processing through the calculation model, while the source computing device is responsible for the data processing before and after the inference processing. During inference, the source computing device first sends the model information of the calculation model to the current computing device, which loads the corresponding calculation model using its inference acceleration resources; the source computing device then sends the data to be inferred to the current computing device, which, after receiving it, performs the inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing outside the inference processing are carried out on different computing devices, only one of which needs to be equipped with inference acceleration resources such as a GPU, so no single electronic device has to host both the CPU and the GPU. This effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of applications involving inference and therefore cannot meet the needs of a wide range of inference scenarios. In addition, for the user, when using an application involving inference, the inference calculation is seamlessly transferred to the remote device that has the inference acceleration resources, and the interaction between the source computing device and the current computing device is imperceptible to the user; the business logic of the application and the user's habits for the inference business therefore remain unchanged, inference is realized at low cost, and the user experience is improved.

Embodiment 7 Referring to FIG. 8, there is shown a schematic structural diagram of an electronic device according to Embodiment 7 of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the electronic device. As shown in FIG. 8, the electronic device may include a processor 702, a communication interface 704, a memory 706 and a communication bus 708, where the processor 702, the communication interface 704 and the memory 706 communicate with each other through the communication bus 708. The communication interface 704 is used to communicate with other electronic devices or servers. The processor 702 is configured to execute the program 710, and specifically can execute the relevant steps of the inference method embodiments in Embodiment 3 or Embodiment 4. Specifically, the program 710 may include program code, and the program code includes computer operation instructions. The processor 702 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the electronic device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs. The memory 706 is used to store the program 710 and may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one magnetic disk memory.

The program 710 may specifically be used to cause the processor 702 to perform the following operations: obtain model information of a calculation model used for inference, and send the model information to a target computing device to instruct the target computing device to load the calculation model indicated by the model information through the inference acceleration resources provided in the target computing device; obtain the data to be inferred, and send the data to be inferred to the target computing device to instruct the target computing device to call the loaded calculation model through the inference acceleration resource and perform inference processing on the data to be inferred through the calculation model; and receive the result of the inference processing fed back by the target computing device.
In an optional implementation, the program 710 is further configured to cause the processor 702 to send the calculation model to the target computing device if it is determined that the calculation model does not exist in the target computing device. In another optional implementation, the model information of the calculation model is identification information or verification information of the calculation model, and the program 710 further causes the processor 702 to determine, through the identification information or the verification information, whether the calculation model exists in the target computing device before sending the calculation model to the target computing device when the calculation model does not exist there. In a further optional implementation, when obtaining the data to be inferred and sending it to the target computing device, the program 710 causes the processor 702 to: obtain an inference request that requests the calculation model to perform inference processing on the data to be inferred, and perform semantic analysis on the inference request; determine, according to the semantic analysis result, the processing function in the calculation model to be called; and send the information of the processing function together with the data to be inferred to the target computing device, so as to instruct the target computing device to perform inference processing on the data to be inferred by calling the processing function indicated by the processing function information in the loaded calculation model. Optionally, the information of the processing function is the API interface information of the processing function.

For the specific implementation of each step in the program 710, reference may be made to the corresponding descriptions of the corresponding steps and units in the above inference method embodiments, which will not be repeated here. Those skilled in the art can clearly understand that, for convenience and conciseness of description, the specific working processes of the devices and modules described above can refer to the corresponding process descriptions in the foregoing method embodiments.

Through the electronic device of this embodiment, inference processing is deployed across different computing devices: the target computing device is equipped with inference acceleration resources and performs the main inference processing through the calculation model, while the current electronic device that executes the inference method of this embodiment is responsible for the data processing before and after the inference processing. When performing inference, the current electronic device first sends the model information of the calculation model to the target computing device, which loads the corresponding calculation model using its inference acceleration resources; the current electronic device then sends the data to be inferred to the target computing device, which, after receiving it, performs the inference processing through the loaded calculation model. As a result, the computing resources used for inference are decoupled.
The inference processing performed through the calculation model and the data processing outside the inference processing are carried out on different computing devices, only one of which needs to be equipped with inference acceleration resources such as a GPU, so no single electronic device has to host both the CPU and the GPU. This effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of applications involving inference and therefore cannot meet the needs of a wide range of inference scenarios. In addition, for the user, when using an application involving inference, the inference calculation is seamlessly transferred to the remote target computing device that has the inference acceleration resources, and the interaction between the current electronic device and the target computing device is imperceptible to the user; the business logic of the application and the user's habits for the inference business therefore remain unchanged, inference is realized at low cost, and the user experience is improved.

Embodiment 8 Referring to FIG. 9, there is shown a schematic structural diagram of an electronic device according to Embodiment 8 of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the electronic device. As shown in FIG. 9, the electronic device may include a processor 802, a communication interface 804, a memory 806 and a communication bus 808, where the processor 802, the communication interface 804 and the memory 806 communicate with each other through the communication bus 808. The communication interface 804 is used to communicate with other electronic devices or servers. The processor 802 is configured to execute the program 810, and specifically can execute the relevant steps of the inference method embodiments in Embodiment 5 or Embodiment 6. Specifically, the program 810 may include program code, and the program code includes computer operation instructions. The processor 802 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the electronic device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs. The memory 806 is used to store the program 810 and may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one magnetic disk memory.

The program 810 may specifically be used to cause the processor 802 to perform the following operations: obtain the model information, sent by a source computing device, of a calculation model used for inference, and load the calculation model indicated by the model information through an inference acceleration resource; obtain the data to be inferred sent by the source computing device, call the loaded calculation model through the inference acceleration resource, and perform inference processing on the data to be inferred through the calculation model; and feed back the result of the inference processing to the source computing device.
In an optional implementation, the model information of the calculation model is identification information or verification information of the calculation model, and when obtaining the model information of the calculation model used for inference sent by the source computing device and loading the calculation model indicated by the model information through the inference acceleration resource, the program 810 causes the processor 802 to: determine, according to the identification information or the verification information, that the calculation model does not exist locally, request the calculation model from the source computing device, and after obtaining the calculation model from the source computing device, load it through the inference acceleration resource. In another optional implementation, when obtaining the data to be inferred sent by the source computing device, calling the loaded calculation model through the inference acceleration resource and performing inference processing on the data to be inferred through the calculation model, the program 810 causes the processor 802 to: obtain the data to be inferred sent by the source computing device together with the information of the processing function in the calculation model to be called, and perform inference processing on the data to be inferred by calling the processing function indicated by the processing function information in the loaded calculation model. Optionally, the information of the processing function is the API interface information of the processing function. In a further optional implementation, the inference acceleration resource includes one or more types, and when multiple types are included, different types of inference acceleration resources have different usage priorities; when loading the calculation model indicated by the model information through the inference acceleration resource, the program 810 causes the processor 802 to load the calculation model using the inference acceleration resource according to a preset load balancing rule and the priorities of the multiple types of inference acceleration resource.

For the specific implementation of each step in the program 810, reference may be made to the corresponding descriptions of the corresponding steps and units in the above inference method embodiments, which will not be repeated here. Those skilled in the art can clearly understand that, for convenience and conciseness of description, the specific working processes of the devices and modules described above can refer to the corresponding process descriptions in the foregoing method embodiments.

Through the electronic device of this embodiment, inference processing is deployed across different computing devices: the current electronic device that executes the inference method of this embodiment is equipped with inference acceleration resources and performs the main inference processing through the calculation model, while the source computing device is responsible for the data processing before and after the inference processing.
When performing inference, the source computing device first sends the model information of the calculation model to the current electronic device, which loads the corresponding calculation model using its inference acceleration resources; the source computing device then sends the data to be inferred to the current electronic device, which, after receiving it, performs the inference processing through the loaded calculation model. This decouples the computing resources used for inference: the inference processing performed through the calculation model and the data processing outside the inference processing are carried out on different computing devices, only one of which needs to be equipped with inference acceleration resources such as a GPU, so no single electronic device has to host both the CPU and the GPU. This effectively solves the problem that the fixed CPU/GPU specifications of existing heterogeneous computing machines restrict the deployment of applications involving inference and therefore cannot meet the needs of a wide range of inference scenarios. In addition, for the user, when using an application involving inference, the inference calculation is seamlessly transferred to the remote device that has the inference acceleration resources, and the interaction between the source computing device and the current electronic device is imperceptible to the user; the business logic of the application and the user's habits for the inference business therefore remain unchanged, inference is realized at low cost, and the user experience is improved.

It should be pointed out that, according to the needs of implementation, each component or step described in the embodiments of the present invention can be split into more components or steps, and two or more components or steps, or partial operations of components or steps, can be combined into new components or steps to achieve the purpose of the embodiments of the present invention. The above methods according to the embodiments of the present invention can be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk or magneto-optical disk), or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded through a network to be stored in a local recording medium, so that the methods described here can be processed as such software on a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, a processor, a microprocessor controller or programmable hardware includes a storage component (for example RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor or hardware, the inference methods described here are implemented. In addition, when a general-purpose computer accesses code for implementing the inference methods shown here, the execution of the code converts the general-purpose computer into a dedicated computer for executing the inference methods shown here.
A person of ordinary skill in the art may be aware that the units and method steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods for each specific application to implement the described functions, but such implementation should not be considered as going beyond the scope of the embodiments of the present invention. The above embodiments are only used to illustrate the embodiments of the present invention and are not intended to limit them. Those of ordinary skill in the relevant technical field can make various changes and variations without departing from the spirit and scope of the embodiments of the present invention, so all equivalent technical solutions also fall within the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention should be defined by the claims.

100: electronic device
102: CPU
104: GPU
106: PCIE slot
108: motherboard
202: first computing device
204: second computing device
2022: inference client
2042: inference server
2044: inference acceleration resource
S302, S304, S306: steps
S402, S404, S406, S408: steps
S502, S504, S506: steps
S602, S604, S606: steps
702: processor
704: communication interface
706: memory
708: communication bus
710: program
802: processor
804: communication interface
806: memory
808: communication bus
810: program

In order to more clearly describe the technical solutions in the embodiments of the present invention or in the prior art, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in the embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings based on these drawings.
[Figure 1] is a schematic structural diagram of an electronic device with inference computing resources in the prior art;
[Figure 2a] is a structural block diagram of an inference system according to Embodiment 1 of the present invention;
[Figure 2b] is a schematic structural diagram of an example of an inference system according to an embodiment of the present invention;
[Figure 3a] is a structural block diagram of an inference system according to Embodiment 2 of the present invention;
[Figure 3b] is a schematic structural diagram of an example of an inference system according to an embodiment of the present invention;
[Figure 3c] is a schematic diagram of the inference process using the inference system shown in Figure 3b;
[Figure 3d] is an interaction diagram of inference using the inference system shown in Figure 3b;
[Figure 4] is a flowchart of an inference method according to Embodiment 3 of the present invention;
[Figure 5] is a flowchart of an inference method according to Embodiment 4 of the present invention;
[Figure 6] is a flowchart of an inference method according to Embodiment 5 of the present invention;
[Figure 7] is a flowchart of an inference method according to Embodiment 6 of the present invention;
[Figure 8] is a schematic structural diagram of an electronic device according to Embodiment 7 of the present invention;
[Figure 9] is a schematic structural diagram of an electronic device according to Embodiment 8 of the present invention.

202: first computing device

204: second computing device

2022: inference client

2042: inference server

2044: inference acceleration resource

Claims (20)

一種推理系統,其特徵在於,包括相互連接的第一計算設備和第二計算設備,其中,所述第一計算設備中設置有推理用戶端,所述第二計算設備中設置有推理加速資源以及推理伺服端; 其中: 所述推理用戶端用於獲取進行推理的計算模型的模型資訊和待推理資料,並分別將所述模型資訊和所述待推理資料發送至所述第二計算設備中的推理伺服端; 所述推理伺服端用於透過推理加速資源載入並呼叫所述模型資訊指示的計算模型,透過所述計算模型對所述待推理資料進行推理處理並向所述推理用戶端回饋所述推理處理的結果。An inference system, characterized by comprising a first computing device and a second computing device connected to each other, wherein the first computing device is provided with an inference client, the second computing device is provided with an inference acceleration resource and Inference server among them: The reasoning client is used to obtain model information and data to be inferred of a calculation model for inference, and respectively send the model information and the data to be inferred to the inference server in the second computing device; The reasoning server is used to load and call the calculation model indicated by the model information through the reasoning acceleration resource, perform reasoning processing on the data to be reasoned through the calculation model, and return the reasoning processing to the reasoning client the result of. 根據請求項1所述的推理系統,其中,所述推理用戶端還用於在確定所述第二計算設備中不存在所述計算模型時,將所述計算模型發送至所述推理伺服端。The inference system according to claim 1, wherein the inference client is further configured to send the calculation model to the inference server when it is determined that the calculation model does not exist in the second computing device. 根據請求項2所述的推理系統,其中,所述計算模型的模型資訊為所述計算模型的標識資訊或校驗資訊; 所述推理伺服端還用於透過所述標識資訊或所述校驗資訊,確定所述第二計算設備中是否存在所述計算模型,並將確定結果返回給所述推理用戶端。The reasoning system according to claim 2, wherein the model information of the calculation model is identification information or verification information of the calculation model; The inference server is also used to determine whether the calculation model exists in the second computing device through the identification information or the verification information, and return the determination result to the inference client. 根據請求項1所述的推理系統,其中, 所述推理用戶端還用於獲取請求所述計算模型對所述待推理資料進行推理處理的推理請求,並對所述推理請求進行語義分析,根據語義分析結果確定待呼叫的所述計算模型中的處理函式,將所述處理函式的資訊發送給所述推理伺服端; 所述推理伺服端在所述透過所述計算模型對所述待推理資料進行推理處理時,透過呼叫載入的所述計算模型中所述處理函式的資訊指示的處理函式,對所述待推理資料進行推理處理。The reasoning system according to claim 1, wherein: The reasoning client is also used to obtain a reasoning request requesting the calculation model to perform reasoning processing on the data to be reasoned, and perform semantic analysis on the reasoning request, and determine the calculation model to be called according to the semantic analysis result The processing function of, sending the information of the processing function to the inference server; When the inference server performs inference processing on the data to be inferred through the calculation model, by calling the processing function indicated by the information of the processing function in the loaded calculation model, the The data to be inferred is processed inferentially. 根據請求項4所述的推理系統,其中,所述處理函式的資訊為所述處理函式的API介面資訊。The reasoning system according to claim 4, wherein the information of the processing function is API interface information of the processing function. 
6. The inference system according to claim 1, wherein the second computing device is provided with one or more types of inference acceleration resources; when the inference acceleration resources include multiple types, different types of inference acceleration resources have different usage priorities; and the inference server uses the inference acceleration resources according to a preset load balancing rule and the priorities of the multiple types of inference acceleration resources.

7. The inference system according to any one of claims 1 to 6, wherein the first computing device and the second computing device are connected to each other through an elastic network.

8. The inference system according to any one of claims 1 to 6, wherein the inference client is a component embedded inside a deep learning framework in the first computing device, or the inference client is a callable file that can be called by the deep learning framework.

9. An inference method, characterized in that the method comprises: acquiring model information of a computing model used for inference, and sending the model information to a target computing device, so as to instruct the target computing device to load, by means of an inference acceleration resource provided in the target computing device, the computing model indicated by the model information; acquiring data to be inferred, and sending the data to be inferred to the target computing device, so as to instruct the target computing device to call the loaded computing model by means of the inference acceleration resource and to perform inference processing on the data to be inferred through the computing model; and receiving the result of the inference processing fed back by the target computing device.

10. The method according to claim 9, further comprising: if it is determined that the computing model does not exist in the target computing device, sending the computing model to the target computing device.

11. The method according to claim 10, wherein the model information of the computing model is identification information or verification information of the computing model; and before the sending of the computing model to the target computing device if it is determined that the computing model does not exist in the target computing device, the method further comprises: determining, through the identification information or the verification information, whether the computing model exists in the target computing device.

12. The method according to claim 9, wherein the acquiring of the data to be inferred and the sending of the data to be inferred to the target computing device comprise: acquiring an inference request that requests the computing model to perform inference processing on the data to be inferred, and performing semantic analysis on the inference request; and determining, according to the result of the semantic analysis, a processing function in the computing model to be called, and sending information of the processing function and the data to be inferred to the target computing device, so as to instruct the target computing device to perform inference processing on the data to be inferred by calling the processing function indicated by the information of the processing function in the loaded computing model.

13. The method according to claim 12, wherein the information of the processing function is API interface information of the processing function.

14. An inference method, characterized in that the method comprises: acquiring model information, sent by a source computing device, of a computing model used for inference, and loading, by means of an inference acceleration resource, the computing model indicated by the model information; acquiring data to be inferred sent by the source computing device, calling the loaded computing model by means of the inference acceleration resource, and performing inference processing on the data to be inferred through the computing model; and feeding back the result of the inference processing to the source computing device.

15. The method according to claim 14, wherein the model information of the computing model is identification information or verification information of the computing model; and the acquiring of the model information, sent by the source computing device, of the computing model used for inference, and the loading, by means of the inference acceleration resource, of the computing model indicated by the model information comprise: when it is determined, according to the identification information or the verification information, that the computing model does not exist locally, requesting the computing model from the source computing device, and after the computing model is acquired from the source computing device, loading the computing model by means of the inference acceleration resource.

16. The method according to claim 14, wherein the acquiring of the data to be inferred sent by the source computing device, the calling of the loaded computing model by means of the inference acceleration resource, and the performing of inference processing on the data to be inferred through the computing model comprise: acquiring the data to be inferred sent by the source computing device and information of a processing function in the computing model to be called, and performing inference processing on the data to be inferred by calling the processing function indicated by the information of the processing function in the loaded computing model.

17. The method according to claim 16, wherein the information of the processing function is API interface information of the processing function.

18. The method according to claim 14, wherein the inference acceleration resource includes one or more types; when the inference acceleration resource includes multiple types, different types of inference acceleration resources have different usage priorities; and the loading, by means of the inference acceleration resource, of the computing model indicated by the model information comprises: loading the computing model indicated by the model information using the inference acceleration resources according to a preset load balancing rule and the priorities of the multiple types of inference acceleration resources.

19. An electronic device, comprising a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus; and the memory is used for storing at least one executable instruction, the executable instruction causing the processor to perform operations corresponding to the inference method according to any one of claims 9 to 13, or causing the processor to perform operations corresponding to the inference method according to any one of claims 14 to 18.

20. A computer storage medium having a computer program stored thereon, wherein when the program is executed by a processor, the inference method according to any one of claims 9 to 13 is implemented, or the inference method according to any one of claims 14 to 18 is implemented.
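Claims 9 to 13 describe the client side of the protocol: send the model information (identification or verification information), ship the computing model only when the target computing device does not already hold it, run semantic analysis on the inference request to pick a processing function, then send that function's API information together with the data to be inferred and wait for the result. The Python sketch below illustrates that flow under stated assumptions; the function names, the checksum used as verification information, and the in-process transport stub are hypothetical and not taken from the patent.

```python
import hashlib
import pickle

# Minimal client-side sketch of the flow in claims 9 to 13, assuming the link to
# the target computing device is a simple request/response callable.

def model_fingerprint(model_bytes: bytes) -> str:
    """Verification information for a serialized computing model (assumed here to be a SHA-256 digest)."""
    return hashlib.sha256(model_bytes).hexdigest()

def parse_inference_request(request: dict) -> str:
    """Toy semantic analysis: map the requested task to the name of a processing function (its API information)."""
    task = request.get("task", "")
    if "classify" in task:
        return "predict_proba"
    if "detect" in task:
        return "detect_objects"
    return "predict"

def run_inference(transport, model_bytes: bytes, request: dict):
    info = {"model_id": request["model_id"], "checksum": model_fingerprint(model_bytes)}
    # Send the model information first; ship the model itself only if the target lacks it.
    if not transport("has_model", info):
        transport("upload_model", {"info": info, "model": model_bytes})
    transport("load_model", info)
    # Send the processing-function information together with the data to be inferred.
    payload = {
        "checksum": info["checksum"],
        "function": parse_inference_request(request),
        "data": request["data"],
    }
    # Receive the inference result fed back by the target computing device.
    return transport("infer", payload)

if __name__ == "__main__":
    # In-process stand-in for the target computing device so the sketch runs locally.
    store = {}
    def fake_transport(op, msg):
        if op == "has_model":
            return msg["checksum"] in store
        if op == "upload_model":
            store[msg["info"]["checksum"]] = pickle.loads(msg["model"])
            return True
        if op == "load_model":
            return msg["checksum"] in store
        if op == "infer":
            return [x * 2 for x in msg["data"]]  # placeholder result
        raise ValueError(f"unknown operation: {op}")

    dummy_model = pickle.dumps({"weights": [0.1, 0.2]})
    result = run_inference(fake_transport, dummy_model,
                           {"model_id": "m1", "task": "classify image", "data": [1, 2, 3]})
    print(result)  # -> [2, 4, 6]
```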
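Claims 14 to 18 describe the target side: load the computing model through an inference acceleration resource, choose among multiple resource types according to a preset load balancing rule and per-type priorities, dispatch to the processing function indicated by the received API interface information, and feed the result back. The sketch below is one possible reading, assuming a "highest priority first, then least loaded" selection rule and plain-Python stand-ins for the accelerator-backed processing functions; a production implementation would bind model loading and the processing functions to an actual GPU/NPU runtime.

```python
from typing import Any, Callable, Dict, List

# Minimal sketch of the target-side behaviour in claims 14 to 18. The resource
# names, the selection rule, and the processing functions are illustrative assumptions.

class AccelResource:
    def __init__(self, name: str, priority: int):
        self.name = name
        self.priority = priority      # lower value = preferred resource type
        self.active_jobs = 0          # crude load signal for the balancing rule

class InferenceServer:
    def __init__(self, resources: List[AccelResource]):
        self.resources = resources
        # checksum -> table of processing functions exposed by the loaded model
        self.models: Dict[str, Dict[str, Callable]] = {}

    def pick_resource(self) -> AccelResource:
        """Preset rule: prefer the highest-priority type, break ties by least load."""
        return min(self.resources, key=lambda r: (r.priority, r.active_jobs))

    def has_model(self, checksum: str) -> bool:
        return checksum in self.models

    def load_model(self, checksum: str, model_bytes: bytes) -> str:
        resource = self.pick_resource()
        # A real server would hand model_bytes to the accelerator runtime here;
        # the sketch registers plain-Python functions keyed by their API names.
        self.models[checksum] = {
            "predict": lambda data: [x * 2 for x in data],
            "predict_proba": lambda data: [x / (sum(data) or 1) for x in data],
        }
        return resource.name

    def infer(self, checksum: str, function_name: str, data: Any) -> Any:
        resource = self.pick_resource()
        resource.active_jobs += 1
        try:
            # Dispatch to the processing function indicated by the received API information.
            return self.models[checksum][function_name](data)
        finally:
            resource.active_jobs -= 1

if __name__ == "__main__":
    server = InferenceServer([AccelResource("gpu-0", priority=0),
                              AccelResource("vpu-0", priority=1)])
    checksum = "abc123"
    if not server.has_model(checksum):
        # In the claimed flow the model would be requested from the source device when missing.
        print("loaded on", server.load_model(checksum, model_bytes=b"..."))
    print(server.infer(checksum, "predict", [1, 2, 3]))   # -> [2, 4, 6]
```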
TW109128235A 2019-11-08 2020-08-19 Inference system, inference method, electronic device and computer storage medium TW202119255A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911089253.XA CN112784989B (en) 2019-11-08 2019-11-08 Inference system, inference method, electronic device, and computer storage medium
CN201911089253.X 2019-11-08

Publications (1)

Publication Number Publication Date
TW202119255A (en) 2021-05-16

Family

ID=75748575

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109128235A TW202119255A (en) 2019-11-08 2020-08-19 Inference system, inference method, electronic device and computer storage medium

Country Status (3)

Country Link
CN (1) CN112784989B (en)
TW (1) TW202119255A (en)
WO (1) WO2021088964A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI832279B (en) * 2022-06-07 2024-02-11 宏碁股份有限公司 Artificial intelligence model calculation acceleration system and artificial intelligence model calculation acceleration method

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344208B (en) * 2021-06-25 2023-04-07 中国电信股份有限公司 Data reasoning method, device and system
CN116127082A (en) * 2021-11-12 2023-05-16 华为技术有限公司 Data acquisition method, system and related device
WO2024000605A1 (en) * 2022-07-01 2024-01-04 北京小米移动软件有限公司 Ai model reasoning method and apparatus
CN114997401B (en) * 2022-08-03 2022-11-04 腾讯科技(深圳)有限公司 Adaptive inference acceleration method, apparatus, computer device, and storage medium
CN116402141B (en) * 2023-06-09 2023-09-05 太初(无锡)电子科技有限公司 Model reasoning method and device, electronic equipment and storage medium
CN116723191B (en) * 2023-08-07 2023-11-10 深圳鲲云信息科技有限公司 Method and system for performing data stream acceleration calculations using acceleration devices

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101126524B1 (en) * 2010-06-25 2012-03-22 국민대학교산학협력단 User-centered context awareness system, context translation method therefor and Case-based Inference method therefor
CN104020983A (en) * 2014-06-16 2014-09-03 上海大学 KNN-GPU acceleration method based on OpenCL
CN105808568B (en) * 2014-12-30 2020-02-14 华为技术有限公司 Context distributed reasoning method and device
CN106383835A (en) * 2016-08-29 2017-02-08 华东师范大学 Natural language knowledge exploration system based on formal semantics reasoning and deep learning
US20180157747A1 (en) * 2016-12-02 2018-06-07 Microsoft Technology Licensing, Llc Systems and methods for automated query answer generation
CN108171117B (en) * 2017-12-05 2019-05-21 南京南瑞信息通信科技有限公司 Electric power artificial intelligence visual analysis system based on multicore heterogeneous Computing
CN109145168A (en) * 2018-07-11 2019-01-04 广州极天信息技术股份有限公司 A kind of expert service robot cloud platform
CN109902818B (en) * 2019-01-15 2021-05-25 中国科学院信息工程研究所 Distributed acceleration method and system for deep learning training task

Also Published As

Publication number Publication date
CN112784989B (en) 2024-05-03
CN112784989A (en) 2021-05-11
WO2021088964A1 (en) 2021-05-14

Similar Documents

Publication Publication Date Title
WO2021088964A1 (en) Inference system, inference method, electronic device and computer storage medium
US11790004B2 (en) Systems, methods, and apparatuses for providing assistant deep links to effectuate third-party dialog session transfers
US11228683B2 (en) Supporting conversations between customers and customer service agents
RU2696345C2 (en) Intellectual streaming of multimedia content
CN110557284A (en) data aggregation method and device based on client gateway
CN109788029A (en) Gray scale call method, device, terminal and the readable storage medium storing program for executing of micro services
WO2019047708A1 (en) Resource configuration method and related product
CN111200606A (en) Deep learning model task processing method, system, server and storage medium
KR20210104880A (en) Creating and/or prioritizing pre-call content for rendering when waiting for incoming call acceptance
US20240152393A1 (en) Task execution method and apparatus
US20230412538A1 (en) Automated assistant architecture for preserving privacy of application content
CN115454576B (en) Virtual machine process management method and system and electronic equipment
WO2017185632A1 (en) Data transmission method and electronic device
CN116700705A (en) Service system docking method and system based on configuration
CN115700482A (en) Task execution method and device
US11095748B1 (en) Network-based content rendering
US10878187B1 (en) Network-based content rendering
CN113658213B (en) Image presentation method, related device and computer program product
CN113313196B (en) Labeling data processing method, related device and computer program product
CN112615928B (en) Data processing method, device and storage medium
CN117373450A (en) Streaming data return method, device, equipment and storage medium
CN115699167A (en) Compensating for hardware differences when determining whether to offload assistant-related processing tasks from certain client devices
CN115795188A (en) Method and device for acquiring serial number, intelligent terminal and storage medium
CN117873652A (en) Task processing method and device for federal learning and federal learning reasoning system
CN113746754A (en) Data transmission method, device, equipment and storage medium