WO2023206889A1 - Model inference methods and apparatuses, devices, and storage medium


Info

Publication number
WO2023206889A1
Authority
WO
WIPO (PCT)
Prior art keywords
data, computing core, computing, processed, graph
Application number
PCT/CN2022/115511
Other languages
French (fr)
Chinese (zh)
Inventor
潘能超
王桂彬
董昊
王知践
Original Assignee
北京百度网讯科技有限公司 (Beijing Baidu Netcom Science Technology Co., Ltd.)
Application filed by 北京百度网讯科技有限公司 (Beijing Baidu Netcom Science Technology Co., Ltd.)
Publication of WO2023206889A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the field of data processing technology, especially to the field of artificial intelligence technology, and further relates to model inference methods, devices, equipment and storage media.
  • the model inference process of the neural network model can be composed of multiple different data processing links. Running different computing cores in the neural network model in sequence can complete different data processing links, thereby realizing the model inference process.
  • the present disclosure provides a model inference method, apparatus, device, and storage medium.
  • a model inference method is provided, applied to a GPU, including:
  • each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge.
  • a model inference method applied to a CPU, including:
  • each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by the edge.
  • when there is a target processing request, the data to be processed is sent to the GPU, so that the GPU sequentially runs each computing core according to the running order represented by the computing core graph, processes the data to be processed, completes the inference process of the target neural network model, and feeds back the model inference result to the CPU, wherein the target processing request is a request to use the target neural network model to process the data to be processed;
  • a model inference device applied to a GPU, including:
  • the computing core graph receiving module is used to receive the computing core graph corresponding to the target neural network model sent by the CPU, wherein each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by the edge;
  • the model inference module is used to, after receiving the data to be processed sent by the CPU, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the inference process of the target neural network model;
  • a result feedback module is used to feed back model inference results to the CPU.
  • a model inference device applied to a CPU, including:
  • the computing core graph sending module is used to send the pre-constructed computing core graph to the graphics processor GPU, wherein each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
  • a data sending module, configured to send the data to be processed to the GPU when there is a target processing request, so that the GPU runs each computing core in sequence according to the running order represented by the computing core graph, processes the data to be processed based on the preset storage space in the video memory, completes the inference process of the target neural network model, and feeds back the model inference result to the CPU, wherein the target processing request is a request to use the target neural network model to process the data to be processed;
  • the first result receiving module is used to receive the model inference result fed back by the GPU.
  • an electronic device including:
  • at least one GPU; and a memory communicatively connected to the at least one GPU, wherein:
  • the memory stores instructions that can be executed by the at least one GPU, and the instructions are executed by the at least one GPU to enable the at least one GPU to execute any one of the above model inference methods applied to a GPU.
  • an electronic device including:
  • at least one CPU; and a memory communicatively connected to the at least one CPU, wherein the memory stores instructions that can be executed by the at least one CPU, and the instructions are executed by the at least one CPU to enable the at least one CPU to execute any one of the above model inference methods applied to a CPU.
  • a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause the computer to execute any one of the above model inference methods applied to a GPU or any one of the above model inference methods applied to a CPU.
  • a computer program product is provided, including a computer program that, when executed by a processor, implements any one of the above model inference methods applied to a GPU or any one of the above model inference methods applied to a CPU.
  • the GPU obtains in advance the computing core diagram corresponding to the target neural network model sent by the CPU.
  • the above computing core graph contains each computing core included in the target neural network model and can represent the running order of each computing core in the target neural network model.
  • after the GPU receives the data to be processed sent by the CPU, it can call the above computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process.
  • compared with a solution in which the CPU sends each computing core to the GPU one by one so that the GPU can run it, in the embodiments of the present disclosure the GPU can directly perform model inference based on the computing core graph, and the computing cores do not need to be passed between the CPU and the GPU during inference.
  • the number of interactions between the CPU and the GPU is therefore smaller, which can reduce the impact of the interaction between the CPU and the GPU on GPU model inference, thereby improving the efficiency of GPU model inference.
  • Figure 1 is a schematic flow chart of a first model inference method provided by an embodiment of the present disclosure.
  • Figure 2 is a schematic structural diagram of a computing core graph provided by an embodiment of the present disclosure.
  • Figure 3A is a schematic flow chart of a second model inference method provided by an embodiment of the present disclosure.
  • Figure 3B is a schematic diagram of a first computing core graph selection process provided by an embodiment of the present disclosure.
  • Figure 4 is a schematic flow chart of a third model inference method provided by an embodiment of the present disclosure.
  • Figure 5A is a schematic flow chart of a fourth model inference method provided by an embodiment of the present disclosure.
  • Figure 5B is a schematic diagram of a second computing core graph selection process provided by an embodiment of the present disclosure.
  • Figure 6A is a schematic flow chart of a fifth model inference method provided by an embodiment of the present disclosure.
  • Figure 6B is a schematic flow chart of a sixth model inference method provided by an embodiment of the present disclosure.
  • Figure 7 is a schematic structural diagram of a first model inference device provided by an embodiment of the present disclosure.
  • Figure 8 is a schematic structural diagram of a second model inference device provided by an embodiment of the present disclosure.
  • Figure 9 is a schematic structural diagram of a third model inference device provided by an embodiment of the present disclosure.
  • Figure 10 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
  • Figure 11 is a schematic block diagram of another electronic device provided by an embodiment of the present disclosure.
  • Embodiments of the present disclosure are applied to application scenarios in which a CPU and a GPU collaborate to perform model inference. Since the GPU processes data such as images, videos, 3D graphics, and audio very quickly, services such as image recognition, voice interaction, and image retrieval can be efficiently completed through the GPU.
  • the GPU can complete the above services through model inference based on a neural network model, and the CPU can send the computing cores included in the neural network model to the GPU, so that the GPU runs each computing core to complete the model inference process.
  • the above-mentioned CPU and GPU can run in the same electronic device, and the above-mentioned electronic device can be a computer, a mobile phone, a server, etc.
  • Electronic devices equipped with the above-mentioned CPU and GPU can receive data processing requests sent by other devices.
  • the above-mentioned data processing requests contain the data to be processed, and request the CPU and GPU to complete the model inference process on that data.
  • referring to Figure 1, which is a schematic flow chart of a first model inference method provided by an embodiment of the present disclosure, the method is applied to a GPU.
  • the above method includes the following steps S101-S103.
  • S101: receive the computing core graph corresponding to the target neural network model sent by the CPU. Each node in the above-mentioned computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by the edge.
  • the GPU can store the above-mentioned operation core map.
  • Different computing cores correspond to different data processing links, and the GPU can implement different data processing links based on different computing cores.
  • the above-mentioned data processing links may include matrix multiplication calculations, data activation processing, data division calculations, etc.
  • the structure of the target neural network model is relatively fixed, that is, the execution order of each data processing link in the data processing process through the target neural network model is relatively fixed, and the running order of each computing core in the target neural network model is relatively fixed, so the computing core graph of the target neural network model can be pre-constructed.
  • the above-mentioned computing core graph can be constructed through the API (Application Programming Interface) of CUDA (Compute Unified Device Architecture); a computing core graph constructed in this way can be called a CUDA Graph.
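  • For illustration only (not taken from the disclosure), the following is a minimal sketch of building and replaying such a computing core graph with the CUDA graph API via stream capture; op_matmul and op_add are hypothetical placeholder kernels, and the cudaGraphInstantiate call uses the CUDA 11.x signature.
```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical placeholder computing cores; the disclosure does not specify
// the actual kernels of the target neural network model.
__global__ void op_matmul(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i];   // simplified element-wise stand-in
}

__global__ void op_add(const float* a, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] += a[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *out;
    cudaMalloc((void**)&a, n * sizeof(float));
    cudaMalloc((void**)&b, n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the whole sequence of kernel launches once into a CUDA graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    op_matmul<<<(n + 255) / 256, 256, 0, stream>>>(a, b, out, n);
    op_add<<<(n + 255) / 256, 256, 0, stream>>>(a, out, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate the captured graph into an executable graph (CUDA 11.x signature).
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // Each inference request now replays every computing core with one launch
    // call instead of one CPU-to-GPU interaction per computing core.
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(a); cudaFree(b); cudaFree(out);
    printf("graph replayed\n");
    return 0;
}
```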
  • FIG. 2 is a schematic structural diagram of a computing core graph provided by an embodiment of the present disclosure.
  • the computing core diagram of the target neural network model shown in Figure 2 contains 4 nodes in total, nodes 1-4, which correspond to computing cores 1-4 respectively.
  • the arrows between the nodes indicate the running order of the computing cores corresponding to the connected nodes: the computing core corresponding to the node at the tail of an arrow is executed first, and then the computing core corresponding to the node at the head of the arrow is executed.
  • Operation cores 1-4 are used for matrix multiplication calculations, matrix addition calculations, matrix multiplication calculations, and convolution processing respectively.
  • the computing core graph shown in Figure 2 indicates that the target neural network model first performs a matrix multiplication calculation on the input data, then performs a matrix addition calculation and a matrix multiplication calculation respectively, and finally performs convolution on the results of the matrix addition calculation and the matrix multiplication calculation.
  • S102 After receiving the data to be processed sent by the CPU, run each computing core in sequence according to the running order represented by the computing core diagram, process the data to be processed, and complete the inference process of the target neural network model.
  • after the above-mentioned GPU receives the above-mentioned data to be processed, it can use the pre-allocated storage space to complete the model inference process.
  • the address of the above-mentioned pre-allocated storage space is a fixed address corresponding to the above-mentioned target neural network model
  • the size of the pre-allocated storage space is a preset size corresponding to the above-mentioned target neural network model.
  • the size of the above-mentioned pre-allocated storage space may be set by the user based on experience, or may not be less than the sum of the third data amount, the fourth data amount, and the maximum required storage space.
  • the above-mentioned third data amount is the data amount of the above-mentioned target neural network model; specifically, it can be the data amount of the model parameters of the target neural network model. The above-mentioned fourth data amount is the sum of the data amounts of the operation results obtained after data processing based on each computing core, and the above-mentioned maximum required storage space is greater than or equal to the largest storage space required by the GPU for data processing based on any one of the computing cores.
  • the data amount of the operation result obtained by data processing based on each computing core in the computing core graph, and the amount of temporary storage space required by the GPU for data processing based on each computing core, can be estimated manually or by a pre-written estimation program.
  • the above-mentioned pre-allocated storage space needs to be able to accommodate the operation results of each computing core, that is, the size of the pre-allocated storage space needs to be greater than or equal to the sum of the data amounts of the operation results of each computing core, that is, not less than the above-mentioned fourth data amount.
  • the temporary storage space corresponding to each computing core is used to store the calculation intermediate values generated during the data processing process of the GPU based on the computing core.
  • after the GPU completes data processing based on one computing core, the calculation intermediate values stored in the temporary storage space are released, so the GPU can reuse the same temporary storage space during data processing based on different computing cores.
  • the above temporary storage space needs to be able to accommodate the calculation intermediate values generated during data processing based on any of the computing cores; a temporary storage space that meets this requirement can be called the maximum required storage space, and the size of the above pre-allocated storage space needs to be greater than or equal to the maximum required storage space.
  • the above-mentioned pre-allocated storage space also needs to be able to store the target neural network model.
  • the size of the above-mentioned pre-allocated storage space is greater than or equal to the sum of the third data amount, the fourth data amount and the maximum required storage space, so that the GPU can, based on the pre-allocated storage space, normally store the target neural network model, the calculation intermediate values generated when each computing core performs data processing, and the operation results obtained after data processing based on each computing core, and thus complete the model inference process normally.
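  • For illustration, a minimal sketch of sizing and allocating the pre-allocated storage space as described above: model parameters (third data amount) plus the sum of per-core operation results (fourth data amount) plus the largest per-core temporary workspace (maximum required storage space), allocated once and sub-allocated at fixed offsets. The byte counts are made-up placeholders.
```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <numeric>
#include <vector>

// Illustrative sizing of the pre-allocated storage space:
//   third data amount   = bytes of the model parameters
//   fourth data amount  = sum of the operation-result bytes of all computing cores
//   max required space  = largest temporary workspace any single computing core needs
size_t preallocated_bytes(size_t model_bytes,
                          const std::vector<size_t>& output_bytes,
                          const std::vector<size_t>& workspace_bytes) {
    size_t outputs = std::accumulate(output_bytes.begin(), output_bytes.end(), size_t{0});
    size_t max_ws  = *std::max_element(workspace_bytes.begin(), workspace_bytes.end());
    return model_bytes + outputs + max_ws;
}

int main() {
    // Made-up per-core estimates (e.g. produced by a pre-written estimation program).
    std::vector<size_t> output_bytes    = {4 << 20, 2 << 20, 2 << 20, 8 << 20};
    std::vector<size_t> workspace_bytes = {1 << 20, 3 << 20, 1 << 20, 2 << 20};
    size_t total = preallocated_bytes(64 << 20, output_bytes, workspace_bytes);

    // One fixed allocation reused for every inference; the temporary-workspace
    // region can be shared by all computing cores because they run in sequence.
    void* arena = nullptr;
    cudaMalloc(&arena, total);
    // ... sub-allocate model parameters, per-core operation results, and the
    //     shared workspace at fixed offsets inside `arena` ...
    cudaFree(arena);
    return 0;
}
```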
  • the GPU obtains in advance the computing core diagram corresponding to the target neural network model sent by the CPU.
  • the above computing core graph contains each computing core included in the target neural network model and can represent the running order of each computing core in the target neural network model.
  • after the GPU receives the data to be processed sent by the CPU, it can call the above computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process.
  • compared with a solution in which the CPU sends each computing core to the GPU one by one so that the GPU can run it, in the embodiments of the present disclosure the GPU can directly perform model inference based on the computing core graph, and the computing cores do not need to be passed between the CPU and the GPU during inference.
  • the number of interactions between the CPU and the GPU is therefore smaller, which can reduce the impact of the interaction between the CPU and the GPU on GPU model inference, thereby improving the efficiency of GPU model inference.
  • step S101 can be implemented through the following step S101A, and the above-mentioned step S102 can be implemented through steps S102A-S102B.
  • the calculation scale corresponding to each computing core graph indicates the amount of data that the computing core included in the computing core graph can process.
  • the amount of data that the GPU can process based on the computing core is a fixed data amount, and the above data amount can be called the computing scale corresponding to the computing core.
  • Each computing core can be set to support mask operations. When a computing core processes data whose amount is smaller than its own calculation scale, the data to be processed can be expanded to the corresponding calculation scale before processing, so that the GPU can use the computing core to process data whose data volume is less than or equal to the calculation scale corresponding to the computing core.
  • for example, if a computing core corresponds to a calculation scale of a 50×50 matrix and the GPU needs to process a 30×30 matrix based on this computing core, elements can be added to expand the matrix to 50×50 before processing, and the added elements are removed after the processing result is obtained.
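  • For illustration, a minimal sketch of the 30×30-to-50×50 padding idea using the CUDA runtime: the payload is copied into a zero-filled buffer of the computing core's fixed calculation scale, and only the valid region is read back afterwards. Buffer names are illustrative.
```cuda
#include <cuda_runtime.h>
#include <vector>

// Pad a 30x30 host matrix into a zero-filled 50x50 device buffer so that a
// computing core built for a fixed 50x50 calculation scale can process it;
// afterwards only the valid 30x30 block of the result is read back.
int main() {
    const int small = 30, scale = 50;
    std::vector<float> host(small * small, 1.0f);

    float* device = nullptr;
    cudaMalloc((void**)&device, scale * scale * sizeof(float));
    cudaMemset(device, 0, scale * scale * sizeof(float));      // padding elements

    // Copy the 30x30 payload row by row into the 50x50 layout.
    cudaMemcpy2D(device, scale * sizeof(float),                 // dst, dst pitch
                 host.data(), small * sizeof(float),            // src, src pitch
                 small * sizeof(float), small,                  // row bytes, rows
                 cudaMemcpyHostToDevice);

    // ... launch the fixed-scale computing core on `device` here ...

    // Read back only the valid region and discard the added (padding) elements.
    std::vector<float> result(small * small);
    cudaMemcpy2D(result.data(), small * sizeof(float),
                 device, scale * sizeof(float),
                 small * sizeof(float), small,
                 cudaMemcpyDeviceToHost);

    cudaFree(device);
    return 0;
}
```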
  • the calculation scales corresponding to the computing cores included in the above computing core graph can be the same or different. However, in order to enable the computing cores included in the computing core graph to process data uniformly, computing cores with the same calculation scale can be selected, and this common calculation scale can be used as the calculation scale corresponding to the computing core graph.
  • the calculation scale corresponding to the above-mentioned computing core graph can be set based on the data volume of historical data in the application scenario of the above-mentioned target model.
  • the calculation scale corresponding to the computing core graph can be set to be greater than or equal to the maximum data volume of each historical data in the application scenario, so that the GPU can theoretically process all the data to be processed in the above application scenario based on the above computing core graph.
  • the maximum value of the data volume of the historical data may be multiplied by a first preset ratio to serve as the calculation scale corresponding to the computing core graph.
  • the above-mentioned first preset ratio may be 70%, 80%, etc., so that the GPU can process most of the data to be processed in the above application scenario based on the above computing core graph.
  • S101A Receive multiple computing core maps corresponding to the target neural network model sent by the CPU.
  • the computing cores recorded in different computing core maps are all computing cores included in the target model.
  • the structures of different computing core maps are the same, and the only difference is that the calculation scales corresponding to different computing core maps are different.
  • the GPU can store the multiple received computing core graphs in the storage space, so that the stored computing core graphs can be directly called for model inference later.
  • the pre-allocated storage space can be a reusable storage space; that is, no matter which computing core graph the CPU chooses to send to the GPU, the GPU can use the same pre-allocated storage space in the process of model inference based on the received computing core graph. The larger the calculation scale corresponding to the computing core graph used by the GPU, the larger the amount of data processed during model inference and the greater the storage space required. Therefore, if the pre-allocated storage space can meet the data storage requirements of the computing core graph with the largest calculation scale, it can be reused for the other computing core graphs.
  • the size of the pre-allocated storage space can be determined based on the corresponding computing core graph with the largest calculation scale.
  • for a specific method of determining the size of the pre-allocated storage space, please refer to the previous description in step S102, which will not be repeated here.
  • S102A Based on the first data amount of the data to be processed, select a first computing core graph from each computing core graph.
  • the above-mentioned first computing core graph is a computing core graph whose corresponding calculation scale is not less than the above-mentioned first data amount and is closest to the above-mentioned first data amount.
  • the GPU can perform model inference on data whose data volume is less than or equal to the calculation scale corresponding to the operation core graph based on the operation core graph.
  • the GPU can expand the data volume of the data to be processed to the calculation scale corresponding to the computing core graph and then proceed with processing. Therefore, the larger the calculation scale corresponding to the computing core graph, the greater the amount of data actually processed during model inference based on the computing core graph, and the more data processing resources are consumed.
  • multiple computing core graphs corresponding to different calculation scales are pre-constructed.
  • the CPU sends each computing core graph to the GPU in advance, and the GPU can complete the model inference process based on any one of the computing core graphs.
  • the GPU can select, from the multiple computing core graphs, the computing core graph whose corresponding calculation scale is greater than or equal to the first data amount and is closest to the first data amount, so that the GPU can process the data to be processed based on the selected computing core graph while the amount of data that actually needs to be processed is minimal.
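  • For illustration (these helper names are not part of the disclosure), a minimal sketch of the selection rule in step S102A: among graphs pre-instantiated at different calculation scales, pick the one whose scale is not less than and closest to the first data amount. The ScaledGraph type and select_first_graph function are hypothetical.
```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// One pre-instantiated computing core graph per calculation scale (hypothetical type).
struct ScaledGraph {
    size_t scale;            // largest first data amount this graph can handle
    cudaGraphExec_t exec;    // executable graph instantiated in advance
};

// Pick the graph whose calculation scale is >= the first data amount and closest
// to it; `graphs` is assumed sorted by ascending scale. Returns nullptr when the
// data amount exceeds every scale (the CPU-side fallback case described later).
const ScaledGraph* select_first_graph(const std::vector<ScaledGraph>& graphs,
                                      size_t first_data_amount) {
    for (const ScaledGraph& g : graphs) {
        if (g.scale >= first_data_amount) return &g;
    }
    return nullptr;
}

int main() {
    // Placeholder executables; scales follow the 48M/64M/80M example in the text.
    std::vector<ScaledGraph> graphs = {
        {48u << 20, nullptr}, {64u << 20, nullptr}, {80u << 20, nullptr}};
    const ScaledGraph* g = select_first_graph(graphs, 50u << 20);  // 50M of pending data
    // g now refers to the 64M graph; it would be run with cudaGraphLaunch(g->exec, stream).
    (void)g;
    return 0;
}
```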
  • the calculation scale corresponding to each computing core graph can be any value.
  • the calculation scale corresponding to each computing core graph can be set based on the maximum data volume of each historical data in the application scenario of the target model.
  • the maximum value of the data volume of the historical data in the application scenario of the target model can be determined, this maximum value can be multiplied by several different second preset ratios, and the obtained results can be used as the calculation scales corresponding to the respective computing core graphs.
  • for example, if the maximum data volume of the historical data in the application scenario of the target model is 100M, the calculation scales corresponding to the computing core graphs can be set to 80M, 64M, and 48M.
  • S102B Run each computing core in sequence according to the running order represented by the first computing core diagram to process the data to be processed.
  • step S102B is similar to the aforementioned step S102, and will not be described again here.
  • in this way, multiple computing core graphs are pre-constructed, and the GPU selects the computing core graph whose corresponding calculation scale is greater than or equal to the first data amount and is closest to the first data amount to perform model inference, so that the GPU can process the data to be processed based on the selected computing core graph while the amount of data processed during the processing is minimal, thereby saving the data processing resources of the GPU.
  • FIG. 3B is a schematic diagram of the first computing core map selection process provided by an embodiment of the present disclosure.
  • It includes an input module and computing core graph 1, computing core graph 2, ..., computing core graph n, a total of n computing core graphs.
  • the arrows between the input module and each computing core graph indicate that the GPU can, based on the first data amount of the input data to be processed, select one graph from computing core graph 1, computing core graph 2, ..., computing core graph n, and use the selected computing core graph for model inference.
  • the arrows between each computing core graph and the pre-allocated storage space indicate that the GPU shares the same pre-allocated storage space during model inference based on each computing core graph.
  • FIG 4 is a schematic flow chart of the third model reasoning method provided by the embodiment of the present disclosure.
  • the above step S102 can be implemented through the following step S102C, and the above step S103 can be implemented through the following step S103A.
  • S102C: after receiving multiple pieces of data to be processed sent by the above-mentioned CPU, merge the multiple pieces of data to be processed into merged data, call the above-mentioned computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, and process the merged data to complete the inference process of the above target neural network model.
  • the plurality of data to be processed are all data to be processed by the target neural network model. Therefore, the same computing core graph based on the target neural network model can process multiple data to be processed.
  • when the CPU receives multiple data processing requests, if multiple of these requests request data processing through the target neural network model, the data to be processed contained in those data processing requests can be sent to the GPU together, so that the GPU receives the above multiple pieces of data to be processed.
  • the GPU can uniformly expand the data volume of each data to be processed to the maximum data volume of each data to be processed, and then merge the data to be processed into one piece of merged data.
  • in this way, the GPU only needs to call the computing core graph once and process the merged data based on the computing core graph, which is equivalent to completing the processing of the multiple pieces of data to be processed.
  • the process of the GPU processing the merged data is similar to the content described in the previous step S102, and the way the GPU expands the data volume of the data to be processed is similar to the content described in the previous step S102A, which will not be described in detail in this embodiment.
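  • For illustration, a minimal sketch of the merging step under the assumption that each piece of pending data is a flat float array: every piece is padded to the largest per-piece size and packed into one device buffer so the computing core graph only needs to be launched once. All buffer names and sizes are illustrative.
```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Merge several pieces of pending data into one buffer: every piece is padded to
// the largest per-piece data amount, so the merged buffer has a regular layout and
// the computing core graph only needs to be called once for all of them.
int main() {
    std::vector<std::vector<float>> pending = {
        std::vector<float>(300, 1.0f),   // data to be processed 1
        std::vector<float>(500, 2.0f),   // data to be processed 2
        std::vector<float>(420, 3.0f),   // data to be processed m
    };

    size_t max_elems = 0;
    for (const auto& d : pending) max_elems = std::max(max_elems, d.size());

    // Merged size = maximum per-piece data amount * number of pieces.
    size_t merged_elems = max_elems * pending.size();

    float* merged = nullptr;
    cudaMalloc((void**)&merged, merged_elems * sizeof(float));
    cudaMemset(merged, 0, merged_elems * sizeof(float));        // padding

    for (size_t i = 0; i < pending.size(); ++i) {
        cudaMemcpy(merged + i * max_elems, pending[i].data(),
                   pending[i].size() * sizeof(float), cudaMemcpyHostToDevice);
    }

    // ... call the computing core graph once on `merged`, then copy each result
    //     slice back and drop the padded tail beyond pending[i].size() elements ...

    cudaFree(merged);
    return 0;
}
```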
  • S103A Extract the model inference results corresponding to each data to be processed from the model inference results of the merged data, and feed back the model inference results corresponding to each data to be processed to the CPU respectively.
  • the processing results corresponding to each piece of data to be processed can be extracted from the model inference result of the merged data, and the processing results corresponding to the expanded (padded) data can then be removed, so as to obtain the model inference result corresponding to each piece of data to be processed.
  • in this way, the GPU can merge the pieces of data to be processed into one merged data and then call the computing core graph to process the merged data, which is equivalent to processing each piece of data to be processed in a unified manner.
  • the GPU only needs to call the computing core graph once to complete the processing of multiple pieces of data to be processed, so the computing core graph is called fewer times, which can further improve the model inference efficiency of the GPU.
  • FIG 5A is a schematic flow chart of the fourth model reasoning method provided by an embodiment of the present disclosure.
  • the above step S101 can be implemented through the following steps S101B, and the above step S102C can be achieved through the following steps S102C1-S102C2.
  • S101B Receive multiple computing core maps corresponding to the target neural network model sent by the CPU.
  • the computing scale corresponding to each computing core graph represents the amount of data that the computing core included in the computing core graph can process.
  • step S101B is similar to the above-mentioned step S101A, which will not be described again in this embodiment.
  • S102C1 Based on the second data amount, select a second computing core graph from each computing core graph.
  • the above-mentioned second data amount is: the product of the maximum data amount of each data to be processed and the number of data to be processed
  • the above-mentioned second computing core graph is the computing core graph whose corresponding calculation scale is greater than or equal to and closest to the above-mentioned second data amount.
  • the data volume of each piece of data to be processed is less than or equal to the maximum data volume among the pieces of data to be processed, so after merging the pieces of data to be processed to obtain the merged data, the data volume of the merged data will not be greater than the product of the maximum data volume and the number of pieces of data to be processed.
  • the calculation scale corresponding to the selected second computing core graph is greater than or equal to the second data amount, so the GPU can process the merged data based on the selected second computing core graph; and because the calculation scale corresponding to the selected second computing core graph is closest to the second data amount, the GPU consumes the least computing resources overall when processing the merged data based on the selected second computing core graph.
  • S102C2 Call the above-mentioned second computing core graph, run each computing core in sequence according to the running order represented by the above-mentioned second computing core graph, and process the above-mentioned merged data.
  • the method of processing the merged data is similar to the content described in the aforementioned step S102, and will not be described again in this embodiment.
  • in this way, multiple computing core graphs are pre-constructed, and the GPU selects the second computing core graph whose corresponding calculation scale is greater than or equal to the second data amount and is closest to the second data amount, so that the GPU can process the merged data based on the second computing core graph while the amount of data processed during the processing is minimal, thereby saving the data processing resources of the GPU.
  • FIG. 5B is a schematic diagram of the second computing core map selection process provided by an embodiment of the present disclosure.
  • Figure 5B also contains data to be processed 1, data to be processed 2, ..., data to be processed m, a total of m pieces of data to be processed, all of which are to be processed through the target neural network model.
  • there are arrows between each piece of data to be processed and the input, indicating that the GPU can process the multiple pieces of data to be processed in a unified manner.
  • referring to FIG. 6A, which is a schematic flow chart of a fifth model inference method provided by an embodiment of the present disclosure, the method is applied to a CPU.
  • the above method includes the following steps S601-S603.
  • S601 Send the pre-built computing kernel graph to the GPU.
  • Each node in the above-mentioned computing core graph corresponds to each computing core included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing core corresponding to the node connected by the edge.
  • S602: when there is a target processing request, send the data to be processed to the above-mentioned GPU, so that the above-mentioned GPU sequentially runs each computing core according to the running order represented by the above-mentioned computing core graph, processes the above-mentioned data to be processed, completes the inference process of the above target neural network model, and feeds back the model inference result to the CPU.
  • the above target processing request is a request to process the data to be processed using the above target neural network model.
  • the above-mentioned steps S601-S603 are similar to the above-mentioned steps S101-S103, and the difference is only that the execution subject is different, which will not be described again here.
  • the CPU sends the computing core graph to the GPU, and the GPU can run each computing core in sequence according to the above-mentioned computing core graph to process the data to be processed, thereby completing the model inference process of the target neural network model.
  • the CPU only needs to send the computing core map to the GPU once, so that the GPU can subsequently complete the model inference process based on the received computing core map.
  • the number of interactions between the CPU and the GPU in this embodiment is smaller, which can reduce the impact of the interaction between the CPU and the GPU on GPU model inference, thereby improving the efficiency of GPU model inference.
  • the embodiment of the present disclosure can implement the above step S601 through the following step A.
  • Step A: send each pre-constructed computing core graph to the GPU.
  • the calculation scale corresponding to each computing core map indicates: the amount of data that the computing cores included in the computing core map can process.
  • after the GPU receives the multiple computing core graphs sent by the CPU, it can process the data to be processed according to the aforementioned steps S102A-S102B, which will not be described again here.
  • in this solution, multiple computing core graphs are pre-constructed, and the GPU can, based on the data amount of the data that actually needs to be processed, select from the computing core graphs the one whose calculation scale is not less than and closest to that data amount for data processing, thereby saving the data processing resources of the GPU.
  • FIG. 6B a schematic flow chart of a sixth model inference method provided by an embodiment of the present disclosure is shown. Compared with the aforementioned embodiment shown in Figure 6A, the following steps S604-S605 are included after the above-mentioned step S601.
  • the calculation scale corresponding to the above-mentioned computing core graph represents the amount of data that can be processed by the computing cores included in the computing core graph.
  • the above-mentioned preset order is the execution order of each target computing core specified by the above-mentioned target neural network model.
  • the amount of data that the above-mentioned target computing cores can process is not less than the data amount of the data to be processed.
  • if the calculation scale corresponding to the above computing core graph is not less than the data amount of the data to be processed, the GPU can process the data to be processed based on the above computing core graph; otherwise, the GPU cannot process the data to be processed based on the above computing core graph.
  • in the case where the GPU cannot process the data to be processed based on the computing core graph, the GPU can send a request to the CPU to indicate that it cannot process the data to be processed based on the computing core graph, so as to request the CPU to assist in completing the data processing process in another way.
  • the CPU can determine that the calculation scale corresponding to the pre-built computing core graph is smaller than the data volume of the above-mentioned data to be processed, and then steps S604-S605 can be executed.
  • each target computing core corresponds to a different data processing link in the target neural network model.
  • the GPU sequentially runs each target computing core to complete each data processing link in the target neural network model, thereby realizing the model inference process of the target neural network model.
  • in one embodiment of the present disclosure, the preset order in which the CPU sends the target computing cores to the GPU is the same as the running order of the computing cores represented by the computing core graph.
  • the data processing links corresponding to the target computing cores are the same as those corresponding to the computing cores included in the aforementioned computing core graph; the only difference is the amount of data that can be processed, and the amount of data that the target computing cores can process is larger.
  • the GPU can run the target computing core to complete data processing.
  • the CPU sends each target computing core to the GPU in the preset order, and the GPU can run each target computing core in sequence in the order in which they are received, to complete the inference process of the target neural network model.
  • in this way, the calculation scale corresponding to the computing core graph constructed by the embodiment of the present disclosure does not need to be too large.
  • even if the calculation scale corresponding to the computing core graph is smaller than the amount of data to be processed, so that the GPU cannot complete the inference of the target neural network model based on the computing core graph, this embodiment is not limited to realizing model inference only based on the computing core graph; instead, the CPU can send each target computing core to the GPU in sequence to ensure that the model inference process can still be carried out normally.
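  • For illustration, a minimal sketch of the fallback logic described above, assuming one pre-instantiated graph with a known calculation scale and two hypothetical target computing cores: when the data amount fits the graph's scale the graph is replayed with a single launch, otherwise the cores are launched one by one in the preset order.
```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical stand-ins for two target computing cores that can be launched
// individually and handle arbitrarily large inputs.
__global__ void target_core_1(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void target_core_2(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Use the pre-built graph when the data fits its calculation scale; otherwise
// fall back to submitting the target computing cores one by one in the preset order.
void run_inference(cudaGraphExec_t graphExec, size_t graph_scale,
                   float* data, size_t n, cudaStream_t stream) {
    if (graphExec != nullptr && n <= graph_scale) {
        cudaGraphLaunch(graphExec, stream);                      // one launch replays all cores
    } else {
        const int block = 256;
        const int grid  = (int)((n + block - 1) / block);
        target_core_1<<<grid, block, 0, stream>>>(data, n);      // preset order: core 1 first
        target_core_2<<<grid, block, 0, stream>>>(data, n);      // then core 2
    }
    cudaStreamSynchronize(stream);
}

int main() {
    const size_t n = 1 << 22;                                    // larger than the scale below
    float* data = nullptr;
    cudaMalloc((void**)&data, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    run_inference(/*graphExec=*/nullptr, /*graph_scale=*/1 << 20, data, n, stream);

    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```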
  • embodiments of the present disclosure also provide a model reasoning device.
  • referring to FIG. 7, which is a schematic structural diagram of a first model inference device provided by an embodiment of the present disclosure, the device is applied to a GPU.
  • the above device includes the following modules 701-703.
  • the computing core graph receiving module 701 is used to receive the computing core graph corresponding to the target neural network model sent by the CPU, wherein each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by the edge;
  • the model inference module 702 is configured to, after receiving the data to be processed sent by the CPU, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the inference process of the target neural network model;
  • Result feedback module 703 is used to feed back model inference results to the CPU.
  • the GPU obtains in advance the computing core diagram corresponding to the target neural network model sent by the CPU.
  • the above computing core graph contains each computing core included in the target neural network model and can represent the running order of each computing core in the target neural network model.
  • after the GPU receives the data to be processed sent by the CPU, it can call the above computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process.
  • compared with a solution in which the CPU sends each computing core to the GPU one by one so that the GPU can run it, in the embodiments of the present disclosure the GPU can directly perform model inference based on the computing core graph, and the computing cores do not need to be passed between the CPU and the GPU during inference.
  • the number of interactions between the CPU and the GPU is therefore smaller, which can reduce the impact of the interaction between the CPU and the GPU on GPU model inference, thereby improving the efficiency of GPU model inference.
  • the above-mentioned computing core graph receiving module 701 is specifically used for:
  • the model reasoning module 702 is specifically used for:
  • based on the first data amount of the data to be processed, a first computing core graph is selected from each computing core graph, wherein the first computing core graph is the computing core graph whose corresponding calculation scale is not less than the above-mentioned first data amount and is closest to the first data amount;
  • Run each computing core sequentially according to the running order represented by the first computing core diagram, process the data to be processed, and complete the reasoning process of the target neural network model.
  • in this way, multiple computing core graphs are pre-constructed, and the GPU selects the computing core graph whose corresponding calculation scale is greater than or equal to the first data amount and is closest to the first data amount to perform model inference, so that the GPU can process the data to be processed based on the selected computing core graph while the amount of data processed during the processing is minimal, thereby saving the data processing resources of the GPU.
  • FIG 8 is a schematic structural diagram of a second model inference device provided by an embodiment of the present disclosure.
  • the above model inference module 702 includes:
  • the data processing sub-module 702A is configured to, after receiving multiple pieces of data to be processed sent by the CPU, merge the multiple pieces of data to be processed into merged data, call the computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, and process the merged data to complete the inference process of the target neural network model, wherein the plurality of data to be processed are all data to be processed by the target neural network model;
  • the result feedback module 703 includes:
  • the result feedback sub-module 703A is configured to extract the model inference results corresponding to each data to be processed from the model inference results of the merged data, and feed back the model inference results corresponding to each data to be processed to the CPU respectively.
  • in this way, the GPU can merge the pieces of data to be processed into one merged data and then call the computing core graph to process the merged data, which is equivalent to processing each piece of data to be processed in a unified manner.
  • the GPU only needs to call the computing core graph once to complete the processing of multiple pieces of data to be processed, so the computing core graph is called fewer times, which can further improve the model inference efficiency of the GPU.
  • the computing core graph receiving module 701 is specifically used to:
  • the data processing sub-module 702A is specifically used for:
  • after receiving the plurality of data to be processed sent by the CPU, the plurality of data to be processed are merged into merged data, and based on the second data amount, a second computing core graph is selected from each computing core graph, wherein the second data amount is the product of the maximum data amount of the pieces of data to be processed and the number of pieces of data to be processed, and the second computing core graph is the computing core graph whose corresponding calculation scale is greater than or equal to the second data amount and is closest to the second data amount;
  • the second computing core graph is called, each computing core is sequentially run according to the running order represented by the second computing core graph, the merged data is processed, and the reasoning process of the target neural network model is completed.
  • in this way, multiple computing core graphs are pre-constructed, and the GPU selects the second computing core graph whose corresponding calculation scale is greater than or equal to the second data amount and is closest to the second data amount, so that the GPU can process the merged data based on the second computing core graph while the amount of data processed during the processing is minimal, thereby saving the data processing resources of the GPU.
  • the size of the pre-allocated storage space required by the GPU to complete the inference of the target neural network model is not less than the sum of the third data amount, the fourth data amount, and the maximum required storage space;
  • the third data amount is the data amount of the target neural network model
  • the fourth data amount is the sum of data amounts of operation results obtained after data processing based on each operation core
  • the maximum required storage space is the maximum storage space required for data processing based on each computing core.
  • the size of the above-mentioned pre-allocated storage space is greater than or equal to the sum of the third data amount, the fourth data amount and the maximum required storage space, so that the GPU can normally complete model inference based on the above-mentioned pre-allocated storage space. process.
  • embodiments of the present disclosure also provide a model reasoning device applied to the CPU.
  • FIG 9 is a schematic structural diagram of a third model reasoning device provided by an embodiment of the present disclosure.
  • the above device includes the following modules 901-903.
  • the computing core graph sending module 901 is used to send the pre-constructed computing core graph to the graphics processor GPU, wherein each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
  • the data sending module 902 is used to send the data to be processed to the GPU when there is a target processing request, so that the GPU runs each computing core in sequence according to the running order represented by the computing core graph, processes the data to be processed based on the preset storage space in the video memory, completes the inference process of the target neural network model, and feeds back the model inference result to the CPU, wherein the target processing request is a request to use the target neural network model to process the data to be processed;
  • the first result receiving module 903 is used to receive the model inference result fed back by the GPU.
  • the CPU sends the computing core graph to the GPU, and the GPU can run each computing core in sequence according to the above-mentioned computing core graph to process the data to be processed, thereby completing the model inference process of the target neural network model.
  • the CPU only needs to send the computing core map to the GPU once, so that the GPU can subsequently complete the model inference process based on the received computing core map.
  • the number of interactions between the CPU and the GPU in this embodiment is smaller, which can reduce the impact of the interaction between the CPU and the GPU on GPU model inference, thereby improving the efficiency of GPU model inference.
  • the computing core map sending module 901 is specifically used to:
  • the calculation scale corresponding to each computing core graph represents the amount of data that the computing cores contained in the computing core graph can process.
  • in this solution, multiple computing core graphs are pre-constructed, and the GPU can, based on the data amount of the data that actually needs to be processed, select from the computing core graphs the one whose calculation scale is not less than and closest to that data amount for data processing, thereby saving the data processing resources of the GPU.
  • the above device further includes:
  • the computing core sending module is used to, when it is determined that the calculation scale corresponding to the pre-built computing core graph is smaller than the data volume of the data to be processed, send the data to be processed to the GPU and send each target computing core to the GPU in a preset order, so that the GPU sequentially runs each target computing core in the order in which the target computing cores are received, processes the data to be processed, completes the inference process of the target neural network model, and feeds back the model inference result to the CPU;
  • the calculation scale corresponding to the computing core graph represents the amount of data that the computing cores included in the computing core graph can process, the preset order is the execution order of each target computing core specified by the target neural network model, and the amount of data that the target computing cores can process is not less than the data amount of the data to be processed;
  • the second result receiving module is used to receive the model inference result fed back by the GPU.
  • in this way, the calculation scale corresponding to the computing core graph constructed by the embodiment of the present disclosure does not need to be too large.
  • even if the calculation scale corresponding to the computing core graph is smaller than the amount of data to be processed, so that the GPU cannot complete the inference of the target neural network model based on the computing core graph, this embodiment is not limited to realizing model inference only based on the computing core graph; instead, the CPU can send each target computing core to the GPU in sequence to ensure that the model inference process can still be carried out normally.
  • the collection, storage, use, processing, transmission, provision and disclosure of user personal information are in compliance with relevant laws and regulations and do not violate public order and good customs.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • An embodiment of the present disclosure provides an electronic device, including:
  • the memory stores instructions that can be executed by the at least one CPU, and the instructions are executed by the at least one CPU to enable the at least one CPU to execute the steps of any one of the above model inference methods applied to a GPU.
  • Embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute a model inference method applied to a CPU and a model inference method applied to a GPU.
  • Embodiments of the present disclosure provide a computer program product, including a computer program that, when executed by a processor, implements a model inference method applied to a CPU and a model inference method applied to a GPU.
  • FIG. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 1000 includes a GPU 1001, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or loaded from a storage unit 1008 into a random access memory (RAM) 1003.
  • in the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored.
  • GPU 1001, ROM 1002 and RAM 1003 are connected to each other through bus 1004.
  • An input/output (I/O) interface 1005 is also connected to bus 1004.
  • multiple components in the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard, mouse, etc.; an output unit 1007, such as various types of displays, speakers, etc.; a storage unit 1008, such as a magnetic disk, optical disk, etc.; and a communication unit 1009, such as a network card, modem, wireless communication transceiver, etc.
  • the communication unit 1009 allows the device 1000 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • GPU 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. GPU 1001 performs various methods and processes described above, such as model inference methods.
  • in some embodiments, the model inference method can be implemented as a computer software program, which is tangibly included in a machine-readable medium, such as the storage unit 1008.
  • part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009.
  • GPU 1001 may be configured to perform model inference methods in any other suitable manner (eg, via firmware).
  • FIG 11 shows a schematic block diagram of an example electronic device 1100 that may be used to implement another embodiment of the present disclosure.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 1100 includes a CPU 1101, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a random access memory (RAM) 1103. In the RAM 1103, various programs and data required for the operation of the device 1100 can also be stored.
  • CPU 1101, ROM 1102 and RAM 1103 are connected to each other through bus 1104.
  • An input/output (I/O) interface 1105 is also connected to bus 1104.
  • Multiple components in the device 1100 are connected to the I/O interface 1105, including: an input unit 1106, such as a keyboard, a mouse, etc.; an output unit 1107, such as various types of displays, speakers, etc.; a storage unit 1108, such as a magnetic disk, an optical disk, etc.; and a communication unit 1109, such as a network card, a modem, a wireless communication transceiver, etc.
  • the communication unit 1109 allows the device 1100 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • CPU 1101 may be a variety of general and/or special purpose processing components having processing and computing capabilities.
  • the CPU 1101 executes various methods and processes described above, such as the model inference method.
  • For example, in some embodiments, the model inference method may be implemented as a computer software program, which is tangibly included in a machine-readable medium, such as the storage unit 1108.
  • part or all of the computer program may be loaded and/or installed onto device 1100 via ROM 1102 and/or communication unit 1109.
  • Alternatively, the CPU 1101 may be configured to perform the model inference method in any other suitable manner (e.g., by means of firmware).
  • Various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that the program codes, when executed by the processor or controller, cause the functions and operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing.
  • More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, voice input, or tactile input.
  • the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.
  • Computer systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • the server can be a cloud server, a distributed system server, or a server combined with a blockchain.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides model inference methods and apparatus, devices and a storage medium, and relates to the technical field of data processing, in particular to the technical field of artificial intelligence. A method is applied to a GPU, and has a specific implementation solution as follows: receiving an operation kernel graph corresponding to a target neural network model sent by a CPU, nodes in the operation kernel graph respectively corresponding to operation kernels contained in the target neural network model, and the direction of each edge being used for representing a running sequence of operation kernels corresponding to nodes connected to said edge; after data to be processed sent by the CPU is received, according to the running sequence represented by the operation kernel graph, successively running the operation kernels to process said data so as to complete an inference process of the target neural network model; and feeding a model inference result back to the CPU. When the solution provided by the embodiments of the present disclosure is applied to model inference, the model inference efficiency of the GPU can be improved.

Description

Model inference method, apparatus, device, and storage medium
This disclosure claims priority to the Chinese patent application No. 202210450393.0, entitled "Model Inference Method, Apparatus, Device and Storage Medium", filed with the China Patent Office on April 26, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of data processing technology, especially to the field of artificial intelligence technology, and further relates to a model inference method, apparatus, device, and storage medium.
Background
The model inference process of a neural network model can be composed of multiple different data processing links. Running the different computing cores (kernels) in the neural network model in sequence can complete the different data processing links, thereby realizing the model inference process.
Summary
The present disclosure provides a model inference method, apparatus, device, and storage medium.
According to one aspect of the present disclosure, a model inference method is provided, applied to a GPU, including:
receiving a computing core graph corresponding to a target neural network model sent by a CPU, where each node in the computing core graph corresponds to each computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
after receiving the data to be processed sent by the CPU, running each computing core in sequence according to the running order represented by the computing core graph, processing the data to be processed, and completing the inference process of the target neural network model;
feeding back a model inference result to the CPU.
According to another aspect of the present disclosure, a model inference method is provided, applied to a CPU, including:
sending a pre-constructed computing core graph to a GPU, where each node in the computing core graph corresponds to each computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
when there is a target processing request, sending the data to be processed to the GPU, so that the GPU runs each computing core in sequence according to the running order represented by the computing core graph, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU, where the target processing request is a request to process the data to be processed using the target neural network model;
receiving the model inference result fed back by the GPU.
According to another aspect of the present disclosure, a model inference apparatus is provided, applied to a GPU, including:
a computing core graph receiving module, configured to receive a computing core graph corresponding to a target neural network model sent by a CPU, where each node in the computing core graph corresponds to each computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
a model inference module, configured to, after receiving the data to be processed sent by the CPU, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the inference process of the target neural network model;
a result feedback module, configured to feed back a model inference result to the CPU.
According to another aspect of the present disclosure, a model inference apparatus is provided, applied to a CPU, including:
a computing core graph sending module, configured to send a pre-constructed computing core graph to a graphics processing unit (GPU), where each node in the computing core graph corresponds to each computing core included in the target model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
a data sending module, configured to, when there is a target processing request, send the data to be processed to the GPU, so that the GPU runs each computing core in sequence according to the running order represented by the computing core graph, processes the data to be processed based on a preset storage space in the video memory, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU, where the target processing request is a request to process the data to be processed using the target neural network model;
a first result receiving module, configured to receive the model inference result fed back by the GPU.
According to another aspect of the present disclosure, an electronic device is provided, including:
at least one GPU; and
a memory communicatively connected to the at least one GPU; where
the memory stores instructions executable by the at least one GPU, and the instructions are executed by the at least one GPU to enable the at least one GPU to perform any one of the model inference methods applied to the GPU.
According to another aspect of the present disclosure, an electronic device is provided, including:
at least one CPU; and
a memory communicatively connected to the at least one CPU; where
the memory stores instructions executable by the at least one CPU, and the instructions are executed by the at least one CPU to enable the at least one CPU to perform any one of the model inference methods applied to the CPU.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause the computer to execute any one of the model inference methods applied to the GPU or the model inference methods applied to the CPU.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements any one of the model inference methods applied to the GPU or the model inference methods applied to the CPU.
It can be seen from the above that, in the solution provided by the embodiments of the present disclosure, the GPU obtains in advance the computing core graph corresponding to the target neural network model sent by the CPU. The computing core graph contains each computing core included in the target neural network model and can represent the running order of each computing core in the target neural network model. After the GPU receives the data to be processed sent by the CPU, it can call the computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process. Compared with the prior-art approach in which the CPU sends each computing core to the GPU one by one, in this embodiment the CPU can send all computing cores to the GPU by sending the computing core graph once. After the GPU subsequently receives the data to be processed sent by the CPU, the GPU can directly perform model inference based on the computing core graph, and the CPU and the GPU no longer need to exchange computing cores. In this embodiment, the number of interactions between the CPU and the GPU is small, which can reduce the impact of the CPU-GPU interaction on the GPU's model inference, thereby improving the model inference efficiency of the GPU.
It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Description of the Drawings
In order to explain the embodiments of the present invention and the technical solutions of the prior art more clearly, the drawings needed in the embodiments and the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic flowchart of a first model inference method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a computing core graph provided by an embodiment of the present disclosure;
FIG. 3A is a schematic flowchart of a second model inference method provided by an embodiment of the present disclosure;
FIG. 3B is a schematic diagram of a first computing core graph selection process provided by an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of a third model inference method provided by an embodiment of the present disclosure;
FIG. 5A is a schematic flowchart of a fourth model inference method provided by an embodiment of the present disclosure;
FIG. 5B is a schematic diagram of a second computing core graph selection process provided by an embodiment of the present disclosure;
FIG. 6A is a schematic flowchart of a fifth model inference method provided by an embodiment of the present disclosure;
FIG. 6B is a schematic flowchart of a sixth model inference method provided by an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a first model inference apparatus provided by an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a second model inference apparatus provided by an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a third model inference apparatus provided by an embodiment of the present disclosure;
FIG. 10 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure;
FIG. 11 is a schematic block diagram of another electronic device provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding and should be considered as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
First, the application scenarios of the embodiments of the present disclosure are described.
The embodiments of the present disclosure are applied to scenarios in which a CPU and a GPU cooperate to perform model inference. Since a GPU processes data such as images, videos, 3D graphics, and audio relatively quickly, services such as image recognition, voice interaction, and image retrieval can be completed efficiently through the GPU. In this process, the GPU can complete the above services through the model inference process based on a neural network model, and the CPU can send the computing cores included in the neural network model to the GPU, so that the GPU runs each computing core to complete the model inference process.
The above CPU and GPU can run in the same electronic device, and the electronic device can be a computer, a mobile phone, a server, etc. An electronic device equipped with the above CPU and GPU can receive data processing requests sent by other devices, where a data processing request contains the data to be processed, so as to request the CPU and the GPU to complete the model inference process.
The model inference method provided by the embodiments of the present disclosure is described in detail below.
Referring to FIG. 1, which is a schematic flowchart of a first model inference method provided by an embodiment of the present disclosure, applied to a GPU, the method includes the following steps S101-S103.
S101: receiving a computing core graph corresponding to a target neural network model sent by a CPU.
Each node in the computing core graph corresponds to each computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge.
Specifically, after receiving the computing core graph, the GPU can store the computing core graph. Different computing cores correspond to different data processing links, and the GPU can implement different data processing links based on different computing cores. For example, the data processing links may include matrix multiplication calculation, data activation processing, data division calculation, etc.
In addition, the structure of the target neural network model is relatively fixed, that is, the execution order of each data processing link during data processing through the target neural network model is relatively fixed, so the running order of each computing core in the target neural network model is relatively fixed. Therefore, the computing core graph of the target neural network model can be constructed in advance.
The computing core graph can be constructed through the API (Application Programming Interface) of CUDA (Compute Unified Device Architecture), in which case the computing core graph can be called a CUDA-Graph (Compute Unified Device Architecture Graph).
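As an illustrative sketch only (not code from this disclosure), one way such a kernel graph could be built with the CUDA runtime API is by capturing the kernel launch sequence on a stream and instantiating it once; the kernel names, launch dimensions, and the CUDA 11-style signature of cudaGraphInstantiate used here are assumptions.

```cuda
// Hedged sketch: capture a kernel sequence into a CUDA graph so the host only
// needs to launch the whole graph, not each kernel, for every inference request.
#include <cuda_runtime.h>

__global__ void kernelA(float *x) { x[threadIdx.x] *= 2.0f; }  // placeholder stage 1
__global__ void kernelB(float *x) { x[threadIdx.x] += 1.0f; }  // placeholder stage 2

cudaGraphExec_t buildGraph(cudaStream_t stream, float *devBuf) {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // Record the launch sequence on the stream instead of executing it immediately.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernelA<<<128, 256, 0, stream>>>(devBuf);
    kernelB<<<128, 256, 0, stream>>>(devBuf);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once; the executable graph can then be launched repeatedly
    // without further per-kernel CPU-GPU interaction (CUDA 11-style signature).
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphDestroy(graph);
    return graphExec;
}

// Later, once the pending data is resident on the GPU:
//   cudaGraphLaunch(graphExec, stream);
//   cudaStreamSynchronize(stream);
```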
Referring to FIG. 2, which is a schematic structural diagram of a computing core graph provided by an embodiment of the present disclosure.
The computing core graph of the target neural network model shown in FIG. 2 contains four nodes, nodes 1-4, corresponding to computing cores 1-4 respectively. An arrow between two nodes indicates the running order of the computing cores corresponding to the nodes connected by the arrow: the computing core corresponding to the node at the tail of the arrow is executed first, and then the computing core corresponding to the node at the head of the arrow is executed. Computing cores 1-4 are respectively used for matrix multiplication calculation, matrix addition calculation, matrix scalar multiplication calculation, and convolution processing. The computing core graph shown in FIG. 2 indicates that the target neural network model first performs matrix multiplication calculation on the input data, then performs matrix addition calculation and matrix scalar multiplication calculation respectively, and then performs convolution processing on the calculation result of the matrix addition calculation and the calculation result of the matrix scalar multiplication calculation.
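Purely as a hedged illustration of the FIG. 2 topology (not the disclosure's own implementation), the same dependency structure could also be expressed with explicit CUDA graph nodes; the four trivial kernels below stand in for the real matrix multiplication, matrix addition, scalar multiplication, and convolution cores, and the launch dimensions are assumed.

```cuda
// Sketch of the FIG. 2 topology with explicit CUDA graph nodes
// (node1 -> node2, node1 -> node3, {node2, node3} -> node4).
#include <cuda_runtime.h>

__global__ void matMulK(float *d)    { d[threadIdx.x] *= d[threadIdx.x]; } // stand-in
__global__ void matAddK(float *d)    { d[threadIdx.x] += 1.0f; }           // stand-in
__global__ void scalarMulK(float *d) { d[threadIdx.x] *= 2.0f; }           // stand-in
__global__ void convK(float *d)      { d[threadIdx.x] -= 1.0f; }           // stand-in

void buildFigure2Graph(float *devBuf, cudaGraph_t *outGraph) {
    cudaGraphCreate(outGraph, 0);

    void *args[] = { &devBuf };
    cudaKernelNodeParams p = {};
    p.gridDim = dim3(1);
    p.blockDim = dim3(256);
    p.sharedMemBytes = 0;
    p.kernelParams = args;
    p.extra = nullptr;

    cudaGraphNode_t n1, n2, n3, n4;

    p.func = (void *)matMulK;                              // node 1: no dependencies
    cudaGraphAddKernelNode(&n1, *outGraph, nullptr, 0, &p);

    p.func = (void *)matAddK;                              // edge: node1 -> node2
    cudaGraphAddKernelNode(&n2, *outGraph, &n1, 1, &p);

    p.func = (void *)scalarMulK;                           // edge: node1 -> node3
    cudaGraphAddKernelNode(&n3, *outGraph, &n1, 1, &p);

    cudaGraphNode_t deps[] = { n2, n3 };                   // edges: node2 -> node4, node3 -> node4
    p.func = (void *)convK;
    cudaGraphAddKernelNode(&n4, *outGraph, deps, 2, &p);
}
```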
S102: after receiving the data to be processed sent by the CPU, running each computing core in sequence according to the running order represented by the computing core graph, processing the data to be processed, and completing the inference process of the target neural network model.
Specifically, after receiving the data to be processed, the GPU can use a pre-allocated storage space to complete the model inference process.
The address of the pre-allocated storage space is a fixed address corresponding to the target neural network model, and the size of the pre-allocated storage space is a preset size corresponding to the target neural network model.
The size of the pre-allocated storage space may be set by the user based on experience, or may be no less than the sum of a third data amount, a fourth data amount, and the size of the maximum required storage space.
The third data amount is the data amount of the target neural network model, specifically, the data amount of the model parameters of the target model; the fourth data amount is the sum of the data amounts of the operation results obtained after data processing based on each computing core; and the size of the maximum required storage space is greater than or equal to the size of the maximum storage space required by the GPU in the process of data processing based on each computing core.
In one embodiment of the present disclosure, based on the computation scale corresponding to the computing core graph, the data amount of the operation result obtained after data processing by each computing core in the computing core graph, and the size of the temporary storage space required by the GPU in the process of data processing based on each computing core, can be estimated in advance, for example by manual estimation or by a pre-written estimation program.
After the GPU completes data processing based on a computing core, it stores the processing result in the storage space, so a different storage space needs to be reserved for each computing core to store its processing result. The pre-allocated storage space therefore needs to be able to accommodate the operation results of all computing cores, that is, the size of the pre-allocated storage space needs to be greater than or equal to the sum of the data amounts of the operation results of the computing cores, namely the fourth data amount.
In addition, the temporary storage space corresponding to each computing core is used to store the intermediate calculation values generated in the process of data processing by the GPU based on that computing core. For each computing core, after the GPU completes the data processing process based on that computing core, the intermediate calculation values stored in the temporary storage space are released, so the GPU can reuse the same temporary storage space in the process of data processing based on different computing cores. The temporary storage space needs to be able to accommodate the intermediate calculation value with the largest data amount generated in the process of data processing based on the computing cores; a temporary storage space that meets this requirement can be called the maximum required storage space, and the size of the pre-allocated storage space needs to be greater than or equal to the size of the maximum required storage space.
Furthermore, the pre-allocated storage space also needs to be able to store the target neural network model.
Making the size of the pre-allocated storage space greater than or equal to the sum of the third data amount, the fourth data amount, and the size of the maximum required storage space enables the GPU, based on the pre-allocated storage space, to normally store the target neural network model, the intermediate calculation values generated when data processing is performed based on each computing core, and the operation results obtained after data processing based on each computing core, so that the model inference process can be completed normally.
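A minimal sketch of how the pre-allocated storage space described above might be sized and reserved once on the GPU, assuming illustrative field names for the three quantities (model parameters, summed kernel outputs, and the largest temporary workspace); the struct, byte counts, and helper function are not from the disclosure.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical sizes, in bytes, estimated offline for one computing core graph.
struct GraphMemoryPlan {
    size_t modelParamBytes;    // "third data amount": parameters of the target model
    size_t totalOutputBytes;   // "fourth data amount": sum of all kernel output sizes
    size_t maxWorkspaceBytes;  // largest temporary workspace any single kernel needs
};

// The pre-allocated space must hold the model, every kernel's output, and the
// largest temporary workspace (which is reused by each kernel in turn).
void *preallocateInferenceBuffer(const GraphMemoryPlan &plan) {
    size_t total = plan.modelParamBytes + plan.totalOutputBytes + plan.maxWorkspaceBytes;
    void *devPtr = nullptr;
    cudaMalloc(&devPtr, total);  // fixed device address reused for every inference request
    return devPtr;
}
```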
S103: feeding back the model inference result to the CPU.
It can be seen from the above that, in the solution provided by the embodiments of the present disclosure, the GPU obtains in advance the computing core graph corresponding to the target neural network model sent by the CPU. The computing core graph contains each computing core included in the target neural network model and can represent the running order of each computing core in the target neural network model. After the GPU receives the data to be processed sent by the CPU, it can call the computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process. Compared with the prior-art approach in which the CPU sends each computing core to the GPU one by one, in this embodiment the CPU can send all computing cores to the GPU by sending the computing core graph once. After the GPU subsequently receives the data to be processed sent by the CPU, the GPU can directly perform model inference based on the computing core graph, and the CPU and the GPU no longer need to exchange computing cores. In this embodiment, the number of interactions between the CPU and the GPU is small, which can reduce the impact of the CPU-GPU interaction on the GPU's model inference, thereby improving the model inference efficiency of the GPU.
Referring to FIG. 3A, which is a schematic flowchart of a second model inference method provided by an embodiment of the present disclosure. Specifically, step S101 can be implemented through the following step S101A, and step S102 can be implemented through steps S102A-S102B.
First, the computation scale of a computing core graph is explained:
Different computing core graphs correspond to different computation scales, and the computation scale corresponding to each computing core graph represents the data amount of the data that the computing cores included in that computing core graph can process.
Moreover, the data amount of the data that the GPU can process based on a computing core is a fixed data amount, which can be called the computation scale corresponding to the computing core. Each computing core can be set to support a mask operation, so that when a computing core processes data whose data amount is smaller than its own computation scale, the data to be processed can be expanded to the computation scale before data processing, enabling the GPU to process, based on a computing core, data whose data amount is less than or equal to the computation scale corresponding to that computing core.
For example, if the data to be processed is a matrix and the computing core corresponds to a computation scale of a 50×50 matrix, then when the GPU processes a 30×30 matrix based on that computing core, elements can be added to the matrix to expand its size to 50×50 before processing, and after the processing result is obtained, the processing results of the added elements are removed.
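The padding behaviour in this example could look like the following host-side sketch, where a rows×cols matrix is embedded into a scale×scale buffer before launch and the added elements are zero-filled; the function name, zero-fill policy, and row-major layout are assumptions made for illustration.

```cuda
#include <cstring>
#include <vector>

// Pad a rows x cols matrix (row-major) into a scale x scale buffer, zero-filling
// the added elements so a kernel built for the larger computation scale can run.
std::vector<float> padToScale(const std::vector<float> &m, int rows, int cols, int scale) {
    std::vector<float> padded(static_cast<size_t>(scale) * scale, 0.0f);
    for (int r = 0; r < rows; ++r)
        std::memcpy(&padded[static_cast<size_t>(r) * scale],
                    &m[static_cast<size_t>(r) * cols],
                    cols * sizeof(float));
    return padded;
}

// Usage idea: pad a 30x30 input for a graph whose computation scale is 50x50,
// copy `padded` to the GPU, run the graph, then read back only the first 30
// values of each of the first 30 rows of the result.
```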
In addition, the computation scales corresponding to the computing cores included in a computing core graph can be the same or different. However, in order to enable the computing cores included in the computing core graph to process data uniformly, computing cores with the same computation scale can be selected when constructing the computing core graph, and that shared computation scale can then be used as the computation scale corresponding to the computing core graph.
In order to enable the GPU to normally perform data processing based on the computing core graph, the computation scale corresponding to the computing core graph can be set based on the data amounts of historical data in the application scenario of the target model.
For example, the computation scale corresponding to the computing core graph can be set to be greater than or equal to the maximum data amount of the historical data in the application scenario, so that the GPU can, in theory, process all the data to be processed in the application scenario based on the computing core graph.
Alternatively, the maximum data amount of the historical data can be multiplied by a first preset ratio, and the result can be used as the computation scale corresponding to the computing core graph. For example, the first preset ratio can be 70%, 80%, etc., so that the GPU can process most of the data to be processed in the application scenario based on the computing core graph.
S101A: receiving multiple computing core graphs corresponding to the target neural network model sent by the CPU.
Specifically, the computing cores recorded in the different computing core graphs are all computing cores included in the target model, and the different computing core graphs have the same structure; the only difference is that they correspond to different computation scales.
In the embodiment of the present disclosure, there are multiple different computing core graphs, and the GPU can store all of the received computing core graphs in the storage space, so that the stored computing core graphs can be called directly for model inference later.
In addition, the pre-allocated storage space can be a storage space that can be reused, that is, no matter which computing core graph the CPU selects and sends to the GPU, the GPU can use the pre-allocated storage space in the process of model inference based on the received computing core graph. The larger the computation scale corresponding to the computing core graph used by the GPU in the process of model inference, the larger the data amount of the data processed in the model inference process and the larger the required storage space. Therefore, if the pre-allocated storage space can meet the data storage requirements of the computing core graph with the largest computation scale, it can be reused for the other computing core graphs. The size of the pre-allocated storage space can therefore be determined based on the computing core graph with the largest computation scale; for the specific way of determining the size of the pre-allocated storage space, refer to the description at step S102 above, which will not be repeated here.
S102A: based on the first data amount of the data to be processed, selecting a first computing core graph from the computing core graphs.
The first computing core graph is a computing core graph whose corresponding computation scale is not less than the first data amount and is closest to the first data amount.
In addition, referring to the foregoing description, based on a computing core graph the GPU can perform model inference on data whose data amount is less than or equal to the computation scale corresponding to that computing core graph; in the process of model inference, the GPU can expand the data amount of the data to be processed to the computation scale corresponding to the computing core graph before processing. Therefore, the larger the computation scale corresponding to the computing core graph, the larger the data amount of the data actually processed in the process of model inference based on that computing core graph, and the more data processing resources are consumed.
In the embodiment of the present disclosure, multiple computing core graphs corresponding to different computation scales are constructed in advance, and the CPU sends all of them to the GPU in advance, so the GPU can complete the model inference process based on any one of the computing core graphs. Before processing the data to be processed, the GPU can select, from the multiple computing core graphs, the computing core graph whose corresponding computation scale is greater than or equal to and closest to the first data amount, so that the GPU can process the data to be processed based on the selected computing core graph while the data amount actually processed during processing is the smallest.
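A hedged host-side sketch of the selection rule in S102A is shown below; the CapturedGraph structure, its field names, and the sorted-by-scale assumption are illustrative and not taken from the disclosure.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <cstddef>

struct CapturedGraph {
    size_t computeScale;   // largest data amount this graph can handle, in bytes
    cudaGraphExec_t exec;  // instantiated executable graph for this scale
};

// Pick the graph whose scale is >= the input size and closest to it.
// Assumes `graphs` is sorted by ascending computeScale; returns nullptr if
// even the largest graph is too small for the request.
const CapturedGraph *selectGraph(const std::vector<CapturedGraph> &graphs, size_t dataBytes) {
    for (const CapturedGraph &g : graphs)
        if (g.computeScale >= dataBytes)
            return &g;
    return nullptr;
}
```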
The computation scale corresponding to each computing core graph can be any value. Specifically, the computation scale corresponding to each computing core graph can be set based on the maximum data amount of the historical data in the application scenario of the target model.
In one embodiment of the present disclosure, the maximum data amount of the historical data in the application scenario of the target model can be determined and multiplied by different second preset ratios, with the obtained results used as the computation scales corresponding to the respective computing core graphs.
For example, if the maximum data amount of the historical data in the application scenario of the target model is 80M and the second preset ratios are 100%, 80%, and 60% respectively, the computation scales corresponding to the computing core graphs can be set to 80M, 64M, and 48M.
Alternatively, the maximum of the computation scales corresponding to the computing core graphs can be set based on the maximum data amount of the historical data in the application scenario of the target model, and the quotient of this maximum value and the number of computing core graphs can then be calculated and used as the difference between the computation scales corresponding to adjacent computing core graphs. The computation scales corresponding to the computing core graphs are set based on this difference, so that they form an arithmetic sequence.
For example, if the maximum data amount of the historical data in the application scenario of the target model is 100M, the maximum computation scale corresponding to the computing core graphs can be set to 100M. If the number of computing core graphs is 10, the quotient is 10M, and the computation scales corresponding to the computing core graphs can be set to 10M, 20M, 30M, 40M, 50M, 60M, 70M, 80M, 90M, and 100M respectively.
S102B: running each computing core in sequence according to the running order represented by the first computing core graph, and processing the data to be processed.
Specifically, step S102B is similar to the aforementioned step S102 and will not be repeated here.
It can be seen from the above that, in the embodiment of the present disclosure, multiple computing core graphs are constructed in advance. In the process of model inference, the GPU selects the computing core graph whose corresponding computation scale is greater than or equal to and closest to the first data amount for model inference, so that while the GPU can process the data to be processed based on the selected computing core graph, the data amount processed during processing is the smallest, which can save the data processing resources of the GPU.
Referring to FIG. 3B, which is a schematic diagram of the first computing core graph selection process provided by an embodiment of the present disclosure.
FIG. 3B contains an input module and n computing core graphs, namely computing core graph 1, computing core graph 2, ..., computing core graph n. The arrows between the input module and the computing core graphs indicate that the GPU can select one of computing core graph 1 to computing core graph n based on the actual first data amount of the input data to be processed and use the selected computing core graph for model inference. The arrows between the computing core graphs and the pre-allocated storage space indicate that the GPU shares the same pre-allocated storage space in the process of model inference based on the different computing core graphs.
Referring to FIG. 4, which is a schematic flowchart of a third model inference method provided by an embodiment of the present disclosure. Compared with the embodiment shown in FIG. 1, step S102 can be implemented through the following step S102C, and step S103 can be implemented through the following step S103A.
S102C: after receiving multiple pieces of data to be processed sent by the CPU, merging the multiple pieces of data to be processed into merged data, calling the computing core graph, running each computing core in sequence according to the running order represented by the computing core graph, processing the merged data, and completing the inference process of the target neural network model.
The multiple pieces of data to be processed are all data to be processed by the target neural network model, so the processing of the multiple pieces of data to be processed can be realized based on the same computing core graph of the target neural network model.
In one embodiment of the present disclosure, when the CPU receives multiple data processing requests, if there are multiple data processing requests that request data processing through the target neural network model, the data to be processed contained in these data processing requests can be sent to the GPU together, so that the GPU receives the multiple pieces of data to be processed.
In addition, after receiving the multiple pieces of data to be processed, the GPU can uniformly expand the data amount of each piece of data to be processed to the maximum data amount among the pieces of data to be processed, and then merge them into one piece of merged data. When processing the merged data, the GPU can call the computing core graph only once and process the merged data based on that computing core graph, which is equivalent to completing the processing of the multiple pieces of data to be processed.
Specifically, the process in which the GPU processes the merged data is similar to the content shown in step S102 above, and the way in which the GPU expands the data amount of the data to be processed is similar to the content shown in step S102A above, which will not be repeated in this embodiment.
S103A: extracting the model inference result corresponding to each piece of data to be processed from the model inference result of the merged data, and feeding back the model inference result corresponding to each piece of data to be processed to the CPU respectively.
Specifically, according to the arrangement order of the pieces of data to be processed in the merged data, the processing result corresponding to each piece of data to be processed can be extracted from the model inference result of the merged data, and the processing results corresponding to the expanded data can then be removed, so as to obtain the model inference result corresponding to each piece of data to be processed.
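To illustrate the merge-and-split flow of S102C/S103A, here is a hedged host-side sketch that pads each pending input to the largest input size, concatenates the inputs into one merged buffer, and later slices the merged result back apart; the buffer layout, the zero padding, and the simplification that each request's output occupies the same slot size as its input are assumptions.

```cuda
#include <vector>
#include <algorithm>
#include <cstddef>

// Merge several pending inputs into one buffer, padding each to the largest size.
std::vector<float> mergeInputs(const std::vector<std::vector<float>> &inputs, size_t &slotSize) {
    slotSize = 0;
    for (const auto &in : inputs)
        slotSize = std::max(slotSize, in.size());
    std::vector<float> merged(slotSize * inputs.size(), 0.0f);
    for (size_t i = 0; i < inputs.size(); ++i)
        std::copy(inputs[i].begin(), inputs[i].end(), merged.begin() + i * slotSize);
    return merged;  // copy this to the GPU and run the graph once on the whole buffer
}

// Split the merged inference result back into one result per original request,
// dropping the padded tail of each slot.
std::vector<std::vector<float>> splitResults(const std::vector<float> &mergedOut,
                                             const std::vector<size_t> &realSizes,
                                             size_t slotSize) {
    std::vector<std::vector<float>> results;
    for (size_t i = 0; i < realSizes.size(); ++i)
        results.emplace_back(mergedOut.begin() + i * slotSize,
                             mergedOut.begin() + i * slotSize + realSizes[i]);
    return results;
}
```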
It can be seen from the above that, if there are multiple pieces of data to be processed that are to be processed through the target neural network model, the GPU can merge them into one piece of merged data and then call the computing core graph to process the merged data, which is equivalent to processing each piece of data to be processed in a unified manner. In this process, the GPU only needs to call the computing core graph once to complete the processing of multiple pieces of data to be processed. Compared with calling the computing core graph once for each piece of data to be processed, the computing core graph is called fewer times in this embodiment, which can further improve the model inference efficiency of the GPU.
Referring to FIG. 5A, which is a schematic flowchart of a fourth model inference method provided by an embodiment of the present disclosure. Compared with the embodiment shown in FIG. 4, step S101 can be implemented through the following step S101B, and step S102C can be implemented through the following steps S102C1-S102C2.
S101B: receiving multiple computing core graphs corresponding to the target neural network model sent by the CPU.
Different computing core graphs correspond to different computation scales, and the computation scale corresponding to each computing core graph represents the data amount of the data that the computing cores included in that computing core graph can process.
Specifically, step S101B is similar to the aforementioned step S101A and will not be repeated in this embodiment.
S102C1: based on a second data amount, selecting a second computing core graph from the computing core graphs.
The second data amount is the product of the maximum data amount among the pieces of data to be processed and the number of pieces of data to be processed, and the second computing core graph is the computing core graph whose corresponding computation scale is greater than or equal to and closest to the second data amount.
Specifically, the data amount of each piece of data to be processed is less than or equal to the maximum data amount among the pieces of data to be processed, so after the pieces of data to be processed are merged into merged data, the data amount of the merged data will not be greater than the product of the maximum data amount and the number of pieces of data to be processed. The computation scale corresponding to the selected second computing core graph is greater than or equal to the second data amount, so the GPU can process the merged data based on the selected second computing core graph; and since the computation scale corresponding to the selected second computing core graph is closest to the second data amount, the GPU consumes the least computing resources overall when processing the merged data based on the selected second computing core graph.
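Reusing the hypothetical CapturedGraph/selectGraph helpers from the earlier sketch, the second-graph choice in S102C1 could be expressed as follows; again an assumption-laden illustration rather than the disclosure's code.

```cuda
#include <vector>
#include <cstddef>

// Assumes the CapturedGraph struct and selectGraph() helper sketched earlier.
const CapturedGraph *selectGraphForBatch(const std::vector<CapturedGraph> &graphs,
                                         const std::vector<std::vector<float>> &pendingInputs) {
    size_t maxSize = 0;
    for (const auto &in : pendingInputs)
        if (in.size() > maxSize) maxSize = in.size();

    // Second data amount: largest single input size times the number of inputs.
    size_t secondDataAmount = maxSize * pendingInputs.size();

    // Choose the graph whose computation scale is >= this amount and closest to it.
    return selectGraph(graphs, secondDataAmount);
}
```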
S102C2: calling the second computing core graph, running the computing cores in sequence according to the running order represented by the second computing core graph, and processing the merged data.
Specifically, the way of processing the merged data is similar to the content described in the aforementioned step S102 and will not be described again in this embodiment.
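One possible way to realize "calling the computing core graph" is to launch a pre-captured CUDA graph. The disclosure does not mandate any particular GPU API, so the sketch below is only one assumed realization, with a placeholder kernel standing in for the model's computing cores.

```cpp
#include <cuda_runtime.h>

// Placeholder for one computing core of the target neural network model.
__global__ void dummy_core(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float* buf = nullptr;
    cudaMalloc(&buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the computing cores once, in their running order, into a graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    dummy_core<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);   // core 1
    dummy_core<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);   // core 2, runs after core 1
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);            // CUDA 11.4+

    // "Calling the computing core graph": one launch runs all captured cores in order.
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(buf);
    cudaStreamDestroy(stream);
    return 0;
}
```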
As can be seen from the above, multiple computing core graphs are pre-built in this embodiment of the present disclosure. During model inference, the GPU selects the second computing core graph whose corresponding calculation scale is greater than or equal to, and closest to, the second data amount. Based on the second computing core graph, the GPU is able to process the merged data while handling the smallest possible amount of data during the processing, thereby saving the data processing resources of the GPU.
Referring to Figure 5B, which is a schematic diagram of a second computing core graph selection process provided by an embodiment of the present disclosure.
Compared with the embodiment shown in Figure 3B, Figure 5B further includes m pieces of data to be processed, namely data to be processed 1, data to be processed 2, ..., data to be processed m, all of which are data to be processed by the target neural network model. There is an arrow between each piece of data to be processed and the input, indicating that the GPU can process multiple pieces of data to be processed in a unified manner.
Referring to Figure 6A, which is a schematic flowchart of a fifth model inference method provided by an embodiment of the present disclosure, applied to a CPU. The method includes the following steps S601-S603.
S601: sending a pre-built computing core graph to the GPU.
Each node in the computing core graph corresponds to one of the computing cores included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by that edge.
S602: in the case where there is a target processing request, sending data to be processed to the GPU, so that the GPU runs the computing cores in sequence according to the running order represented by the computing core graph, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU.
The target processing request is a request to process the data to be processed using the target neural network model.
S603: receiving the model inference result fed back by the GPU.
In an embodiment of the present disclosure, the above steps S601-S603 are similar to the aforementioned steps S101-S103, the only difference being the execution subject, and they will not be described again here.
As can be seen from the above, in the solution provided by this embodiment of the present disclosure, the CPU sends the computing core graph to the GPU, and the GPU can then run the computing cores in sequence according to the computing core graph and process the data to be processed, thereby completing the model inference process of the target neural network model. In this process, the CPU only needs to send the computing core graph to the GPU once, and the GPU can subsequently complete the model inference process based on the received computing core graph. Compared with the prior art, in which the CPU sends the individual computing cores to the GPU multiple times during model inference, the number of interactions between the CPU and the GPU in this embodiment is smaller, which reduces the influence of the CPU-GPU interaction on the GPU's model inference and thereby improves the model inference efficiency of the GPU.
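Continuing the capture sketch above, the reduced interaction count can be illustrated by how a request is served once the instantiated graph is already resident on the GPU side: each request costs a single graph launch instead of one launch per computing core. The function name and parameters below are hypothetical.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// After the one-time graph transfer/instantiation, serving one request needs only
// an input copy plus a single graph launch, rather than N per-core launches.
void serve_request(cudaGraphExec_t exec, cudaStream_t stream,
                   float* dev_in, const float* host_in, size_t bytes) {
    cudaMemcpyAsync(dev_in, host_in, bytes, cudaMemcpyHostToDevice, stream);
    cudaGraphLaunch(exec, stream);      // one call replaces per-core kernel launches
    cudaStreamSynchronize(stream);
}
```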
The embodiment of the present disclosure may implement the above step S601 through the following step A.
Step A: sending pre-built computing core graphs to the GPU.
Different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process.
In an embodiment of the present disclosure, after the GPU receives the multiple computing core graphs sent by the CPU, it may process the data to be processed according to the aforementioned steps S102A-S102B, which will not be described again here.
As can be seen from the above, multiple computing core graphs are pre-built in this embodiment of the present disclosure. During model inference, the GPU can, based on the data amount of the data that actually needs to be processed, select from the computing core graphs the one whose calculation scale is greater than or equal to, and closest to, that data amount to perform data processing, thereby saving the data processing resources of the GPU.
Referring to Figure 6B, which is a schematic flowchart of a sixth model inference method provided by an embodiment of the present disclosure. Compared with the embodiment shown in Figure 6A, the following steps S604-S605 are further included after the above step S601.
S604: in the case where it is determined that the calculation scale corresponding to the pre-built computing core graph is smaller than the data amount of the data to be processed, sending the data to be processed to the GPU, and sending target computing cores to the GPU in a preset order, so that the GPU runs the target computing cores in sequence in the order in which they are received, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU.
The calculation scale corresponding to the computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process; the preset order is the execution order of the target computing cores specified by the target neural network model; and the amount of data that the target computing cores are able to process is not smaller than the data amount of the data to be processed.
Specifically, if the GPU determines, after receiving the data to be processed, that the calculation scale corresponding to the computing core graph is greater than or equal to the data amount of the data to be processed, the GPU can process the data to be processed based on the computing core graph. Otherwise, the GPU cannot process the data to be processed based on the computing core graph, and it may send a request to the CPU indicating that it is unable to do so, in order to ask the CPU to assist in completing the data processing in another way.
Upon receiving this request, the CPU can determine that the calculation scale corresponding to the pre-built computing core graph is smaller than the data amount of the data to be processed, and may then execute steps S604-S605.
In an embodiment of the present disclosure, each target computing core corresponds to a different data processing link of the target neural network model. The GPU runs the target computing cores in sequence to complete the data processing links of the target neural network model, thereby realizing the model inference process of the target neural network model.
The preset order in which the CPU sends the target computing cores to the GPU is the same as the running order of the computing cores represented by the computing core graph. The target computing cores correspond to the same data processing links as the computing cores included in the aforementioned computing core graph; the only difference is the amount of data they are able to process, the target computing cores being able to process a larger amount of data.
Specifically, each time the CPU sends a target computing core to the GPU, the GPU can run that target computing core to complete the corresponding data processing. The CPU sends the target computing cores to the GPU in the preset order, and the GPU runs the target computing cores in sequence in the order in which they are received, completing the inference process of the target neural network model.
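The fallback path of S604 can be pictured as issuing the cores one by one instead of through a single graph launch. The sketch below is an assumption-laden illustration: the two placeholder kernels stand in for target computing cores with larger processing capacity, and `run_fallback` is not a name defined by the disclosure.

```cpp
#include <cuda_runtime.h>

// Placeholder "target computing cores": larger-capacity versions of the model's cores.
__global__ void target_core_a(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}
__global__ void target_core_b(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 0.5f;
}

// No pre-built graph is large enough, so the cores are issued one by one
// in the preset order rather than through a single graph launch.
void run_fallback(float* dev_buf, int n, cudaStream_t stream) {
    const int block = 256, grid = (n + block - 1) / block;
    target_core_a<<<grid, block, 0, stream>>>(dev_buf, n);  // first data processing link
    target_core_b<<<grid, block, 0, stream>>>(dev_buf, n);  // second link, after the first
    cudaStreamSynchronize(stream);
}
```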
S605: receiving the model inference result fed back by the GPU.
As can be seen from the above, the calculation scale corresponding to the computing core graph built in this embodiment of the present disclosure does not need to be excessively large. In the case where the calculation scale corresponding to the computing core graph is smaller than the data amount of the data to be processed, so that the GPU cannot complete the model inference of the target neural network model based on the computing core graph, this embodiment is not limited to realizing model inference based on the computing core graph only; instead, the CPU can send the target computing cores to the GPU one by one to ensure that the model inference process can still be carried out normally.
Corresponding to the above model inference method applied to a GPU, an embodiment of the present disclosure further provides a model inference apparatus.
Referring to Figure 7, which is a schematic structural diagram of a first model inference apparatus provided by an embodiment of the present disclosure, applied to a GPU. The apparatus includes the following modules 701-703.
A computing core graph receiving module 701, configured to receive a computing core graph corresponding to a target neural network model sent by a CPU, wherein each node in the computing core graph corresponds to one of the computing cores included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by that edge;
a model inference module 702, configured to, after receiving data to be processed sent by the CPU, run the computing cores in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the inference process of the target neural network model;
a result feedback module 703, configured to feed back a model inference result to the CPU.
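The three modules can be pictured as a small GPU-side object. The sketch below is purely illustrative: it again assumes the computing core graph is realized as an instantiated CUDA graph, and the class, method, and `Result` names are chosen only for this example.

```cpp
#include <cuda_runtime.h>
#include <vector>
#include <cstddef>

struct Result { std::vector<float> values; };  // assumed shape of a model inference result

class GpuInferenceApparatus {
public:
    // Module 701: keep the computing core graph received from the CPU.
    void receive_graph(cudaGraphExec_t exec, cudaStream_t stream) {
        exec_ = exec;
        stream_ = stream;
    }

    // Module 702: process the data to be processed by launching the graph once.
    Result infer(float* dev_in_out, size_t n) {
        cudaGraphLaunch(exec_, stream_);
        cudaStreamSynchronize(stream_);
        Result r;
        r.values.resize(n);
        cudaMemcpy(r.values.data(), dev_in_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        return r;  // Module 703: the result is handed back to the CPU side
    }

private:
    cudaGraphExec_t exec_{};
    cudaStream_t stream_{};
};
```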
As can be seen from the above, in the solution provided by this embodiment of the present disclosure, the GPU obtains in advance the computing core graph, sent by the CPU, that corresponds to the target neural network model. The computing core graph contains the computing cores included in the target neural network model and can represent the running order of those computing cores. After the GPU receives the data to be processed sent by the CPU, it can call the computing core graph, run the computing cores in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process. Compared with the prior art, in which the CPU sends the computing cores to the GPU one after another, in this embodiment the CPU can send all of the computing cores to the GPU by sending the computing core graph once. After the GPU subsequently receives the data to be processed sent by the CPU, the GPU can perform model inference directly based on the computing core graph, and the CPU and the GPU no longer need to exchange computing cores. The number of interactions between the CPU and the GPU in this embodiment is therefore smaller, which reduces the influence of the CPU-GPU interaction on the GPU's model inference and thereby improves the model inference efficiency of the GPU.
In an embodiment of the present disclosure, the computing core graph receiving module 701 is specifically configured to:
receive multiple computing core graphs corresponding to the target neural network model sent by the CPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process;
the model inference module 702 is specifically configured to:
after receiving the data to be processed sent by the CPU, select a first computing core graph from the computing core graphs based on a first data amount of the data to be processed, wherein the first computing core graph is the computing core graph whose corresponding calculation scale is not smaller than, and closest to, the first data amount;
run the computing cores in sequence according to the running order represented by the first computing core graph, process the data to be processed, and complete the inference process of the target neural network model.
As can be seen from the above, multiple computing core graphs are pre-built in this embodiment of the present disclosure. During model inference, the GPU selects the computing core graph whose corresponding calculation scale is greater than or equal to, and closest to, the first data amount to perform model inference, so that, while the GPU remains able to process the data to be processed based on the selected computing core graph, the amount of data handled during the processing is the smallest, thereby saving the data processing resources of the GPU.
Referring to Figure 8, which is a schematic structural diagram of a second model inference apparatus provided by an embodiment of the present disclosure. Compared with the embodiment shown in Figure 7, the model inference module 702 includes:
a data processing sub-module 702A, configured to, after receiving multiple pieces of data to be processed sent by the CPU, merge the multiple pieces of data to be processed into merged data, call the computing core graph, run the computing cores in sequence according to the running order represented by the computing core graph, process the merged data, and complete the inference process of the target neural network model, wherein the multiple pieces of data to be processed are all data to be processed by the target neural network model;
the result feedback module 703 includes:
a result feedback sub-module 703A, configured to extract, from the model inference result of the merged data, the model inference result corresponding to each piece of data to be processed, and feed back the model inference result corresponding to each piece of data to be processed to the CPU respectively.
As can be seen from the above, if there are multiple pieces of data to be processed by the target neural network model, the GPU can merge the pieces of data to be processed into one piece of merged data and then call the computing core graph to process the merged data, which is equivalent to processing all of the pieces of data to be processed in a unified manner. In this process, the GPU only needs to call the computing core graph once to complete the processing of the multiple pieces of data to be processed. Compared with calling the computing core graph once for each piece of data to be processed, the number of calls to the computing core graph in this embodiment is smaller, which can further improve the model inference efficiency of the GPU.
In an embodiment of the present disclosure, the computing core graph receiving module 701 is specifically configured to:
receive multiple computing core graphs corresponding to the target neural network model sent by the CPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process;
the data processing sub-module 702A is specifically configured to:
after receiving the multiple pieces of data to be processed sent by the CPU, merge the multiple pieces of data to be processed into merged data, and select a second computing core graph from the computing core graphs based on a second data amount, wherein the second data amount is the product of the maximum data amount among the pieces of data to be processed and the number of pieces of data to be processed, and the second computing core graph is the computing core graph whose corresponding calculation scale is greater than or equal to, and closest to, the second data amount;
call the second computing core graph, run the computing cores in sequence according to the running order represented by the second computing core graph, process the merged data, and complete the inference process of the target neural network model.
As can be seen from the above, multiple computing core graphs are pre-built in this embodiment of the present disclosure. During model inference, the GPU selects the second computing core graph whose corresponding calculation scale is greater than or equal to, and closest to, the second data amount. Based on the second computing core graph, the GPU is able to process the merged data while handling the smallest possible amount of data during the processing, thereby saving the data processing resources of the GPU.
In an embodiment of the present disclosure, the size of the pre-allocated storage space required by the GPU in the process of completing the inference of the target neural network model is not smaller than the sum of a third data amount, a fourth data amount, and a maximum required storage space;
wherein the third data amount is the data amount of the target neural network model, the fourth data amount is the sum of the data amounts of the operation results obtained after data processing based on the computing cores, and the maximum required storage space is the largest storage space required in the process of performing data processing based on the computing cores.
As can be seen from the above, making the size of the pre-allocated storage space greater than or equal to the sum of the third data amount, the fourth data amount, and the maximum required storage space enables the GPU to complete the model inference process normally based on the pre-allocated storage space.
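The sizing rule can be made concrete with a short allocation sketch. It only illustrates the inequality above: `model_bytes`, `per_core_result_bytes`, and `per_core_workspace_bytes` are assumed inputs describing the target neural network model and its computing cores, not values defined by the disclosure.

```cpp
#include <cuda_runtime.h>
#include <vector>
#include <numeric>
#include <algorithm>
#include <cstddef>

// Pre-allocate one block of GPU storage that is at least:
//   third data amount   (model parameters)
// + fourth data amount  (sum of all per-core operation results)
// + the maximum workspace any single core needs while it runs.
void* preallocate_inference_memory(size_t model_bytes,
                                   const std::vector<size_t>& per_core_result_bytes,
                                   const std::vector<size_t>& per_core_workspace_bytes) {
    const size_t results_total =
        std::accumulate(per_core_result_bytes.begin(), per_core_result_bytes.end(), size_t{0});
    const size_t max_workspace = per_core_workspace_bytes.empty()
        ? 0
        : *std::max_element(per_core_workspace_bytes.begin(), per_core_workspace_bytes.end());

    const size_t total = model_bytes + results_total + max_workspace;
    void* pool = nullptr;
    cudaMalloc(&pool, total);   // single allocation reused for the whole inference process
    return pool;
}
```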
Corresponding to the above model inference method applied to a CPU, an embodiment of the present disclosure further provides a model inference apparatus applied to a CPU.
Referring to Figure 9, which is a schematic structural diagram of a third model inference apparatus provided by an embodiment of the present disclosure. The apparatus includes the following modules 901-903.
A computing core graph sending module 901, configured to send a pre-built computing core graph to a graphics processing unit (GPU), wherein each node in the computing core graph corresponds to one of the computing cores included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by that edge;
a data sending module 902, configured to, in the case where there is a target processing request, send data to be processed to the GPU, so that the GPU runs the computing cores in sequence according to the running order represented by the computing core graph, processes the data to be processed based on a preset storage space in the graphics memory, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU, wherein the target processing request is a request to process the data to be processed using the target neural network model;
a first result receiving module 903, configured to receive the model inference result fed back by the GPU.
As can be seen from the above, in the solution provided by this embodiment of the present disclosure, the CPU sends the computing core graph to the GPU, and the GPU can then run the computing cores in sequence according to the computing core graph and process the data to be processed, thereby completing the model inference process of the target neural network model. In this process, the CPU only needs to send the computing core graph to the GPU once, and the GPU can subsequently complete the model inference process based on the received computing core graph. Compared with the prior art, in which the CPU sends the individual computing cores to the GPU multiple times during model inference, the number of interactions between the CPU and the GPU in this embodiment is smaller, which reduces the influence of the CPU-GPU interaction on the GPU's model inference and thereby improves the model inference efficiency of the GPU.
In an embodiment of the present disclosure, the computing core graph sending module 901 is specifically configured to:
send pre-built computing core graphs to the GPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process.
As can be seen from the above, multiple computing core graphs are pre-built in this embodiment of the present disclosure. During model inference, the GPU can, based on the data amount of the data that actually needs to be processed, select from the computing core graphs the one whose calculation scale is greater than or equal to, and closest to, that data amount to perform data processing, thereby saving the data processing resources of the GPU.
In an embodiment of the present disclosure, the apparatus further includes:
a computing core sending module, configured to, in the case where it is determined that the calculation scale corresponding to the pre-built computing core graph is smaller than the data amount of the data to be processed, send the data to be processed to the GPU and send target computing cores to the GPU in a preset order, so that the GPU runs the target computing cores in sequence in the order in which they are received, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU;
wherein the calculation scale corresponding to the computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process, the preset order is the execution order of the target computing cores specified by the target neural network model, and the amount of data that the target computing cores are able to process is not smaller than the data amount of the data to be processed;
a second result receiving module, configured to receive the model inference result fed back by the GPU.
As can be seen from the above, the calculation scale corresponding to the computing core graph built in this embodiment of the present disclosure does not need to be excessively large. In the case where the calculation scale corresponding to the computing core graph is smaller than the data amount of the data to be processed, so that the GPU cannot complete the model inference of the target neural network model based on the computing core graph, this embodiment is not limited to realizing model inference based on the computing core graph only; instead, the CPU can send the target computing cores to the GPU one by one to ensure that the model inference process can still be carried out normally.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
An embodiment of the present disclosure provides an electronic device, including:
at least one CPU; and
a memory communicatively connected to the at least one CPU; wherein
the memory stores instructions executable by the at least one CPU, and the instructions are executed by the at least one CPU to enable the at least one CPU to execute the method steps of any one of the model inference methods applied to a GPU.
An embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the model inference method applied to a CPU and the model inference method applied to a GPU.
An embodiment of the present disclosure provides a computer program product, including a computer program which, when executed by a processor, implements the model inference method applied to a CPU and the model inference method applied to a GPU.
Figure 10 shows a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in Figure 10, the device 1000 includes a GPU 1001, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. The RAM 1003 may also store various programs and data required for the operation of the device 1000. The GPU 1001, the ROM 1002, and the RAM 1003 are connected to one another through a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Multiple components in the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard or a mouse; an output unit 1007, such as various types of displays or speakers; a storage unit 1008, such as a magnetic disk or an optical disc; and a communication unit 1009, such as a network card, a modem, or a wireless communication transceiver. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The GPU 1001 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. The GPU 1001 performs the various methods and processes described above, such as the model inference methods. For example, in some embodiments, the model inference methods may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the GPU 1001, one or more steps of the model inference methods described above may be performed. Alternatively, in other embodiments, the GPU 1001 may be configured to perform the model inference methods in any other suitable manner (for example, by means of firmware).
Figure 11 shows a schematic block diagram of an example electronic device 1100 that can be used to implement another embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in Figure 11, the device 1100 includes a CPU 1101, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a random access memory (RAM) 1103. The RAM 1103 may also store various programs and data required for the operation of the device 1100. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to one another through a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
Multiple components in the device 1100 are connected to the I/O interface 1105, including: an input unit 1106, such as a keyboard or a mouse; an output unit 1107, such as various types of displays or speakers; a storage unit 1108, such as a magnetic disk or an optical disc; and a communication unit 1109, such as a network card, a modem, or a wireless communication transceiver. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The CPU 1101 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. The CPU 1101 performs the various methods and processes described above, such as the model inference methods. For example, in some embodiments, the model inference methods may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the CPU 1101, one or more steps of the model inference methods described above may be performed. Alternatively, in other embodiments, the CPU 1101 may be configured to perform the model inference methods in any other suitable manner (for example, by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in a digital electronic circuit system, an integrated circuit system, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs, where the one or more computer programs are executable and/or interpretable on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input apparatus, and at least one output apparatus, and transmit data and instructions to the storage system, the at least one input apparatus, and the at least one output apparatus.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, so that when the program code is executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented. The program code may be executed entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display apparatus (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of apparatuses may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system including a back-end component (for example, as a data server), or a computing system including a middleware component (for example, an application server), or a computing system including a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system including any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication (for example, a communication network) in any form or medium. Examples of the communication network include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The relationship between the client and the server is generated by computer programs running on the corresponding computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above specific implementations do not constitute a limitation on the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (20)

  1. A model inference method, applied to a graphics processing unit (GPU), comprising:
    receiving a computing core graph corresponding to a target neural network model sent by a CPU, wherein each node in the computing core graph corresponds to one of the computing cores included in the target neural network model, and the direction of each edge is used to indicate a running order of the computing cores corresponding to the nodes connected by that edge;
    after receiving data to be processed sent by the CPU, running the computing cores in sequence according to the running order represented by the computing core graph, processing the data to be processed, and completing an inference process of the target neural network model;
    feeding back a model inference result to the CPU.
  2. The method according to claim 1, wherein the receiving a computing core graph corresponding to a target neural network model sent by a CPU comprises:
    receiving multiple computing core graphs corresponding to the target neural network model sent by the CPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents an amount of data that the computing cores included in that computing core graph are able to process;
    the running the computing cores in sequence according to the running order represented by the computing core graph and processing the data to be processed comprises:
    selecting, based on a first data amount of the data to be processed, a first computing core graph from the computing core graphs, wherein the first computing core graph is the computing core graph whose corresponding calculation scale is not smaller than, and closest to, the first data amount;
    running the computing cores in sequence according to the running order represented by the first computing core graph, and processing the data to be processed.
  3. The method according to claim 1, wherein the, after receiving data to be processed sent by the CPU, running the computing cores in sequence according to the running order represented by the computing core graph and processing the data to be processed comprises:
    after receiving multiple pieces of data to be processed sent by the CPU, merging the multiple pieces of data to be processed into merged data, calling the computing core graph, running the computing cores in sequence according to the running order represented by the computing core graph, and processing the merged data, wherein the multiple pieces of data to be processed are all data to be processed by the target neural network model;
    the feeding back a model inference result to the CPU comprises:
    extracting, from a model inference result of the merged data, a model inference result corresponding to each piece of data to be processed, and feeding back the model inference result corresponding to each piece of data to be processed to the CPU respectively.
  4. The method according to claim 3, wherein the receiving a computing core graph corresponding to a target neural network model sent by a CPU comprises:
    receiving multiple computing core graphs corresponding to the target neural network model sent by the CPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents an amount of data that the computing cores included in that computing core graph are able to process;
    the calling the computing core graph, running the computing cores in sequence according to the running order represented by the computing core graph, and processing the merged data comprises:
    selecting, based on a second data amount, a second computing core graph from the computing core graphs, wherein the second data amount is a product of a maximum data amount among the pieces of data to be processed and the number of pieces of data to be processed, and the second computing core graph is the computing core graph whose corresponding calculation scale is greater than or equal to, and closest to, the second data amount;
    calling the second computing core graph, running the computing cores in sequence according to the running order represented by the second computing core graph, and processing the merged data.
  5. The method according to any one of claims 1-4, wherein a size of a pre-allocated storage space required by the GPU in the process of completing the inference of the target neural network model is not smaller than a sum of a third data amount, a fourth data amount, and a maximum required storage space;
    wherein the third data amount is a data amount of the target neural network model, the fourth data amount is a sum of data amounts of operation results obtained after data processing based on the computing cores, and the maximum required storage space is the largest storage space required in the process of performing data processing based on the computing cores.
  6. A model inference method, applied to a CPU, comprising:
    sending a pre-built computing core graph to a graphics processing unit (GPU), wherein each node in the computing core graph corresponds to one of the computing cores included in a target neural network model, and the direction of each edge is used to indicate a running order of the computing cores corresponding to the nodes connected by that edge;
    in the case where there is a target processing request, sending data to be processed to the GPU, so that the GPU runs the computing cores in sequence according to the running order represented by the computing core graph, processes the data to be processed, completes an inference process of the target neural network model, and feeds back a model inference result to the CPU, wherein the target processing request is a request to process the data to be processed using the target neural network model;
    receiving the model inference result fed back by the GPU.
  7. The method according to claim 6, wherein the sending a pre-built computing core graph to a GPU comprises:
    sending pre-built computing core graphs to the GPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents an amount of data that the computing cores included in that computing core graph are able to process.
  8. The method according to claim 6 or 7, further comprising, after the sending a pre-built computing core graph to a graphics processing unit (GPU):
    in the case where it is determined that the calculation scale corresponding to the pre-built computing core graph is smaller than a data amount of the data to be processed, sending the data to be processed to the GPU, and sending target computing cores to the GPU in a preset order, so that the GPU runs the target computing cores in sequence in the order in which they are received, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU;
    wherein the calculation scale corresponding to the computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process, the preset order is an execution order of the target computing cores specified by the target neural network model, and the amount of data that the target computing cores are able to process is not smaller than the data amount of the data to be processed;
    receiving the model inference result fed back by the GPU.
  9. A model inference apparatus, applied to a graphics processor GPU, comprising:
    a computing core graph receiving module, configured to receive a computing core graph, corresponding to a target neural network model, sent by a CPU, wherein nodes in the computing core graph respectively correspond to computing cores included in the target neural network model, and the direction of each edge represents the running order of the computing cores corresponding to the nodes connected by that edge;
    a model inference module, configured to, after data to be processed sent by the CPU is received, run the computing cores in sequence according to the running order represented by the computing core graph and process the data to be processed, to complete the inference process of the target neural network model; and
    a result feedback module, configured to feed back a model inference result to the CPU.
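The GPU-side behaviour recited in claim 9 is close to what the CUDA Graph API offers: capture an ordered sequence of kernels once, then replay the whole sequence with a single launch. The sketch below is a minimal illustration under that assumption, not the patented implementation; the kernels coreA/coreB merely stand in for two computing cores, cudaGraphInstantiate is shown with its CUDA 11.x signature, and error handling is omitted.

```cpp
// Minimal CUDA Graph sketch: each captured kernel is one "computing core" node,
// and the capture order on the stream fixes the running order the graph replays.
#include <cuda_runtime.h>

__global__ void coreA(float* buf, int n) {                 // stands in for one computing core
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}
__global__ void coreB(float* buf, int n) {                 // stands in for the next computing core
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const int n = 1024;
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    coreA<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);   // first node
    coreB<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);   // second node, runs after coreA
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);  // CUDA 11.x signature

    cudaGraphLaunch(graphExec, stream);                     // one launch replays both cores in order
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    return 0;
}
```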
  10. The apparatus according to claim 9, wherein the computing core graph receiving module is specifically configured to:
    receive a plurality of computing core graphs, corresponding to the target neural network model, sent by the CPU, wherein different computing core graphs correspond to different computing scales, and the computing scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process;
    and the model inference module is specifically configured to:
    after the data to be processed sent by the CPU is received, select a first computing core graph from the computing core graphs based on a first data amount of the data to be processed, wherein the first computing core graph is the computing core graph whose corresponding computing scale is not less than and closest to the first data amount; and
    run the computing cores in sequence according to the running order represented by the first computing core graph and process the data to be processed, to complete the inference process of the target neural network model.
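One way to read the selection rule in claim 10: among all pre-built graphs, pick the one whose computing scale is not less than the first data amount and closest to it. A minimal sketch, assuming a hypothetical GraphEntry record rather than any API defined by this application:

```cpp
// Pick the smallest pre-built graph that is still large enough for the request.
#include <cstddef>
#include <vector>

struct GraphEntry {
    size_t scale;        // amount of data the graph's computing cores can process
    void*  exec;         // instantiated graph handle, e.g. a cudaGraphExec_t
};

const GraphEntry* selectFirstGraph(const std::vector<GraphEntry>& graphs, size_t firstDataAmount) {
    const GraphEntry* best = nullptr;
    for (const auto& g : graphs) {
        if (g.scale >= firstDataAmount && (best == nullptr || g.scale < best->scale)) {
            best = &g;   // large enough, and closer to the request than the previous pick
        }
    }
    return best;         // nullptr: no graph is large enough (per-core fallback, cf. claims 8/16)
}
```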
  11. The apparatus according to claim 9, wherein the model inference module comprises:
    a data processing submodule, configured to, after a plurality of pieces of data to be processed sent by the CPU are received, merge the plurality of pieces of data to be processed into merged data, call the computing core graph, run the computing cores in sequence according to the running order represented by the computing core graph, and process the merged data, to complete the inference process of the target neural network model, wherein each of the plurality of pieces of data to be processed is data to be processed by the target neural network model;
    and the result feedback module comprises:
    a result feedback submodule, configured to extract, from the model inference result of the merged data, the model inference result corresponding to each piece of data to be processed, and feed back the model inference result corresponding to each piece of data to be processed to the CPU respectively.
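The merge-then-split flow in claim 11 can be sketched on the host side as below. Request, runBatched and the launchGraphOnBatch callback are hypothetical names, and padding every request to a common maxLen is an assumption of this sketch, not something the claim prescribes.

```cpp
// Merge several pending inputs into one padded batch, run one graph launch over it,
// then split the merged result back into per-request results.
#include <cstddef>
#include <cstring>
#include <vector>

struct Request { std::vector<float> input; std::vector<float> output; };

void runBatched(std::vector<Request>& reqs, size_t maxLen,
                void (*launchGraphOnBatch)(float* batch, size_t rows, size_t cols)) {
    std::vector<float> merged(reqs.size() * maxLen, 0.0f);        // pad each request to maxLen
    for (size_t r = 0; r < reqs.size(); ++r) {
        std::memcpy(&merged[r * maxLen], reqs[r].input.data(),
                    reqs[r].input.size() * sizeof(float));        // assumes input.size() <= maxLen
    }

    launchGraphOnBatch(merged.data(), reqs.size(), maxLen);       // one replay serves all requests

    for (size_t r = 0; r < reqs.size(); ++r) {                    // split results per request
        reqs[r].output.assign(merged.begin() + r * maxLen,
                              merged.begin() + (r + 1) * maxLen);
    }
}
```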
  12. The apparatus according to claim 11, wherein the computing core graph receiving module is specifically configured to:
    receive a plurality of computing core graphs, corresponding to the target neural network model, sent by the CPU, wherein different computing core graphs correspond to different computing scales, and the computing scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process;
    and the data processing submodule is specifically configured to:
    after the plurality of pieces of data to be processed sent by the CPU are received, merge the plurality of pieces of data to be processed into merged data, and select a second computing core graph from the computing core graphs based on a second data amount, wherein the second data amount is the product of the maximum data amount among the pieces of data to be processed and the number of pieces of data to be processed, and the second computing core graph is the computing core graph whose corresponding computing scale is greater than or equal to and closest to the second data amount; and
    call the second computing core graph, run the computing cores in sequence according to the running order represented by the second computing core graph, and process the merged data, to complete the inference process of the target neural network model.
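The "second data amount" in claim 12 is simply the product of the largest per-request data amount and the number of pending requests; a worked example with hypothetical figures:

```cpp
// Worked example: requests of sizes 800, 1024 and 512 give 1024 * 3 = 3072.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<size_t> pendingSizes = {800, 1024, 512};                 // per-request data amounts
    size_t maxPerRequest    = *std::max_element(pendingSizes.begin(), pendingSizes.end());
    size_t secondDataAmount = maxPerRequest * pendingSizes.size();       // 1024 * 3 = 3072
    // The second computing core graph is then the pre-built graph whose computing scale is
    // >= secondDataAmount and closest to it (say, one built for 4096 rather than 8192).
    std::printf("second data amount = %zu\n", secondDataAmount);
    return 0;
}
```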
  13. The apparatus according to any one of claims 9-12, wherein the size of the pre-allocated storage space required by the GPU in the process of completing the inference of the target neural network model is not less than the sum of a third data amount, a fourth data amount, and a maximum required storage space;
    wherein the third data amount is the data amount of the target neural network model, the fourth data amount is the sum of the data amounts of the operation results obtained after data processing based on the computing cores, and the maximum required storage space is the maximum storage space required in the process of performing data processing based on the computing cores.
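Claim 13's lower bound on the pre-allocated storage is the sum of three terms: the model's own data amount, the summed sizes of the per-core operation results, and the largest per-core working space. A minimal sketch with hypothetical byte counts:

```cpp
// reserved >= model weights + sum of per-core output sizes + largest per-core workspace
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const size_t modelBytes = 512ull << 20;                               // third data amount
    const std::vector<size_t> coreOutputBytes    = {64ull << 20, 32ull << 20, 16ull << 20};
    const std::vector<size_t> coreWorkspaceBytes = { 8ull << 20, 48ull << 20,  4ull << 20};

    const size_t fourthDataAmount =
        std::accumulate(coreOutputBytes.begin(), coreOutputBytes.end(), size_t{0});
    const size_t maxWorkspace =
        *std::max_element(coreWorkspaceBytes.begin(), coreWorkspaceBytes.end());

    const size_t reserved = modelBytes + fourthDataAmount + maxWorkspace; // lower bound
    std::printf("pre-allocate at least %zu MiB\n", reserved >> 20);
    return 0;
}
```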
  14. A model inference apparatus, applied to a CPU, comprising:
    a computing core graph sending module, configured to send a pre-constructed computing core graph to a graphics processor GPU, wherein nodes in the computing core graph respectively correspond to computing cores included in a target neural network model, and the direction of each edge represents the running order of the computing cores corresponding to the nodes connected by that edge;
    a data sending module, configured to, when there is a target processing request, send data to be processed to the GPU, so that the GPU runs the computing cores in sequence according to the running order represented by the computing core graph, processes the data to be processed based on a preset storage space in the video memory, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU, wherein the target processing request is a request to process the data to be processed by using the target neural network model; and
    a first result receiving module, configured to receive the model inference result fed back by the GPU.
  15. The apparatus according to claim 14, wherein the computing core graph sending module is specifically configured to:
    send a plurality of pre-constructed computing core graphs to the GPU, wherein different computing core graphs correspond to different computing scales, and the computing scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process.
  16. The apparatus according to claim 14 or 15, further comprising:
    a computing core sending module, configured to, when it is determined that the computing scale corresponding to the pre-constructed computing core graph is smaller than the data amount of the data to be processed, send the data to be processed to the GPU and send target computing cores to the GPU in a preset order, so that the GPU runs the target computing cores one by one in the order in which they are received, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU;
    wherein the computing scale corresponding to the computing core graph represents the amount of data that the computing cores included in the computing core graph are able to process, the preset order is an execution order of the target computing cores specified by the target neural network model, and the amount of data that the target computing cores are able to process is not less than the data amount of the data to be processed; and
    a second result receiving module, configured to receive the model inference result fed back by the GPU.
  17. An electronic device, comprising:
    at least one GPU; and
    a memory communicatively connected to the at least one GPU; wherein
    the memory stores instructions executable by the at least one GPU, and the instructions are executed by the at least one GPU to enable the at least one GPU to perform the method according to any one of claims 1-5.
  18. An electronic device, comprising:
    at least one CPU; and
    a memory communicatively connected to the at least one CPU; wherein
    the memory stores instructions executable by the at least one CPU, and the instructions are executed by the at least one CPU to enable the at least one CPU to perform the method according to any one of claims 6-8.
  19. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-5 or 6-8.
  20. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5 or 6-8.
PCT/CN2022/115511 2022-04-26 2022-08-29 Model inference methods and apparatuses, devices, and storage medium WO2023206889A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210450393.0A CN114819084B (en) 2022-04-26 2022-04-26 Model reasoning method, device, equipment and storage medium
CN202210450393.0 2022-04-26

Publications (1)

Publication Number Publication Date
WO2023206889A1 true WO2023206889A1 (en) 2023-11-02

Family

ID=82507217

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115511 WO2023206889A1 (en) 2022-04-26 2022-08-29 Model inference methods and apparatuses, devices, and storage medium

Country Status (2)

Country Link
CN (1) CN114819084B (en)
WO (1) WO2023206889A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819084B (en) * 2022-04-26 2024-03-01 北京百度网讯科技有限公司 Model reasoning method, device, equipment and storage medium
CN115373861B (en) * 2022-10-26 2022-12-27 小米汽车科技有限公司 GPU resource scheduling method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112825154A (en) * 2019-11-20 2021-05-21 阿里巴巴集团控股有限公司 Method and device for optimizing online reasoning in deep learning and computer storage medium
CN111309479B (en) * 2020-02-14 2023-06-06 北京百度网讯科技有限公司 Method, device, equipment and medium for realizing task parallel processing
CN111582459B (en) * 2020-05-18 2023-10-20 Oppo广东移动通信有限公司 Method for executing operation, electronic equipment, device and storage medium
CN111860820A (en) * 2020-07-31 2020-10-30 北京灵汐科技有限公司 Neural network operator dividing method and device and dividing equipment
CN114327844A (en) * 2020-09-29 2022-04-12 华为技术有限公司 Memory allocation method, related device and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108292241A (en) * 2015-10-28 2018-07-17 谷歌有限责任公司 Processing computational graphs
US11176449B1 (en) * 2020-05-15 2021-11-16 Edgecortix Pte. Ltd. Neural network accelerator hardware-specific division of inference into groups of layers
WO2022037490A1 (en) * 2020-08-21 2022-02-24 北京灵汐科技有限公司 Computation method and apparatus for neural network, and computer device and storage medium
CN111899150A (en) * 2020-08-28 2020-11-06 Oppo广东移动通信有限公司 Data processing method and device, electronic equipment and storage medium
CN111814967A (en) * 2020-09-11 2020-10-23 鹏城实验室 Method, apparatus and storage medium for calculating inferential computation of neural network model
CN114819084A (en) * 2022-04-26 2022-07-29 北京百度网讯科技有限公司 Model reasoning method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114819084A (en) 2022-07-29
CN114819084B (en) 2024-03-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22939715

Country of ref document: EP

Kind code of ref document: A1