WO2023206889A1 - Model inference methods and apparatuses, devices, and storage medium


Info

Publication number
WO2023206889A1
Authority
WO
WIPO (PCT)
Prior art keywords
data, computing core, computing, processed, graph
Application number
PCT/CN2022/115511
Other languages
French (fr)
Chinese (zh)
Inventor
潘能超
王桂彬
董昊
王知践
Original Assignee
北京百度网讯科技有限公司 (Beijing Baidu Netcom Science Technology Co., Ltd.)
Application filed by 北京百度网讯科技有限公司 (Beijing Baidu Netcom Science Technology Co., Ltd.)
Publication of WO2023206889A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the field of data processing technology, especially to the field of artificial intelligence technology, and further relates to model inference methods, devices, equipment and storage media.
  • the model inference process of the neural network model can be composed of multiple different data processing links. Running different computing cores in the neural network model in sequence can complete different data processing links, thereby realizing the model inference process.
  • the present disclosure provides a model inference method, apparatus, device, and storage medium.
  • a model inference method is provided, applied to a GPU, including:
  • each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge.
  • a model inference method applied to a CPU, including:
  • each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by the edge.
  • when there is a target processing request, the data to be processed is sent to the GPU, so that the GPU sequentially runs each computing core according to the running order represented by the computing core graph, processes the data to be processed, completes the inference process of the target neural network model, and feeds back the model inference result to the CPU, wherein the target processing request is a request to use the target neural network model to process the data to be processed;
  • a model inference device applied to a GPU, including:
  • the computing core graph receiving module is used to receive the computing core graph corresponding to the target neural network model sent by the CPU, wherein each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by the edge;
  • the model inference module is used to, after receiving the data to be processed sent by the CPU, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the inference process of the target neural network model;
  • a result feedback module is used to feed back model inference results to the CPU.
  • a model inference device applied to a CPU, including:
  • the computing core graph sending module is used to send the pre-constructed computing core graph to the graphics processor GPU, wherein each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
  • a data sending module, configured to send the data to be processed to the GPU when there is a target processing request, so that the GPU runs each computing core in sequence according to the running order represented by the computing core graph, processes the data to be processed based on the preset storage space in the video memory, completes the inference process of the target neural network model, and feeds back the model inference result to the CPU, wherein the target processing request is a request to use the target neural network model to process the data to be processed;
  • the first result receiving module is used to receive the model inference result fed back by the GPU.
  • an electronic device including:
  • at least one GPU; and a memory communicatively connected to the at least one GPU, wherein:
  • the memory stores instructions that can be executed by the at least one GPU, and the instructions are executed by the at least one GPU to enable the at least one GPU to execute any one of the above model inference methods applied to a GPU.
  • an electronic device including:
  • at least one CPU; and a memory communicatively connected to the at least one CPU, wherein the memory stores instructions that can be executed by the at least one CPU, and the instructions are executed by the at least one CPU to enable the at least one CPU to execute any one of the above model inference methods applied to a CPU.
  • a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause the computer to execute any one of the above model inference methods applied to a GPU or any one of the above model inference methods applied to a CPU.
  • a computer program product is provided, including a computer program that, when executed by a processor, implements any one of the above model inference methods applied to a GPU or any one of the above model inference methods applied to a CPU.
  • the GPU obtains in advance the computing core diagram corresponding to the target neural network model sent by the CPU.
  • the above computing core graph contains each computing core included in the target neural network model and can represent the running order of each computing core in the target neural network model.
  • after the GPU receives the data to be processed sent by the CPU, it can call the above computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process.
  • compared with a solution in which the CPU sends each computing core to the GPU one by one so that the GPU can run it, in the embodiments of the present disclosure the GPU can directly perform model inference based on the computing core graph, and the computing cores do not need to be passed between the CPU and the GPU during inference.
  • the number of interactions between the CPU and the GPU is therefore smaller, which can reduce the impact of the interaction between the CPU and the GPU on GPU model inference, thereby improving the efficiency of GPU model inference.
  • Figure 1 is a schematic flow chart of a first model inference method provided by an embodiment of the present disclosure.
  • Figure 2 is a schematic structural diagram of a computing core graph provided by an embodiment of the present disclosure.
  • Figure 3A is a schematic flow chart of a second model inference method provided by an embodiment of the present disclosure.
  • Figure 3B is a schematic diagram of a first computing core graph selection process provided by an embodiment of the present disclosure.
  • Figure 4 is a schematic flow chart of a third model inference method provided by an embodiment of the present disclosure.
  • Figure 5A is a schematic flow chart of a fourth model inference method provided by an embodiment of the present disclosure.
  • Figure 5B is a schematic diagram of a second computing core graph selection process provided by an embodiment of the present disclosure.
  • Figure 6A is a schematic flow chart of a fifth model inference method provided by an embodiment of the present disclosure.
  • Figure 6B is a schematic flow chart of a sixth model inference method provided by an embodiment of the present disclosure.
  • Figure 7 is a schematic structural diagram of a first model inference device provided by an embodiment of the present disclosure.
  • Figure 8 is a schematic structural diagram of a second model inference device provided by an embodiment of the present disclosure.
  • Figure 9 is a schematic structural diagram of a third model inference device provided by an embodiment of the present disclosure.
  • Figure 10 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
  • Figure 11 is a schematic block diagram of another electronic device provided by an embodiment of the present disclosure.
  • Embodiments of the present disclosure are applied to application scenarios in which a CPU and a GPU collaborate to perform model inference. Since the GPU processes data such as images, videos, 3D graphics, and audio very quickly, services such as image recognition, voice interaction, and image retrieval can be efficiently completed through the GPU.
  • the GPU can complete the above services through model inference based on a neural network model, and the CPU can send the computing cores included in the neural network model to the GPU, so that the GPU runs each computing core to complete the model inference process.
  • the above-mentioned CPU and GPU can run in the same electronic device, and the above-mentioned electronic device can be a computer, a mobile phone, a server, etc.
  • Electronic devices equipped with the above-mentioned CPU and GPU can receive data processing requests sent by other devices.
  • the above-mentioned data processing requests contain the data to be processed, and request the CPU and GPU to complete the model inference process on that data.
  • referring to Figure 1, which is a schematic flow chart of a first model inference method provided by an embodiment of the present disclosure, the method is applied to a GPU.
  • the above method includes the following steps S101-S103.
  • S101: receive the computing core graph corresponding to the target neural network model sent by the CPU. Each node in the above-mentioned computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by the edge.
  • the GPU can store the above-mentioned operation core map.
  • Different computing cores correspond to different data processing links, and the GPU can implement different data processing links based on different computing cores.
  • the above-mentioned data processing links may include matrix multiplication calculations, data activation processing, data division calculations, etc.
  • the structure of the target neural network model is relatively fixed, that is, the execution order of each data processing link in the data processing process through the target neural network model is relatively fixed, and the running order of each computing core in the target neural network model is relatively fixed, so the computing core graph of the target neural network model can be pre-constructed.
  • the above-mentioned computing core graph can be constructed through the API (Application Programming Interface) of CUDA (Compute Unified Device Architecture); a computing core graph constructed in this way can be called a CUDA Graph.
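  • For illustration only (not taken from the disclosure), the following is a minimal sketch of building and replaying such a computing core graph with the CUDA graph API via stream capture; op_matmul and op_add are hypothetical placeholder kernels, and the cudaGraphInstantiate call uses the CUDA 11.x signature.
```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical placeholder computing cores; the disclosure does not specify
// the actual kernels of the target neural network model.
__global__ void op_matmul(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i];   // simplified element-wise stand-in
}

__global__ void op_add(const float* a, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] += a[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *out;
    cudaMalloc((void**)&a, n * sizeof(float));
    cudaMalloc((void**)&b, n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the whole sequence of kernel launches once into a CUDA graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    op_matmul<<<(n + 255) / 256, 256, 0, stream>>>(a, b, out, n);
    op_add<<<(n + 255) / 256, 256, 0, stream>>>(a, out, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate the captured graph into an executable graph (CUDA 11.x signature).
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // Each inference request now replays every computing core with one launch
    // call instead of one CPU-to-GPU interaction per computing core.
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(a); cudaFree(b); cudaFree(out);
    printf("graph replayed\n");
    return 0;
}
```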
  • FIG. 2 is a schematic structural diagram of a computing core graph provided by an embodiment of the present disclosure.
  • the computing core diagram of the target neural network model shown in Figure 2 contains 4 nodes in total, nodes 1-4, which correspond to computing cores 1-4 respectively.
  • the arrows between the nodes indicate the running order of the computing cores corresponding to the connected nodes: the computing core corresponding to the node at the tail of an arrow is executed first, and then the computing core corresponding to the node at the head of the arrow is executed.
  • Operation cores 1-4 are used for matrix multiplication calculations, matrix addition calculations, matrix multiplication calculations, and convolution processing respectively.
  • the computing core graph shown in Figure 2 indicates that the target neural network model first performs a matrix multiplication calculation on the input data, then performs a matrix addition calculation and a matrix multiplication calculation respectively, and finally performs convolution on the results of the matrix addition calculation and the matrix multiplication calculation.
  • S102 After receiving the data to be processed sent by the CPU, run each computing core in sequence according to the running order represented by the computing core diagram, process the data to be processed, and complete the inference process of the target neural network model.
  • after the above-mentioned GPU receives the above-mentioned data to be processed, it can use the pre-allocated storage space to complete the model inference process.
  • the address of the above-mentioned pre-allocated storage space is a fixed address corresponding to the above-mentioned target neural network model
  • the size of the pre-allocated storage space is a preset size corresponding to the above-mentioned target neural network model.
  • the size of the above-mentioned pre-allocated storage space may be set by the user based on experience, or may not be less than the sum of the third data amount, the fourth data amount, and the maximum required storage space.
  • the above-mentioned third data amount is the data amount of the above-mentioned target neural network model; specifically, it can be the data amount of the model parameters of the target neural network model. The above-mentioned fourth data amount is the sum of the data amounts of the operation results obtained after data processing based on each computing core, and the above-mentioned maximum required storage space is greater than or equal to the largest storage space required by the GPU for data processing based on any one of the computing cores.
  • the data amount of the operation result obtained by data processing based on each computing core in the computing core graph, and the amount of temporary storage space required by the GPU for data processing based on each computing core, can be estimated manually or by a pre-written estimation program.
  • the above-mentioned pre-allocated storage space needs to be able to accommodate the operation results of each computing core, that is, the size of the pre-allocated storage space needs to be greater than or equal to the sum of the data amounts of the operation results of each computing core, that is, not less than the above-mentioned fourth data amount.
  • the temporary storage space corresponding to each computing core is used to store the calculation intermediate values generated during the data processing process of the GPU based on the computing core.
  • after the GPU completes data processing based on one computing core, the calculation intermediate values stored in the temporary storage space are released, so the GPU can reuse the same temporary storage space during data processing based on different computing cores.
  • the above temporary storage space needs to be able to accommodate the calculation intermediate values generated during data processing based on any of the computing cores; a temporary storage space that meets this requirement can be called the maximum required storage space, and the size of the above pre-allocated storage space needs to be greater than or equal to the maximum required storage space.
  • the above-mentioned pre-allocated storage space also needs to be able to store the target neural network model.
  • the size of the above-mentioned pre-allocated storage space is greater than or equal to the sum of the third data amount, the fourth data amount and the maximum required storage space, so that the GPU can, based on the pre-allocated storage space, normally store the target neural network model, the calculation intermediate values generated when each computing core performs data processing, and the operation results obtained after data processing based on each computing core, and thus complete the model inference process normally.
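  • For illustration, a minimal sketch of sizing and allocating the pre-allocated storage space as described above: model parameters (third data amount) plus the sum of per-core operation results (fourth data amount) plus the largest per-core temporary workspace (maximum required storage space), allocated once and sub-allocated at fixed offsets. The byte counts are made-up placeholders.
```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <numeric>
#include <vector>

// Illustrative sizing of the pre-allocated storage space:
//   third data amount   = bytes of the model parameters
//   fourth data amount  = sum of the operation-result bytes of all computing cores
//   max required space  = largest temporary workspace any single computing core needs
size_t preallocated_bytes(size_t model_bytes,
                          const std::vector<size_t>& output_bytes,
                          const std::vector<size_t>& workspace_bytes) {
    size_t outputs = std::accumulate(output_bytes.begin(), output_bytes.end(), size_t{0});
    size_t max_ws  = *std::max_element(workspace_bytes.begin(), workspace_bytes.end());
    return model_bytes + outputs + max_ws;
}

int main() {
    // Made-up per-core estimates (e.g. produced by a pre-written estimation program).
    std::vector<size_t> output_bytes    = {4 << 20, 2 << 20, 2 << 20, 8 << 20};
    std::vector<size_t> workspace_bytes = {1 << 20, 3 << 20, 1 << 20, 2 << 20};
    size_t total = preallocated_bytes(64 << 20, output_bytes, workspace_bytes);

    // One fixed allocation reused for every inference; the temporary-workspace
    // region can be shared by all computing cores because they run in sequence.
    void* arena = nullptr;
    cudaMalloc(&arena, total);
    // ... sub-allocate model parameters, per-core operation results, and the
    //     shared workspace at fixed offsets inside `arena` ...
    cudaFree(arena);
    return 0;
}
```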
  • the GPU obtains in advance the computing core diagram corresponding to the target neural network model sent by the CPU.
  • the above computing core graph contains each computing core included in the target neural network model and can represent the running order of each computing core in the target neural network model.
  • after the GPU receives the data to be processed sent by the CPU, it can call the above computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process.
  • compared with a solution in which the CPU sends each computing core to the GPU one by one so that the GPU can run it, in the embodiments of the present disclosure the GPU can directly perform model inference based on the computing core graph, and the computing cores do not need to be passed between the CPU and the GPU during inference.
  • the number of interactions between the CPU and the GPU is therefore smaller, which can reduce the impact of the interaction between the CPU and the GPU on GPU model inference, thereby improving the efficiency of GPU model inference.
  • step S101 can be implemented through the following step S101A, and the above-mentioned step S102 can be implemented through steps S102A-S102B.
  • the calculation scale corresponding to each computing core graph indicates the amount of data that the computing core included in the computing core graph can process.
  • the amount of data that the GPU can process based on the computing core is a fixed data amount, and the above data amount can be called the computing scale corresponding to the computing core.
  • Each computing core can be set to support mask operations. When a computing core processes data whose amount is smaller than its own calculation scale, the data to be processed can be expanded to the corresponding calculation scale before processing, so that the GPU can use the computing core to process data whose data volume is less than or equal to the calculation scale corresponding to the computing core.
  • for example, if a computing core corresponds to a calculation scale of a 50×50 matrix and the GPU needs to process a 30×30 matrix based on this computing core, elements can be added to expand the matrix to 50×50 before processing, and the added elements are removed after the processing result is obtained.
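  • For illustration, a minimal sketch of the 30×30-to-50×50 padding idea using the CUDA runtime: the payload is copied into a zero-filled buffer of the computing core's fixed calculation scale, and only the valid region is read back afterwards. Buffer names are illustrative.
```cuda
#include <cuda_runtime.h>
#include <vector>

// Pad a 30x30 host matrix into a zero-filled 50x50 device buffer so that a
// computing core built for a fixed 50x50 calculation scale can process it;
// afterwards only the valid 30x30 block of the result is read back.
int main() {
    const int small = 30, scale = 50;
    std::vector<float> host(small * small, 1.0f);

    float* device = nullptr;
    cudaMalloc((void**)&device, scale * scale * sizeof(float));
    cudaMemset(device, 0, scale * scale * sizeof(float));      // padding elements

    // Copy the 30x30 payload row by row into the 50x50 layout.
    cudaMemcpy2D(device, scale * sizeof(float),                 // dst, dst pitch
                 host.data(), small * sizeof(float),            // src, src pitch
                 small * sizeof(float), small,                  // row bytes, rows
                 cudaMemcpyHostToDevice);

    // ... launch the fixed-scale computing core on `device` here ...

    // Read back only the valid region and discard the added (padding) elements.
    std::vector<float> result(small * small);
    cudaMemcpy2D(result.data(), small * sizeof(float),
                 device, scale * sizeof(float),
                 small * sizeof(float), small,
                 cudaMemcpyDeviceToHost);

    cudaFree(device);
    return 0;
}
```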
  • the calculation scales corresponding to the computing cores included in the above computing core graph can be the same or different. However, in order to enable the computing cores included in the computing core graph to process data uniformly, computing cores with the same calculation scale can be selected, and this common calculation scale can be used as the calculation scale corresponding to the computing core graph.
  • the calculation scale corresponding to the above-mentioned computing core graph can be set based on the data volume of historical data in the application scenario of the above-mentioned target model.
  • the calculation scale corresponding to the computing core graph can be set to be greater than or equal to the maximum data volume of each historical data in the application scenario, so that the GPU can theoretically process all the data to be processed in the above application scenario based on the above computing core graph.
  • the maximum value of the data volume of the historical data may be multiplied by a first preset ratio to serve as the calculation scale corresponding to the computing core graph.
  • the above-mentioned first preset ratio may be 70%, 80%, etc., so that the GPU can process most of the data to be processed in the above application scenario based on the above computing core graph.
  • S101A Receive multiple computing core maps corresponding to the target neural network model sent by the CPU.
  • the computing cores recorded in different computing core maps are all computing cores included in the target model.
  • the structures of different computing core maps are the same, and the only difference is that the calculation scales corresponding to different computing core maps are different.
  • the GPU can store the multiple received computing core graphs in the storage space, so that the stored computing core graphs can be directly called for model inference later.
  • the pre-allocated storage space can be a reusable storage space; that is, no matter which computing core graph the CPU chooses to send to the GPU, the GPU can use the same pre-allocated storage space in the process of model inference based on the received computing core graph. The larger the calculation scale corresponding to the computing core graph used by the GPU, the larger the amount of data processed during model inference and the greater the storage space required. Therefore, if the pre-allocated storage space can meet the data storage requirements of the computing core graph with the largest calculation scale, it can be reused for the other computing core graphs.
  • the size of the pre-allocated storage space can be determined based on the corresponding computing core graph with the largest calculation scale.
  • for a specific method of determining the size of the pre-allocated storage space, please refer to the previous description in step S102, which will not be repeated here.
  • S102A Based on the first data amount of the data to be processed, select a first computing core graph from each computing core graph.
  • the above-mentioned first computing core graph is a computing core graph whose corresponding calculation scale is not less than the above-mentioned first data amount and is closest to the above-mentioned first data amount.
  • the GPU can perform model inference on data whose data volume is less than or equal to the calculation scale corresponding to the operation core graph based on the operation core graph.
  • the GPU can expand the data volume of the data to be processed to the calculation scale corresponding to the computing core graph and then proceed with processing. Therefore, the larger the calculation scale corresponding to the computing core graph, the greater the amount of data actually processed during model inference based on the computing core graph, and the more data processing resources are consumed.
  • multiple computing core graphs corresponding to different calculation scales are pre-constructed.
  • the CPU sends each computing core graph to the GPU in advance, and the GPU can complete the model inference process based on any one of the computing core graphs.
  • the GPU can select, from the multiple computing core graphs, the computing core graph whose corresponding calculation scale is greater than or equal to the first data amount and is closest to the first data amount, so that the GPU can process the data to be processed based on the selected computing core graph while the amount of data that actually needs to be processed is minimal.
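  • For illustration (these helper names are not part of the disclosure), a minimal sketch of the selection rule in step S102A: among graphs pre-instantiated at different calculation scales, pick the one whose scale is not less than and closest to the first data amount. The ScaledGraph type and select_first_graph function are hypothetical.
```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// One pre-instantiated computing core graph per calculation scale (hypothetical type).
struct ScaledGraph {
    size_t scale;            // largest first data amount this graph can handle
    cudaGraphExec_t exec;    // executable graph instantiated in advance
};

// Pick the graph whose calculation scale is >= the first data amount and closest
// to it; `graphs` is assumed sorted by ascending scale. Returns nullptr when the
// data amount exceeds every scale (the CPU-side fallback case described later).
const ScaledGraph* select_first_graph(const std::vector<ScaledGraph>& graphs,
                                      size_t first_data_amount) {
    for (const ScaledGraph& g : graphs) {
        if (g.scale >= first_data_amount) return &g;
    }
    return nullptr;
}

int main() {
    // Placeholder executables; scales follow the 48M/64M/80M example in the text.
    std::vector<ScaledGraph> graphs = {
        {48u << 20, nullptr}, {64u << 20, nullptr}, {80u << 20, nullptr}};
    const ScaledGraph* g = select_first_graph(graphs, 50u << 20);  // 50M of pending data
    // g now refers to the 64M graph; it would be run with cudaGraphLaunch(g->exec, stream).
    (void)g;
    return 0;
}
```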
  • the calculation scale corresponding to each computing core graph can be any value.
  • the calculation scale corresponding to each computing core graph can be set based on the maximum data volume of each historical data in the application scenario of the target model.
  • the maximum value of the data volume of the historical data in the application scenario of the target model can be determined, this maximum value can be multiplied by several different second preset ratios, and the obtained results can be used as the calculation scales corresponding to the respective computing core graphs.
  • for example, if the maximum data volume of the historical data in the application scenario of the target model is 100M, the calculation scales corresponding to the computing core graphs can be set to 80M, 64M, and 48M.
  • S102B Run each computing core in sequence according to the running order represented by the first computing core diagram to process the data to be processed.
  • step S102B is similar to the aforementioned step S102, and will not be described again here.
  • in this way, multiple computing core graphs are pre-constructed, and the GPU selects the computing core graph whose corresponding calculation scale is greater than or equal to the first data amount and is closest to the first data amount to perform model inference, so that the GPU can process the data to be processed based on the selected computing core graph while the amount of data processed during the processing is minimal, thereby saving the data processing resources of the GPU.
  • FIG. 3B is a schematic diagram of the first computing core map selection process provided by an embodiment of the present disclosure.
  • It includes an input module and computing core graph 1, computing core graph 2, ..., computing core graph n, a total of n computing core graphs.
  • the arrows between the input module and each computing core graph indicate that the GPU can, based on the first data amount of the input data to be processed, select one graph from computing core graph 1, computing core graph 2, ..., computing core graph n, and use the selected computing core graph for model inference.
  • the arrows between each computing core graph and the pre-allocated storage space indicate that the GPU shares the same pre-allocated storage space during model inference based on each computing core graph.
  • FIG 4 is a schematic flow chart of the third model reasoning method provided by the embodiment of the present disclosure.
  • the above step S102 can be implemented through the following step S102C, and the above step S103 can be implemented through the following step S103A.
  • S102C: after receiving multiple pieces of data to be processed sent by the above-mentioned CPU, merge the multiple pieces of data to be processed into merged data, call the above-mentioned computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, and process the merged data to complete the inference process of the above target neural network model.
  • the plurality of data to be processed are all data to be processed by the target neural network model. Therefore, the same computing core graph based on the target neural network model can process multiple data to be processed.
  • when the CPU receives multiple data processing requests, if multiple of these requests request data processing through the target neural network model, the data to be processed contained in those data processing requests can be sent to the GPU together, so that the GPU receives the above multiple pieces of data to be processed.
  • the GPU can uniformly expand the data volume of each data to be processed to the maximum data volume of each data to be processed, and then merge the data to be processed into one piece of merged data.
  • in this way, the GPU only needs to call the computing core graph once and process the merged data based on the computing core graph, which is equivalent to completing the processing of the multiple pieces of data to be processed.
  • the process of the GPU processing the merged data is similar to the content described in the previous step S102, and the way the GPU expands the data volume of the data to be processed is similar to the content described in the previous step S102A, which will not be described in detail in this embodiment.
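  • For illustration, a minimal sketch of the merging step under the assumption that each piece of pending data is a flat float array: every piece is padded to the largest per-piece size and packed into one device buffer so the computing core graph only needs to be launched once. All buffer names and sizes are illustrative.
```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Merge several pieces of pending data into one buffer: every piece is padded to
// the largest per-piece data amount, so the merged buffer has a regular layout and
// the computing core graph only needs to be called once for all of them.
int main() {
    std::vector<std::vector<float>> pending = {
        std::vector<float>(300, 1.0f),   // data to be processed 1
        std::vector<float>(500, 2.0f),   // data to be processed 2
        std::vector<float>(420, 3.0f),   // data to be processed m
    };

    size_t max_elems = 0;
    for (const auto& d : pending) max_elems = std::max(max_elems, d.size());

    // Merged size = maximum per-piece data amount * number of pieces.
    size_t merged_elems = max_elems * pending.size();

    float* merged = nullptr;
    cudaMalloc((void**)&merged, merged_elems * sizeof(float));
    cudaMemset(merged, 0, merged_elems * sizeof(float));        // padding

    for (size_t i = 0; i < pending.size(); ++i) {
        cudaMemcpy(merged + i * max_elems, pending[i].data(),
                   pending[i].size() * sizeof(float), cudaMemcpyHostToDevice);
    }

    // ... call the computing core graph once on `merged`, then copy each result
    //     slice back and drop the padded tail beyond pending[i].size() elements ...

    cudaFree(merged);
    return 0;
}
```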
  • S103A Extract the model inference results corresponding to each data to be processed from the model inference results of the merged data, and feed back the model inference results corresponding to each data to be processed to the CPU respectively.
  • the processing results corresponding to each piece of data to be processed can be extracted from the model inference result of the merged data, and the processing results corresponding to the expanded (padded) data can then be removed, so as to obtain the model inference result corresponding to each piece of data to be processed.
  • in this way, the GPU can merge the pieces of data to be processed into one merged data and then call the computing core graph to process the merged data, which is equivalent to processing each piece of data to be processed in a unified manner.
  • the GPU only needs to call the computing core graph once to complete the processing of multiple pieces of data to be processed, so the computing core graph is called fewer times, which can further improve the model inference efficiency of the GPU.
  • FIG 5A is a schematic flow chart of the fourth model reasoning method provided by an embodiment of the present disclosure.
  • the above step S101 can be implemented through the following steps S101B, and the above step S102C can be achieved through the following steps S102C1-S102C2.
  • S101B Receive multiple computing core maps corresponding to the target neural network model sent by the CPU.
  • the computing scale corresponding to each computing core graph represents the amount of data that the computing core included in the computing core graph can process.
  • step S101B is similar to the above-mentioned step S101A, which will not be described again in this embodiment.
  • S102C1 Based on the second data amount, select a second computing core graph from each computing core graph.
  • the above-mentioned second data amount is: the product of the maximum data amount of each data to be processed and the number of data to be processed
  • the above-mentioned second computing core graph is the computing core graph whose corresponding calculation scale is greater than or equal to and closest to the above-mentioned second data amount.
  • the data volume of each piece of data to be processed is less than or equal to the maximum data volume among the pieces of data to be processed, so after merging the pieces of data to be processed to obtain the merged data, the data volume of the merged data will not be greater than the product of the maximum data volume and the number of pieces of data to be processed.
  • the calculation scale corresponding to the selected second computing core graph is greater than or equal to the second data amount, so the GPU can process the merged data based on the selected second computing core graph; and because the calculation scale corresponding to the selected second computing core graph is closest to the second data amount, the GPU consumes the least computing resources overall when processing the merged data based on the selected second computing core graph.
  • S102C2 Call the above-mentioned second computing core graph, run each computing core in sequence according to the running order represented by the above-mentioned second computing core graph, and process the above-mentioned merged data.
  • the method of processing the merged data is similar to the content described in the aforementioned step S102, and will not be described again in this embodiment.
  • in this way, multiple computing core graphs are pre-constructed, and the GPU selects the second computing core graph whose corresponding calculation scale is greater than or equal to the second data amount and is closest to the second data amount, so that the GPU can process the merged data based on the second computing core graph while the amount of data processed during the processing is minimal, thereby saving the data processing resources of the GPU.
  • FIG. 5B is a schematic diagram of the second computing core map selection process provided by an embodiment of the present disclosure.
  • Figure 5B also contains data to be processed 1, data to be processed 2, ..., data to be processed m, a total of m pieces of data to be processed, all of which are to be processed through the target neural network model.
  • there are arrows between each piece of data to be processed and the input, indicating that the GPU can process the multiple pieces of data to be processed in a unified manner.
  • referring to FIG. 6A, which is a schematic flow chart of a fifth model inference method provided by an embodiment of the present disclosure, the method is applied to a CPU.
  • the above method includes the following steps S601-S603.
  • S601 Send the pre-built computing kernel graph to the GPU.
  • Each node in the above-mentioned computing core graph corresponds to each computing core included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing core corresponding to the node connected by the edge.
  • S602: when there is a target processing request, send the data to be processed to the above-mentioned GPU, so that the above-mentioned GPU sequentially runs each computing core according to the running order represented by the above-mentioned computing core graph, processes the above-mentioned data to be processed, completes the inference process of the above target neural network model, and feeds back the model inference result to the CPU.
  • the above target processing request is a request to process the data to be processed using the above target neural network model.
  • the above-mentioned steps S601-S603 are similar to the above-mentioned steps S101-S103, and the difference is only that the execution subject is different, which will not be described again here.
  • the CPU sends the computing core graph to the GPU, and the GPU can run each computing core in sequence according to the above-mentioned computing core graph to process the data to be processed, thereby completing the model inference process of the target neural network model.
  • the CPU only needs to send the computing core map to the GPU once, so that the GPU can subsequently complete the model inference process based on the received computing core map.
  • the number of interactions between the CPU and the GPU in this embodiment is smaller, which can reduce the impact of the interaction between the CPU and the GPU on GPU model inference, thereby improving the efficiency of GPU model inference.
  • the embodiment of the present disclosure can implement the above step S601 through the following step A.
  • Step A: send each pre-constructed computing core graph to the GPU.
  • the calculation scale corresponding to each computing core map indicates: the amount of data that the computing cores included in the computing core map can process.
  • after the GPU receives the multiple computing core graphs sent by the CPU, it can process the data to be processed according to the aforementioned steps S102A-S102B, which will not be described again here.
  • in this solution, multiple computing core graphs are pre-constructed, and the GPU can, based on the data amount of the data that actually needs to be processed, select from the computing core graphs the one whose calculation scale is not less than and closest to that data amount for data processing, thereby saving the data processing resources of the GPU.
  • FIG. 6B a schematic flow chart of a sixth model inference method provided by an embodiment of the present disclosure is shown. Compared with the aforementioned embodiment shown in Figure 6A, the following steps S604-S605 are included after the above-mentioned step S601.
  • the calculation scale corresponding to the above-mentioned computing core graph represents the amount of data that can be processed by the computing cores included in the computing core graph.
  • the above-mentioned preset order is the execution order of each target computing core specified by the above-mentioned target neural network model.
  • the amount of data that the above-mentioned target computing cores can process is not less than the data amount of the data to be processed.
  • if the calculation scale corresponding to the above computing core graph is not less than the data amount of the data to be processed, the GPU can process the data to be processed based on the above computing core graph; otherwise, the GPU cannot process the data to be processed based on the above computing core graph.
  • in the case where the GPU cannot process the data to be processed based on the computing core graph, the GPU can send a request to the CPU to indicate that it cannot process the data to be processed based on the computing core graph, so as to request the CPU to assist in completing the data processing process in another way.
  • the CPU can determine that the calculation scale corresponding to the pre-built computing core graph is smaller than the data volume of the above-mentioned data to be processed, and then steps S604-S605 can be executed.
  • each target computing core corresponds to a different data processing link in the target neural network model.
  • the GPU sequentially runs each target computing core to complete each data processing link in the target neural network model, thereby realizing the model inference process of the target neural network model.
  • in one embodiment of the present disclosure, the preset order in which the CPU sends the target computing cores to the GPU is the same as the running order of the computing cores represented by the computing core graph.
  • the data processing links corresponding to the target computing cores are the same as those corresponding to the computing cores included in the aforementioned computing core graph; the only difference is the amount of data that can be processed, and the amount of data that the target computing cores can process is larger.
  • the GPU can run the target computing core to complete data processing.
  • the CPU sends each target computing core to the GPU in the preset order, and the GPU can run each target computing core in sequence in the order in which they are received, to complete the inference process of the target neural network model.
  • in this way, the calculation scale corresponding to the computing core graph constructed by the embodiment of the present disclosure does not need to be too large.
  • even if the calculation scale corresponding to the computing core graph is smaller than the amount of data to be processed, so that the GPU cannot complete the inference of the target neural network model based on the computing core graph, this embodiment is not limited to realizing model inference only based on the computing core graph; instead, the CPU can send each target computing core to the GPU in sequence to ensure that the model inference process can still be carried out normally.
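  • For illustration, a minimal sketch of the fallback logic described above, assuming one pre-instantiated graph with a known calculation scale and two hypothetical target computing cores: when the data amount fits the graph's scale the graph is replayed with a single launch, otherwise the cores are launched one by one in the preset order.
```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical stand-ins for two target computing cores that can be launched
// individually and handle arbitrarily large inputs.
__global__ void target_core_1(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void target_core_2(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Use the pre-built graph when the data fits its calculation scale; otherwise
// fall back to submitting the target computing cores one by one in the preset order.
void run_inference(cudaGraphExec_t graphExec, size_t graph_scale,
                   float* data, size_t n, cudaStream_t stream) {
    if (graphExec != nullptr && n <= graph_scale) {
        cudaGraphLaunch(graphExec, stream);                      // one launch replays all cores
    } else {
        const int block = 256;
        const int grid  = (int)((n + block - 1) / block);
        target_core_1<<<grid, block, 0, stream>>>(data, n);      // preset order: core 1 first
        target_core_2<<<grid, block, 0, stream>>>(data, n);      // then core 2
    }
    cudaStreamSynchronize(stream);
}

int main() {
    const size_t n = 1 << 22;                                    // larger than the scale below
    float* data = nullptr;
    cudaMalloc((void**)&data, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    run_inference(/*graphExec=*/nullptr, /*graph_scale=*/1 << 20, data, n, stream);

    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```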
  • embodiments of the present disclosure also provide a model reasoning device.
  • referring to FIG. 7, which is a schematic structural diagram of a first model inference device provided by an embodiment of the present disclosure, the device is applied to a GPU.
  • the above device includes the following modules 701-703.
  • the computing core graph receiving module 701 is used to receive the computing core graph corresponding to the target neural network model sent by the CPU, wherein each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by the edge;
  • the model inference module 702 is configured to, after receiving the data to be processed sent by the CPU, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the inference process of the target neural network model;
  • Result feedback module 703 is used to feed back model inference results to the CPU.
  • the GPU obtains in advance the computing core diagram corresponding to the target neural network model sent by the CPU.
  • the above computing core graph contains each computing core included in the target neural network model and can represent the running order of each computing core in the target neural network model.
  • after the GPU receives the data to be processed sent by the CPU, it can call the above computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process.
  • compared with a solution in which the CPU sends each computing core to the GPU one by one so that the GPU can run it, in the embodiments of the present disclosure the GPU can directly perform model inference based on the computing core graph, and the computing cores do not need to be passed between the CPU and the GPU during inference.
  • the number of interactions between the CPU and the GPU is therefore smaller, which can reduce the impact of the interaction between the CPU and the GPU on GPU model inference, thereby improving the efficiency of GPU model inference.
  • the above-mentioned computing core graph receiving module 701 is specifically used for:
  • the model reasoning module 702 is specifically used for:
  • based on the first data amount of the data to be processed, a first computing core graph is selected from each computing core graph, wherein the first computing core graph is the computing core graph whose corresponding calculation scale is not less than the above-mentioned first data amount and is closest to the first data amount;
  • Run each computing core sequentially according to the running order represented by the first computing core diagram, process the data to be processed, and complete the reasoning process of the target neural network model.
  • in this way, multiple computing core graphs are pre-constructed, and the GPU selects the computing core graph whose corresponding calculation scale is greater than or equal to the first data amount and is closest to the first data amount to perform model inference, so that the GPU can process the data to be processed based on the selected computing core graph while the amount of data processed during the processing is minimal, thereby saving the data processing resources of the GPU.
  • FIG 8 is a schematic structural diagram of a second model inference device provided by an embodiment of the present disclosure.
  • the above model inference module 702 includes:
  • the data processing sub-module 702A is configured to, after receiving multiple pieces of data to be processed sent by the CPU, merge the multiple pieces of data to be processed into merged data, call the computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, and process the merged data to complete the inference process of the target neural network model, wherein the plurality of data to be processed are all data to be processed by the target neural network model;
  • the result feedback module 703 includes:
  • the result feedback sub-module 703A is configured to extract the model inference results corresponding to each data to be processed from the model inference results of the merged data, and feed back the model inference results corresponding to each data to be processed to the CPU respectively.
  • in this way, the GPU can merge the pieces of data to be processed into one merged data and then call the computing core graph to process the merged data, which is equivalent to processing each piece of data to be processed in a unified manner.
  • the GPU only needs to call the computing core graph once to complete the processing of multiple pieces of data to be processed, so the computing core graph is called fewer times, which can further improve the model inference efficiency of the GPU.
  • the computing core graph receiving module 701 is specifically used to:
  • the data processing sub-module 702A is specifically used for:
  • after receiving the plurality of data to be processed sent by the CPU, the plurality of data to be processed are merged into merged data, and based on the second data amount, a second computing core graph is selected from each computing core graph, wherein the second data amount is the product of the maximum data amount of the pieces of data to be processed and the number of pieces of data to be processed, and the second computing core graph is the computing core graph whose corresponding calculation scale is greater than or equal to the second data amount and is closest to the second data amount;
  • the second computing core graph is called, each computing core is sequentially run according to the running order represented by the second computing core graph, the merged data is processed, and the reasoning process of the target neural network model is completed.
  • in this way, multiple computing core graphs are pre-constructed, and the GPU selects the second computing core graph whose corresponding calculation scale is greater than or equal to the second data amount and is closest to the second data amount, so that the GPU can process the merged data based on the second computing core graph while the amount of data processed during the processing is minimal, thereby saving the data processing resources of the GPU.
  • the size of the pre-allocated storage space required by the GPU to complete the inference of the target neural network model is not less than the sum of the third data amount, the fourth data amount, and the maximum required storage space;
  • the third data amount is the data amount of the target neural network model
  • the fourth data amount is the sum of data amounts of operation results obtained after data processing based on each operation core
  • the maximum required storage space is the maximum storage space required for data processing based on each computing core.
  • the size of the above-mentioned pre-allocated storage space is greater than or equal to the sum of the third data amount, the fourth data amount and the maximum required storage space, so that the GPU can normally complete model inference based on the above-mentioned pre-allocated storage space. process.
  • embodiments of the present disclosure also provide a model reasoning device applied to the CPU.
  • FIG 9 is a schematic structural diagram of a third model reasoning device provided by an embodiment of the present disclosure.
  • the above device includes the following modules 901-903.
  • the computing core graph sending module 901 is used to send the pre-constructed computing core graph to the graphics processor GPU, wherein each node in the computing core graph corresponds to a computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
  • the data sending module 902 is used to send the data to be processed to the GPU when there is a target processing request, so that the GPU runs each computing core in sequence according to the running order represented by the computing core graph, processes the data to be processed based on the preset storage space in the video memory, completes the inference process of the target neural network model, and feeds back the model inference result to the CPU, wherein the target processing request is a request to use the target neural network model to process the data to be processed;
  • the first result receiving module 903 is used to receive the model inference result fed back by the GPU.
  • the CPU sends the computing core graph to the GPU, and the GPU can run each computing core in sequence according to the above-mentioned computing core graph to process the data to be processed, thereby completing the model inference process of the target neural network model.
  • the CPU only needs to send the computing core map to the GPU once, so that the GPU can subsequently complete the model inference process based on the received computing core map.
  • the number of interactions between the CPU and the GPU in this embodiment is smaller, which can reduce the impact of the interaction between the CPU and the GPU on GPU model inference, thereby improving the efficiency of GPU model inference.
  • the computing core map sending module 901 is specifically used to:
  • the calculation scale corresponding to each computing core graph represents the amount of data that the computing cores contained in the computing core graph can process.
  • in this solution, multiple computing core graphs are pre-constructed, and the GPU can, based on the data amount of the data that actually needs to be processed, select from the computing core graphs the one whose calculation scale is not less than and closest to that data amount for data processing, thereby saving the data processing resources of the GPU.
  • the above device further includes:
  • the computing core sending module is used to, when it is determined that the calculation scale corresponding to the pre-built computing core graph is smaller than the data volume of the data to be processed, send the data to be processed to the GPU and send each target computing core to the GPU in a preset order, so that the GPU sequentially runs each target computing core in the order in which the target computing cores are received, processes the data to be processed, completes the inference process of the target neural network model, and feeds back the model inference result to the CPU;
  • the calculation scale corresponding to the computing core graph represents the amount of data that the computing cores included in the computing core graph can process, the preset order is the execution order of each target computing core specified by the target neural network model, and the amount of data that the target computing cores can process is not less than the data amount of the data to be processed;
  • the second result receiving module is used to receive the model inference result fed back by the GPU.
  • in this way, the calculation scale corresponding to the computing core graph constructed by the embodiment of the present disclosure does not need to be too large.
  • even if the calculation scale corresponding to the computing core graph is smaller than the amount of data to be processed, so that the GPU cannot complete the inference of the target neural network model based on the computing core graph, this embodiment is not limited to realizing model inference only based on the computing core graph; instead, the CPU can send each target computing core to the GPU in sequence to ensure that the model inference process can still be carried out normally.
  • the collection, storage, use, processing, transmission, provision and disclosure of user personal information are in compliance with relevant laws and regulations and do not violate public order and good customs.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • An embodiment of the present disclosure provides an electronic device, including:
  • the memory stores instructions that can be executed by the at least one CPU, and the instructions are executed by the at least one CPU to enable the at least one CPU to execute the steps of any one of the above model inference methods applied to a GPU.
  • Embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute a model inference method applied to a CPU and a model inference method applied to a GPU.
  • Embodiments of the present disclosure provide a computer program product, including a computer program that, when executed by a processor, implements a model inference method applied to a CPU and a model inference method applied to a GPU.
  • FIG. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 1000 includes a GPU 1001, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or loaded from a storage unit 1008 into a random access memory (RAM) 1003.
  • in the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored.
  • GPU 1001, ROM 1002 and RAM 1003 are connected to each other through bus 1004.
  • An input/output (I/O) interface 1005 is also connected to bus 1004.
  • multiple components in the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard, mouse, etc.; an output unit 1007, such as various types of displays, speakers, etc.; a storage unit 1008, such as a magnetic disk, optical disk, etc.; and a communication unit 1009, such as a network card, modem, wireless communication transceiver, etc.
  • the communication unit 1009 allows the device 1000 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • GPU 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. GPU 1001 performs various methods and processes described above, such as model inference methods.
  • in some embodiments, the model inference method can be implemented as a computer software program, which is tangibly included in a machine-readable medium, such as the storage unit 1008.
  • part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009.
  • GPU 1001 may be configured to perform model inference methods in any other suitable manner (eg, via firmware).
  • FIG 11 shows a schematic block diagram of an example electronic device 1100 that may be used to implement another embodiment of the present disclosure.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 1100 includes a CPU 1101, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a random access memory (RAM) 1103. In the RAM 1103, various programs and data required for the operation of the device 1100 can also be stored.
  • CPU 1101, ROM 1102 and RAM 1103 are connected to each other through bus 1104.
  • An input/output (I/O) interface 1105 is also connected to bus 1104.
  • Multiple components in the device 1100 are connected to the I/O interface 1105, including: an input unit 1106, such as a keyboard, a mouse, etc.; an output unit 1107, such as various types of displays, speakers, etc.; a storage unit 1108, such as a magnetic disk, an optical disk, etc.; and a communication unit 1109, such as a network card, a modem, a wireless communication transceiver, etc.
  • the communication unit 1109 allows the device 1100 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • CPU 1101 may be a variety of general and/or special purpose processing components having processing and computing capabilities.
  • the CPU 1101 executes various methods and processes described above, such as the model inference method.
  • For example, in some embodiments, the model inference method may be implemented as a computer software program, which is tangibly included in a machine-readable medium, such as the storage unit 1108.
  • part or all of the computer program may be loaded and/or installed onto device 1100 via ROM 1102 and/or communication unit 1109.
  • Alternatively, the CPU 1101 may be configured to perform the model inference method in any other suitable manner (e.g., by means of firmware).
  • Various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof.
  • These various embodiments may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that the program codes, when executed by the processor or controller, cause the functions and operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing.
  • More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, voice input, or tactile input.
  • the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.
  • Computer systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • the server can be a cloud server, a distributed system server, or a server combined with a blockchain.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides model inference methods and apparatus, devices and a storage medium, and relates to the technical field of data processing, in particular to the technical field of artificial intelligence. A method is applied to a GPU, and has a specific implementation solution as follows: receiving an operation kernel graph corresponding to a target neural network model sent by a CPU, nodes in the operation kernel graph respectively corresponding to operation kernels contained in the target neural network model, and the direction of each edge being used for representing a running sequence of operation kernels corresponding to nodes connected to said edge; after data to be processed sent by the CPU is received, according to the running sequence represented by the operation kernel graph, successively running the operation kernels to process said data so as to complete an inference process of the target neural network model; and feeding a model inference result back to the CPU. When the solution provided by the embodiments of the present disclosure is applied to model inference, the model inference efficiency of the GPU can be improved.

Description

Model inference method, apparatus, device, and storage medium
This disclosure claims priority to the Chinese patent application No. 202210450393.0, entitled "Model Inference Method, Apparatus, Device and Storage Medium", filed with the China Patent Office on April 26, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of data processing technology, especially to the field of artificial intelligence technology, and further relates to a model inference method, apparatus, device, and storage medium.
Background
The model inference process of a neural network model can be composed of multiple different data processing links. Running the different computing cores (kernels) in the neural network model in sequence can complete the different data processing links, thereby realizing the model inference process.
Summary
The present disclosure provides a model inference method, apparatus, device, and storage medium.
According to one aspect of the present disclosure, a model inference method is provided, applied to a GPU, including:
receiving a computing core graph corresponding to a target neural network model sent by a CPU, where each node in the computing core graph corresponds to each computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
after receiving the data to be processed sent by the CPU, running each computing core in sequence according to the running order represented by the computing core graph, processing the data to be processed, and completing the inference process of the target neural network model;
feeding back a model inference result to the CPU.
According to another aspect of the present disclosure, a model inference method is provided, applied to a CPU, including:
sending a pre-constructed computing core graph to a GPU, where each node in the computing core graph corresponds to each computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
when there is a target processing request, sending the data to be processed to the GPU, so that the GPU runs each computing core in sequence according to the running order represented by the computing core graph, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU, where the target processing request is a request to process the data to be processed using the target neural network model;
receiving the model inference result fed back by the GPU.
According to another aspect of the present disclosure, a model inference apparatus is provided, applied to a GPU, including:
a computing core graph receiving module, configured to receive a computing core graph corresponding to a target neural network model sent by a CPU, where each node in the computing core graph corresponds to each computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
a model inference module, configured to, after receiving the data to be processed sent by the CPU, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the inference process of the target neural network model;
a result feedback module, configured to feed back a model inference result to the CPU.
According to another aspect of the present disclosure, a model inference apparatus is provided, applied to a CPU, including:
a computing core graph sending module, configured to send a pre-constructed computing core graph to a graphics processing unit (GPU), where each node in the computing core graph corresponds to each computing core included in the target model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge;
a data sending module, configured to, when there is a target processing request, send the data to be processed to the GPU, so that the GPU runs each computing core in sequence according to the running order represented by the computing core graph, processes the data to be processed based on a preset storage space in the video memory, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU, where the target processing request is a request to process the data to be processed using the target neural network model;
a first result receiving module, configured to receive the model inference result fed back by the GPU.
According to another aspect of the present disclosure, an electronic device is provided, including:
at least one GPU; and
a memory communicatively connected to the at least one GPU; where
the memory stores instructions executable by the at least one GPU, and the instructions are executed by the at least one GPU to enable the at least one GPU to perform any one of the model inference methods applied to the GPU.
According to another aspect of the present disclosure, an electronic device is provided, including:
at least one CPU; and
a memory communicatively connected to the at least one CPU; where
the memory stores instructions executable by the at least one CPU, and the instructions are executed by the at least one CPU to enable the at least one CPU to perform any one of the model inference methods applied to the CPU.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause the computer to execute any one of the model inference methods applied to the GPU or the model inference methods applied to the CPU.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements any one of the model inference methods applied to the GPU or the model inference methods applied to the CPU.
It can be seen from the above that, in the solution provided by the embodiments of the present disclosure, the GPU obtains in advance the computing core graph corresponding to the target neural network model sent by the CPU. The computing core graph contains each computing core included in the target neural network model and can represent the running order of each computing core in the target neural network model. After the GPU receives the data to be processed sent by the CPU, it can call the computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process. Compared with the prior-art approach in which the CPU sends each computing core to the GPU one by one, in this embodiment the CPU can send all computing cores to the GPU by sending the computing core graph once. After the GPU subsequently receives the data to be processed sent by the CPU, the GPU can directly perform model inference based on the computing core graph, and the CPU and the GPU no longer need to exchange computing cores. In this embodiment, the number of interactions between the CPU and the GPU is small, which can reduce the impact of the CPU-GPU interaction on the GPU's model inference, thereby improving the model inference efficiency of the GPU.
It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Description of the Drawings
In order to explain the embodiments of the present invention and the technical solutions of the prior art more clearly, the drawings needed in the embodiments and the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic flowchart of a first model inference method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a computing core graph provided by an embodiment of the present disclosure;
FIG. 3A is a schematic flowchart of a second model inference method provided by an embodiment of the present disclosure;
FIG. 3B is a schematic diagram of a first computing core graph selection process provided by an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of a third model inference method provided by an embodiment of the present disclosure;
FIG. 5A is a schematic flowchart of a fourth model inference method provided by an embodiment of the present disclosure;
FIG. 5B is a schematic diagram of a second computing core graph selection process provided by an embodiment of the present disclosure;
FIG. 6A is a schematic flowchart of a fifth model inference method provided by an embodiment of the present disclosure;
FIG. 6B is a schematic flowchart of a sixth model inference method provided by an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a first model inference apparatus provided by an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a second model inference apparatus provided by an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a third model inference apparatus provided by an embodiment of the present disclosure;
FIG. 10 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure;
FIG. 11 is a schematic block diagram of another electronic device provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding and should be considered as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
First, the application scenarios of the embodiments of the present disclosure are described.
The embodiments of the present disclosure are applied to scenarios in which a CPU and a GPU cooperate to perform model inference. Since a GPU processes data such as images, videos, 3D graphics, and audio relatively quickly, services such as image recognition, voice interaction, and image retrieval can be completed efficiently through the GPU. In this process, the GPU can complete the above services through the model inference process based on a neural network model, and the CPU can send the computing cores included in the neural network model to the GPU, so that the GPU runs each computing core to complete the model inference process.
The above CPU and GPU can run in the same electronic device, and the electronic device can be a computer, a mobile phone, a server, etc. An electronic device equipped with the above CPU and GPU can receive data processing requests sent by other devices, where a data processing request contains the data to be processed, so as to request the CPU and the GPU to complete the model inference process.
The model inference method provided by the embodiments of the present disclosure is described in detail below.
Referring to FIG. 1, which is a schematic flowchart of a first model inference method provided by an embodiment of the present disclosure, applied to a GPU, the method includes the following steps S101-S103.
S101: receiving a computing core graph corresponding to a target neural network model sent by a CPU.
Each node in the computing core graph corresponds to each computing core included in the target neural network model, and the direction of each edge is used to represent the running order of the computing cores corresponding to the nodes connected by the edge.
Specifically, after receiving the computing core graph, the GPU can store the computing core graph. Different computing cores correspond to different data processing links, and the GPU can implement different data processing links based on different computing cores. For example, the data processing links may include matrix multiplication calculation, data activation processing, data division calculation, etc.
In addition, the structure of the target neural network model is relatively fixed, that is, the execution order of each data processing link during data processing through the target neural network model is relatively fixed, so the running order of each computing core in the target neural network model is relatively fixed. Therefore, the computing core graph of the target neural network model can be constructed in advance.
The computing core graph can be constructed through the API (Application Programming Interface) of CUDA (Compute Unified Device Architecture), in which case the computing core graph can be called a CUDA-Graph (Compute Unified Device Architecture Graph).
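As an illustrative sketch only (not code from this disclosure), one way such a kernel graph could be built with the CUDA runtime API is by capturing the kernel launch sequence on a stream and instantiating it once; the kernel names, launch dimensions, and the CUDA 11-style signature of cudaGraphInstantiate used here are assumptions.

```cuda
// Hedged sketch: capture a kernel sequence into a CUDA graph so the host only
// needs to launch the whole graph, not each kernel, for every inference request.
#include <cuda_runtime.h>

__global__ void kernelA(float *x) { x[threadIdx.x] *= 2.0f; }  // placeholder stage 1
__global__ void kernelB(float *x) { x[threadIdx.x] += 1.0f; }  // placeholder stage 2

cudaGraphExec_t buildGraph(cudaStream_t stream, float *devBuf) {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // Record the launch sequence on the stream instead of executing it immediately.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernelA<<<128, 256, 0, stream>>>(devBuf);
    kernelB<<<128, 256, 0, stream>>>(devBuf);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once; the executable graph can then be launched repeatedly
    // without further per-kernel CPU-GPU interaction (CUDA 11-style signature).
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphDestroy(graph);
    return graphExec;
}

// Later, once the pending data is resident on the GPU:
//   cudaGraphLaunch(graphExec, stream);
//   cudaStreamSynchronize(stream);
```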
Referring to FIG. 2, which is a schematic structural diagram of a computing core graph provided by an embodiment of the present disclosure.
The computing core graph of the target neural network model shown in FIG. 2 contains four nodes, nodes 1-4, corresponding to computing cores 1-4 respectively. An arrow between two nodes indicates the running order of the computing cores corresponding to the nodes connected by the arrow: the computing core corresponding to the node at the tail of the arrow is executed first, and then the computing core corresponding to the node at the head of the arrow is executed. Computing cores 1-4 are respectively used for matrix multiplication calculation, matrix addition calculation, matrix scalar multiplication calculation, and convolution processing. The computing core graph shown in FIG. 2 indicates that the target neural network model first performs matrix multiplication calculation on the input data, then performs matrix addition calculation and matrix scalar multiplication calculation respectively, and then performs convolution processing on the calculation result of the matrix addition calculation and the calculation result of the matrix scalar multiplication calculation.
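Purely as a hedged illustration of the FIG. 2 topology (not the disclosure's own implementation), the same dependency structure could also be expressed with explicit CUDA graph nodes; the four trivial kernels below stand in for the real matrix multiplication, matrix addition, scalar multiplication, and convolution cores, and the launch dimensions are assumed.

```cuda
// Sketch of the FIG. 2 topology with explicit CUDA graph nodes
// (node1 -> node2, node1 -> node3, {node2, node3} -> node4).
#include <cuda_runtime.h>

__global__ void matMulK(float *d)    { d[threadIdx.x] *= d[threadIdx.x]; } // stand-in
__global__ void matAddK(float *d)    { d[threadIdx.x] += 1.0f; }           // stand-in
__global__ void scalarMulK(float *d) { d[threadIdx.x] *= 2.0f; }           // stand-in
__global__ void convK(float *d)      { d[threadIdx.x] -= 1.0f; }           // stand-in

void buildFigure2Graph(float *devBuf, cudaGraph_t *outGraph) {
    cudaGraphCreate(outGraph, 0);

    void *args[] = { &devBuf };
    cudaKernelNodeParams p = {};
    p.gridDim = dim3(1);
    p.blockDim = dim3(256);
    p.sharedMemBytes = 0;
    p.kernelParams = args;
    p.extra = nullptr;

    cudaGraphNode_t n1, n2, n3, n4;

    p.func = (void *)matMulK;                              // node 1: no dependencies
    cudaGraphAddKernelNode(&n1, *outGraph, nullptr, 0, &p);

    p.func = (void *)matAddK;                              // edge: node1 -> node2
    cudaGraphAddKernelNode(&n2, *outGraph, &n1, 1, &p);

    p.func = (void *)scalarMulK;                           // edge: node1 -> node3
    cudaGraphAddKernelNode(&n3, *outGraph, &n1, 1, &p);

    cudaGraphNode_t deps[] = { n2, n3 };                   // edges: node2 -> node4, node3 -> node4
    p.func = (void *)convK;
    cudaGraphAddKernelNode(&n4, *outGraph, deps, 2, &p);
}
```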
S102: after receiving the data to be processed sent by the CPU, running each computing core in sequence according to the running order represented by the computing core graph, processing the data to be processed, and completing the inference process of the target neural network model.
Specifically, after receiving the data to be processed, the GPU can use a pre-allocated storage space to complete the model inference process.
The address of the pre-allocated storage space is a fixed address corresponding to the target neural network model, and the size of the pre-allocated storage space is a preset size corresponding to the target neural network model.
The size of the pre-allocated storage space may be set by the user based on experience, or may be no less than the sum of a third data amount, a fourth data amount, and the size of the maximum required storage space.
The third data amount is the data amount of the target neural network model, specifically, the data amount of the model parameters of the target model; the fourth data amount is the sum of the data amounts of the operation results obtained after data processing based on each computing core; and the size of the maximum required storage space is greater than or equal to the size of the maximum storage space required by the GPU in the process of data processing based on each computing core.
In one embodiment of the present disclosure, based on the computation scale corresponding to the computing core graph, the data amount of the operation result obtained after data processing by each computing core in the computing core graph, and the size of the temporary storage space required by the GPU in the process of data processing based on each computing core, can be estimated in advance, for example by manual estimation or by a pre-written estimation program.
After the GPU completes data processing based on a computing core, it stores the processing result in the storage space, so a different storage space needs to be reserved for each computing core to store its processing result. The pre-allocated storage space therefore needs to be able to accommodate the operation results of all computing cores, that is, the size of the pre-allocated storage space needs to be greater than or equal to the sum of the data amounts of the operation results of the computing cores, namely the fourth data amount.
In addition, the temporary storage space corresponding to each computing core is used to store the intermediate calculation values generated in the process of data processing by the GPU based on that computing core. For each computing core, after the GPU completes the data processing process based on that computing core, the intermediate calculation values stored in the temporary storage space are released, so the GPU can reuse the same temporary storage space in the process of data processing based on different computing cores. The temporary storage space needs to be able to accommodate the intermediate calculation value with the largest data amount generated in the process of data processing based on the computing cores; a temporary storage space that meets this requirement can be called the maximum required storage space, and the size of the pre-allocated storage space needs to be greater than or equal to the size of the maximum required storage space.
Furthermore, the pre-allocated storage space also needs to be able to store the target neural network model.
Making the size of the pre-allocated storage space greater than or equal to the sum of the third data amount, the fourth data amount, and the size of the maximum required storage space enables the GPU, based on the pre-allocated storage space, to normally store the target neural network model, the intermediate calculation values generated when data processing is performed based on each computing core, and the operation results obtained after data processing based on each computing core, so that the model inference process can be completed normally.
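A minimal sketch of how the pre-allocated storage space described above might be sized and reserved once on the GPU, assuming illustrative field names for the three quantities (model parameters, summed kernel outputs, and the largest temporary workspace); the struct, byte counts, and helper function are not from the disclosure.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical sizes, in bytes, estimated offline for one computing core graph.
struct GraphMemoryPlan {
    size_t modelParamBytes;    // "third data amount": parameters of the target model
    size_t totalOutputBytes;   // "fourth data amount": sum of all kernel output sizes
    size_t maxWorkspaceBytes;  // largest temporary workspace any single kernel needs
};

// The pre-allocated space must hold the model, every kernel's output, and the
// largest temporary workspace (which is reused by each kernel in turn).
void *preallocateInferenceBuffer(const GraphMemoryPlan &plan) {
    size_t total = plan.modelParamBytes + plan.totalOutputBytes + plan.maxWorkspaceBytes;
    void *devPtr = nullptr;
    cudaMalloc(&devPtr, total);  // fixed device address reused for every inference request
    return devPtr;
}
```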
S103: feeding back the model inference result to the CPU.
It can be seen from the above that, in the solution provided by the embodiments of the present disclosure, the GPU obtains in advance the computing core graph corresponding to the target neural network model sent by the CPU. The computing core graph contains each computing core included in the target neural network model and can represent the running order of each computing core in the target neural network model. After the GPU receives the data to be processed sent by the CPU, it can call the computing core graph, run each computing core in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process. Compared with the prior-art approach in which the CPU sends each computing core to the GPU one by one, in this embodiment the CPU can send all computing cores to the GPU by sending the computing core graph once. After the GPU subsequently receives the data to be processed sent by the CPU, the GPU can directly perform model inference based on the computing core graph, and the CPU and the GPU no longer need to exchange computing cores. In this embodiment, the number of interactions between the CPU and the GPU is small, which can reduce the impact of the CPU-GPU interaction on the GPU's model inference, thereby improving the model inference efficiency of the GPU.
Referring to FIG. 3A, which is a schematic flowchart of a second model inference method provided by an embodiment of the present disclosure. Specifically, step S101 can be implemented through the following step S101A, and step S102 can be implemented through steps S102A-S102B.
First, the computation scale of a computing core graph is explained:
Different computing core graphs correspond to different computation scales, and the computation scale corresponding to each computing core graph represents the data amount of the data that the computing cores included in that computing core graph can process.
Moreover, the data amount of the data that the GPU can process based on a computing core is a fixed data amount, which can be called the computation scale corresponding to the computing core. Each computing core can be set to support a mask operation, so that when a computing core processes data whose data amount is smaller than its own computation scale, the data to be processed can be expanded to the computation scale before data processing, enabling the GPU to process, based on a computing core, data whose data amount is less than or equal to the computation scale corresponding to that computing core.
For example, if the data to be processed is a matrix and the computing core corresponds to a computation scale of a 50×50 matrix, then when the GPU processes a 30×30 matrix based on that computing core, elements can be added to the matrix to expand its size to 50×50 before processing, and after the processing result is obtained, the processing results of the added elements are removed.
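The padding behaviour in this example could look like the following host-side sketch, where a rows×cols matrix is embedded into a scale×scale buffer before launch and the added elements are zero-filled; the function name, zero-fill policy, and row-major layout are assumptions made for illustration.

```cuda
#include <cstring>
#include <vector>

// Pad a rows x cols matrix (row-major) into a scale x scale buffer, zero-filling
// the added elements so a kernel built for the larger computation scale can run.
std::vector<float> padToScale(const std::vector<float> &m, int rows, int cols, int scale) {
    std::vector<float> padded(static_cast<size_t>(scale) * scale, 0.0f);
    for (int r = 0; r < rows; ++r)
        std::memcpy(&padded[static_cast<size_t>(r) * scale],
                    &m[static_cast<size_t>(r) * cols],
                    cols * sizeof(float));
    return padded;
}

// Usage idea: pad a 30x30 input for a graph whose computation scale is 50x50,
// copy `padded` to the GPU, run the graph, then read back only the first 30
// values of each of the first 30 rows of the result.
```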
In addition, the computation scales corresponding to the computing cores included in a computing core graph can be the same or different. However, in order to enable the computing cores included in the computing core graph to process data uniformly, computing cores with the same computation scale can be selected when constructing the computing core graph, and that shared computation scale can then be used as the computation scale corresponding to the computing core graph.
In order to enable the GPU to normally perform data processing based on the computing core graph, the computation scale corresponding to the computing core graph can be set based on the data amounts of historical data in the application scenario of the target model.
For example, the computation scale corresponding to the computing core graph can be set to be greater than or equal to the maximum data amount of the historical data in the application scenario, so that the GPU can, in theory, process all the data to be processed in the application scenario based on the computing core graph.
Alternatively, the maximum data amount of the historical data can be multiplied by a first preset ratio, and the result can be used as the computation scale corresponding to the computing core graph. For example, the first preset ratio can be 70%, 80%, etc., so that the GPU can process most of the data to be processed in the application scenario based on the computing core graph.
S101A: receiving multiple computing core graphs corresponding to the target neural network model sent by the CPU.
Specifically, the computing cores recorded in the different computing core graphs are all computing cores included in the target model, and the different computing core graphs have the same structure; the only difference is that they correspond to different computation scales.
In the embodiment of the present disclosure, there are multiple different computing core graphs, and the GPU can store all of the received computing core graphs in the storage space, so that the stored computing core graphs can be called directly for model inference later.
In addition, the pre-allocated storage space can be a storage space that can be reused, that is, no matter which computing core graph the CPU selects and sends to the GPU, the GPU can use the pre-allocated storage space in the process of model inference based on the received computing core graph. The larger the computation scale corresponding to the computing core graph used by the GPU in the process of model inference, the larger the data amount of the data processed in the model inference process and the larger the required storage space. Therefore, if the pre-allocated storage space can meet the data storage requirements of the computing core graph with the largest computation scale, it can be reused for the other computing core graphs. The size of the pre-allocated storage space can therefore be determined based on the computing core graph with the largest computation scale; for the specific way of determining the size of the pre-allocated storage space, refer to the description at step S102 above, which will not be repeated here.
S102A: based on the first data amount of the data to be processed, selecting a first computing core graph from the computing core graphs.
The first computing core graph is a computing core graph whose corresponding computation scale is not less than the first data amount and is closest to the first data amount.
In addition, referring to the foregoing description, based on a computing core graph the GPU can perform model inference on data whose data amount is less than or equal to the computation scale corresponding to that computing core graph; in the process of model inference, the GPU can expand the data amount of the data to be processed to the computation scale corresponding to the computing core graph before processing. Therefore, the larger the computation scale corresponding to the computing core graph, the larger the data amount of the data actually processed in the process of model inference based on that computing core graph, and the more data processing resources are consumed.
In the embodiment of the present disclosure, multiple computing core graphs corresponding to different computation scales are constructed in advance, and the CPU sends all of them to the GPU in advance, so the GPU can complete the model inference process based on any one of the computing core graphs. Before processing the data to be processed, the GPU can select, from the multiple computing core graphs, the computing core graph whose corresponding computation scale is greater than or equal to and closest to the first data amount, so that the GPU can process the data to be processed based on the selected computing core graph while the data amount actually processed during processing is the smallest.
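A hedged host-side sketch of the selection rule in S102A is shown below; the CapturedGraph structure, its field names, and the sorted-by-scale assumption are illustrative and not taken from the disclosure.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <cstddef>

struct CapturedGraph {
    size_t computeScale;   // largest data amount this graph can handle, in bytes
    cudaGraphExec_t exec;  // instantiated executable graph for this scale
};

// Pick the graph whose scale is >= the input size and closest to it.
// Assumes `graphs` is sorted by ascending computeScale; returns nullptr if
// even the largest graph is too small for the request.
const CapturedGraph *selectGraph(const std::vector<CapturedGraph> &graphs, size_t dataBytes) {
    for (const CapturedGraph &g : graphs)
        if (g.computeScale >= dataBytes)
            return &g;
    return nullptr;
}
```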
The computation scale corresponding to each computing core graph can be any value. Specifically, the computation scale corresponding to each computing core graph can be set based on the maximum data amount of the historical data in the application scenario of the target model.
In one embodiment of the present disclosure, the maximum data amount of the historical data in the application scenario of the target model can be determined and multiplied by different second preset ratios, with the obtained results used as the computation scales corresponding to the respective computing core graphs.
For example, if the maximum data amount of the historical data in the application scenario of the target model is 80M and the second preset ratios are 100%, 80%, and 60% respectively, the computation scales corresponding to the computing core graphs can be set to 80M, 64M, and 48M.
Alternatively, the maximum of the computation scales corresponding to the computing core graphs can be set based on the maximum data amount of the historical data in the application scenario of the target model, and the quotient of this maximum value and the number of computing core graphs can then be calculated and used as the difference between the computation scales corresponding to adjacent computing core graphs. The computation scales corresponding to the computing core graphs are set based on this difference, so that they form an arithmetic sequence.
For example, if the maximum data amount of the historical data in the application scenario of the target model is 100M, the maximum computation scale corresponding to the computing core graphs can be set to 100M. If the number of computing core graphs is 10, the quotient is 10M, and the computation scales corresponding to the computing core graphs can be set to 10M, 20M, 30M, 40M, 50M, 60M, 70M, 80M, 90M, and 100M respectively.
S102B: running each computing core in sequence according to the running order represented by the first computing core graph, and processing the data to be processed.
Specifically, step S102B is similar to the aforementioned step S102 and will not be repeated here.
It can be seen from the above that, in the embodiment of the present disclosure, multiple computing core graphs are constructed in advance. In the process of model inference, the GPU selects the computing core graph whose corresponding computation scale is greater than or equal to and closest to the first data amount for model inference, so that while the GPU can process the data to be processed based on the selected computing core graph, the data amount processed during processing is the smallest, which can save the data processing resources of the GPU.
Referring to FIG. 3B, which is a schematic diagram of the first computing core graph selection process provided by an embodiment of the present disclosure.
FIG. 3B contains an input module and n computing core graphs, namely computing core graph 1, computing core graph 2, ..., computing core graph n. The arrows between the input module and the computing core graphs indicate that the GPU can select one of computing core graph 1 to computing core graph n based on the actual first data amount of the input data to be processed and use the selected computing core graph for model inference. The arrows between the computing core graphs and the pre-allocated storage space indicate that the GPU shares the same pre-allocated storage space in the process of model inference based on the different computing core graphs.
Referring to FIG. 4, which is a schematic flowchart of a third model inference method provided by an embodiment of the present disclosure. Compared with the embodiment shown in FIG. 1, step S102 can be implemented through the following step S102C, and step S103 can be implemented through the following step S103A.
S102C: after receiving multiple pieces of data to be processed sent by the CPU, merging the multiple pieces of data to be processed into merged data, calling the computing core graph, running each computing core in sequence according to the running order represented by the computing core graph, processing the merged data, and completing the inference process of the target neural network model.
The multiple pieces of data to be processed are all data to be processed by the target neural network model, so the processing of the multiple pieces of data to be processed can be realized based on the same computing core graph of the target neural network model.
In one embodiment of the present disclosure, when the CPU receives multiple data processing requests, if there are multiple data processing requests that request data processing through the target neural network model, the data to be processed contained in these data processing requests can be sent to the GPU together, so that the GPU receives the multiple pieces of data to be processed.
In addition, after receiving the multiple pieces of data to be processed, the GPU can uniformly expand the data amount of each piece of data to be processed to the maximum data amount among the pieces of data to be processed, and then merge them into one piece of merged data. When processing the merged data, the GPU can call the computing core graph only once and process the merged data based on that computing core graph, which is equivalent to completing the processing of the multiple pieces of data to be processed.
Specifically, the process in which the GPU processes the merged data is similar to the content shown in step S102 above, and the way in which the GPU expands the data amount of the data to be processed is similar to the content shown in step S102A above, which will not be repeated in this embodiment.
S103A: extracting the model inference result corresponding to each piece of data to be processed from the model inference result of the merged data, and feeding back the model inference result corresponding to each piece of data to be processed to the CPU respectively.
Specifically, according to the arrangement order of the pieces of data to be processed in the merged data, the processing result corresponding to each piece of data to be processed can be extracted from the model inference result of the merged data, and the processing results corresponding to the expanded data can then be removed, so as to obtain the model inference result corresponding to each piece of data to be processed.
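To illustrate the merge-and-split flow of S102C/S103A, here is a hedged host-side sketch that pads each pending input to the largest input size, concatenates the inputs into one merged buffer, and later slices the merged result back apart; the buffer layout, the zero padding, and the simplification that each request's output occupies the same slot size as its input are assumptions.

```cuda
#include <vector>
#include <algorithm>
#include <cstddef>

// Merge several pending inputs into one buffer, padding each to the largest size.
std::vector<float> mergeInputs(const std::vector<std::vector<float>> &inputs, size_t &slotSize) {
    slotSize = 0;
    for (const auto &in : inputs)
        slotSize = std::max(slotSize, in.size());
    std::vector<float> merged(slotSize * inputs.size(), 0.0f);
    for (size_t i = 0; i < inputs.size(); ++i)
        std::copy(inputs[i].begin(), inputs[i].end(), merged.begin() + i * slotSize);
    return merged;  // copy this to the GPU and run the graph once on the whole buffer
}

// Split the merged inference result back into one result per original request,
// dropping the padded tail of each slot.
std::vector<std::vector<float>> splitResults(const std::vector<float> &mergedOut,
                                             const std::vector<size_t> &realSizes,
                                             size_t slotSize) {
    std::vector<std::vector<float>> results;
    for (size_t i = 0; i < realSizes.size(); ++i)
        results.emplace_back(mergedOut.begin() + i * slotSize,
                             mergedOut.begin() + i * slotSize + realSizes[i]);
    return results;
}
```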
It can be seen from the above that, if there are multiple pieces of data to be processed that are to be processed through the target neural network model, the GPU can merge them into one piece of merged data and then call the computing core graph to process the merged data, which is equivalent to processing each piece of data to be processed in a unified manner. In this process, the GPU only needs to call the computing core graph once to complete the processing of multiple pieces of data to be processed. Compared with calling the computing core graph once for each piece of data to be processed, the computing core graph is called fewer times in this embodiment, which can further improve the model inference efficiency of the GPU.
Referring to FIG. 5A, which is a schematic flowchart of a fourth model inference method provided by an embodiment of the present disclosure. Compared with the embodiment shown in FIG. 4, step S101 can be implemented through the following step S101B, and step S102C can be implemented through the following steps S102C1-S102C2.
S101B: receiving multiple computing core graphs corresponding to the target neural network model sent by the CPU.
Different computing core graphs correspond to different computation scales, and the computation scale corresponding to each computing core graph represents the data amount of the data that the computing cores included in that computing core graph can process.
Specifically, step S101B is similar to the aforementioned step S101A and will not be repeated in this embodiment.
S102C1: based on a second data amount, selecting a second computing core graph from the computing core graphs.
The second data amount is the product of the maximum data amount among the pieces of data to be processed and the number of pieces of data to be processed, and the second computing core graph is the computing core graph whose corresponding computation scale is greater than or equal to and closest to the second data amount.
Specifically, the data amount of each piece of data to be processed is less than or equal to the maximum data amount among the pieces of data to be processed, so after the pieces of data to be processed are merged into merged data, the data amount of the merged data will not be greater than the product of the maximum data amount and the number of pieces of data to be processed. The computation scale corresponding to the selected second computing core graph is greater than or equal to the second data amount, so the GPU can process the merged data based on the selected second computing core graph; and since the computation scale corresponding to the selected second computing core graph is closest to the second data amount, the GPU consumes the least computing resources overall when processing the merged data based on the selected second computing core graph.
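Reusing the hypothetical CapturedGraph/selectGraph helpers from the earlier sketch, the second-graph choice in S102C1 could be expressed as follows; again an assumption-laden illustration rather than the disclosure's code.

```cuda
#include <vector>
#include <cstddef>

// Assumes the CapturedGraph struct and selectGraph() helper sketched earlier.
const CapturedGraph *selectGraphForBatch(const std::vector<CapturedGraph> &graphs,
                                         const std::vector<std::vector<float>> &pendingInputs) {
    size_t maxSize = 0;
    for (const auto &in : pendingInputs)
        if (in.size() > maxSize) maxSize = in.size();

    // Second data amount: largest single input size times the number of inputs.
    size_t secondDataAmount = maxSize * pendingInputs.size();

    // Choose the graph whose computation scale is >= this amount and closest to it.
    return selectGraph(graphs, secondDataAmount);
}
```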
S102C2: calling the second computing core graph, running the computing cores in sequence according to the running order represented by the second computing core graph, and processing the merged data.
Specifically, the way of processing the merged data is similar to the content described in the aforementioned step S102 and will not be described again in this embodiment.
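One possible way to realize "calling the computing core graph" is to launch a pre-captured CUDA graph. The disclosure does not mandate any particular GPU API, so the sketch below is only one assumed realization, with a placeholder kernel standing in for the model's computing cores.

```cpp
#include <cuda_runtime.h>

// Placeholder for one computing core of the target neural network model.
__global__ void dummy_core(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float* buf = nullptr;
    cudaMalloc(&buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the computing cores once, in their running order, into a graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    dummy_core<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);   // core 1
    dummy_core<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);   // core 2, runs after core 1
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);            // CUDA 11.4+

    // "Calling the computing core graph": one launch runs all captured cores in order.
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(buf);
    cudaStreamDestroy(stream);
    return 0;
}
```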
As can be seen from the above, multiple computing core graphs are pre-built in this embodiment of the present disclosure. During model inference, the GPU selects the second computing core graph whose corresponding calculation scale is greater than or equal to, and closest to, the second data amount. Based on the second computing core graph, the GPU is able to process the merged data while handling the smallest possible amount of data during the processing, thereby saving the data processing resources of the GPU.
Referring to Figure 5B, which is a schematic diagram of a second computing core graph selection process provided by an embodiment of the present disclosure.
Compared with the embodiment shown in Figure 3B, Figure 5B further includes m pieces of data to be processed, namely data to be processed 1, data to be processed 2, ..., data to be processed m, all of which are data to be processed by the target neural network model. There is an arrow between each piece of data to be processed and the input, indicating that the GPU can process multiple pieces of data to be processed in a unified manner.
Referring to Figure 6A, which is a schematic flowchart of a fifth model inference method provided by an embodiment of the present disclosure, applied to a CPU. The method includes the following steps S601-S603.
S601: sending a pre-built computing core graph to the GPU.
Each node in the computing core graph corresponds to one of the computing cores included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by that edge.
S602: in the case where there is a target processing request, sending data to be processed to the GPU, so that the GPU runs the computing cores in sequence according to the running order represented by the computing core graph, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU.
The target processing request is a request to process the data to be processed using the target neural network model.
S603: receiving the model inference result fed back by the GPU.
In an embodiment of the present disclosure, the above steps S601-S603 are similar to the aforementioned steps S101-S103, the only difference being the execution subject, and they will not be described again here.
As can be seen from the above, in the solution provided by this embodiment of the present disclosure, the CPU sends the computing core graph to the GPU, and the GPU can then run the computing cores in sequence according to the computing core graph and process the data to be processed, thereby completing the model inference process of the target neural network model. In this process, the CPU only needs to send the computing core graph to the GPU once, and the GPU can subsequently complete the model inference process based on the received computing core graph. Compared with the prior art, in which the CPU sends the individual computing cores to the GPU multiple times during model inference, the number of interactions between the CPU and the GPU in this embodiment is smaller, which reduces the influence of the CPU-GPU interaction on the GPU's model inference and thereby improves the model inference efficiency of the GPU.
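Continuing the capture sketch above, the reduced interaction count can be illustrated by how a request is served once the instantiated graph is already resident on the GPU side: each request costs a single graph launch instead of one launch per computing core. The function name and parameters below are hypothetical.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// After the one-time graph transfer/instantiation, serving one request needs only
// an input copy plus a single graph launch, rather than N per-core launches.
void serve_request(cudaGraphExec_t exec, cudaStream_t stream,
                   float* dev_in, const float* host_in, size_t bytes) {
    cudaMemcpyAsync(dev_in, host_in, bytes, cudaMemcpyHostToDevice, stream);
    cudaGraphLaunch(exec, stream);      // one call replaces per-core kernel launches
    cudaStreamSynchronize(stream);
}
```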
The embodiment of the present disclosure may implement the above step S601 through the following step A.
Step A: sending pre-built computing core graphs to the GPU.
Different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process.
In an embodiment of the present disclosure, after the GPU receives the multiple computing core graphs sent by the CPU, it may process the data to be processed according to the aforementioned steps S102A-S102B, which will not be described again here.
As can be seen from the above, multiple computing core graphs are pre-built in this embodiment of the present disclosure. During model inference, the GPU can, based on the data amount of the data that actually needs to be processed, select from the computing core graphs the one whose calculation scale is greater than or equal to, and closest to, that data amount to perform data processing, thereby saving the data processing resources of the GPU.
Referring to Figure 6B, which is a schematic flowchart of a sixth model inference method provided by an embodiment of the present disclosure. Compared with the embodiment shown in Figure 6A, the following steps S604-S605 are further included after the above step S601.
S604: in the case where it is determined that the calculation scale corresponding to the pre-built computing core graph is smaller than the data amount of the data to be processed, sending the data to be processed to the GPU, and sending target computing cores to the GPU in a preset order, so that the GPU runs the target computing cores in sequence in the order in which they are received, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU.
The calculation scale corresponding to the computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process; the preset order is the execution order of the target computing cores specified by the target neural network model; and the amount of data that the target computing cores are able to process is not smaller than the data amount of the data to be processed.
Specifically, if the GPU determines, after receiving the data to be processed, that the calculation scale corresponding to the computing core graph is greater than or equal to the data amount of the data to be processed, the GPU can process the data to be processed based on the computing core graph. Otherwise, the GPU cannot process the data to be processed based on the computing core graph, and it may send a request to the CPU indicating that it is unable to do so, in order to ask the CPU to assist in completing the data processing in another way.
Upon receiving this request, the CPU can determine that the calculation scale corresponding to the pre-built computing core graph is smaller than the data amount of the data to be processed, and may then execute steps S604-S605.
In an embodiment of the present disclosure, each target computing core corresponds to a different data processing link of the target neural network model. The GPU runs the target computing cores in sequence to complete the data processing links of the target neural network model, thereby realizing the model inference process of the target neural network model.
The preset order in which the CPU sends the target computing cores to the GPU is the same as the running order of the computing cores represented by the computing core graph. The target computing cores correspond to the same data processing links as the computing cores included in the aforementioned computing core graph; the only difference is the amount of data they are able to process, the target computing cores being able to process a larger amount of data.
Specifically, each time the CPU sends a target computing core to the GPU, the GPU can run that target computing core to complete the corresponding data processing. The CPU sends the target computing cores to the GPU in the preset order, and the GPU runs the target computing cores in sequence in the order in which they are received, completing the inference process of the target neural network model.
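The fallback path of S604 can be pictured as issuing the cores one by one instead of through a single graph launch. The sketch below is an assumption-laden illustration: the two placeholder kernels stand in for target computing cores with larger processing capacity, and `run_fallback` is not a name defined by the disclosure.

```cpp
#include <cuda_runtime.h>

// Placeholder "target computing cores": larger-capacity versions of the model's cores.
__global__ void target_core_a(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}
__global__ void target_core_b(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 0.5f;
}

// No pre-built graph is large enough, so the cores are issued one by one
// in the preset order rather than through a single graph launch.
void run_fallback(float* dev_buf, int n, cudaStream_t stream) {
    const int block = 256, grid = (n + block - 1) / block;
    target_core_a<<<grid, block, 0, stream>>>(dev_buf, n);  // first data processing link
    target_core_b<<<grid, block, 0, stream>>>(dev_buf, n);  // second link, after the first
    cudaStreamSynchronize(stream);
}
```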
S605: receiving the model inference result fed back by the GPU.
As can be seen from the above, the calculation scale corresponding to the computing core graph built in this embodiment of the present disclosure does not need to be excessively large. In the case where the calculation scale corresponding to the computing core graph is smaller than the data amount of the data to be processed, so that the GPU cannot complete the model inference of the target neural network model based on the computing core graph, this embodiment is not limited to realizing model inference based on the computing core graph only; instead, the CPU can send the target computing cores to the GPU one by one to ensure that the model inference process can still be carried out normally.
Corresponding to the above model inference method applied to a GPU, an embodiment of the present disclosure further provides a model inference apparatus.
Referring to Figure 7, which is a schematic structural diagram of a first model inference apparatus provided by an embodiment of the present disclosure, applied to a GPU. The apparatus includes the following modules 701-703.
A computing core graph receiving module 701, configured to receive a computing core graph corresponding to a target neural network model sent by a CPU, wherein each node in the computing core graph corresponds to one of the computing cores included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by that edge;
a model inference module 702, configured to, after receiving data to be processed sent by the CPU, run the computing cores in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the inference process of the target neural network model;
a result feedback module 703, configured to feed back a model inference result to the CPU.
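The three modules can be pictured as a small GPU-side object. The sketch below is purely illustrative: it again assumes the computing core graph is realized as an instantiated CUDA graph, and the class, method, and `Result` names are chosen only for this example.

```cpp
#include <cuda_runtime.h>
#include <vector>
#include <cstddef>

struct Result { std::vector<float> values; };  // assumed shape of a model inference result

class GpuInferenceApparatus {
public:
    // Module 701: keep the computing core graph received from the CPU.
    void receive_graph(cudaGraphExec_t exec, cudaStream_t stream) {
        exec_ = exec;
        stream_ = stream;
    }

    // Module 702: process the data to be processed by launching the graph once.
    Result infer(float* dev_in_out, size_t n) {
        cudaGraphLaunch(exec_, stream_);
        cudaStreamSynchronize(stream_);
        Result r;
        r.values.resize(n);
        cudaMemcpy(r.values.data(), dev_in_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        return r;  // Module 703: the result is handed back to the CPU side
    }

private:
    cudaGraphExec_t exec_{};
    cudaStream_t stream_{};
};
```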
As can be seen from the above, in the solution provided by this embodiment of the present disclosure, the GPU obtains in advance the computing core graph, sent by the CPU, that corresponds to the target neural network model. The computing core graph contains the computing cores included in the target neural network model and can represent the running order of those computing cores. After the GPU receives the data to be processed sent by the CPU, it can call the computing core graph, run the computing cores in sequence according to the running order represented by the computing core graph, process the data to be processed, and complete the model inference process. Compared with the prior art, in which the CPU sends the computing cores to the GPU one after another, in this embodiment the CPU can send all of the computing cores to the GPU by sending the computing core graph once. After the GPU subsequently receives the data to be processed sent by the CPU, the GPU can perform model inference directly based on the computing core graph, and the CPU and the GPU no longer need to exchange computing cores. The number of interactions between the CPU and the GPU in this embodiment is therefore smaller, which reduces the influence of the CPU-GPU interaction on the GPU's model inference and thereby improves the model inference efficiency of the GPU.
In an embodiment of the present disclosure, the computing core graph receiving module 701 is specifically configured to:
receive multiple computing core graphs corresponding to the target neural network model sent by the CPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process;
the model inference module 702 is specifically configured to:
after receiving the data to be processed sent by the CPU, select a first computing core graph from the computing core graphs based on a first data amount of the data to be processed, wherein the first computing core graph is the computing core graph whose corresponding calculation scale is not smaller than, and closest to, the first data amount;
run the computing cores in sequence according to the running order represented by the first computing core graph, process the data to be processed, and complete the inference process of the target neural network model.
As can be seen from the above, multiple computing core graphs are pre-built in this embodiment of the present disclosure. During model inference, the GPU selects the computing core graph whose corresponding calculation scale is greater than or equal to, and closest to, the first data amount to perform model inference, so that, while the GPU remains able to process the data to be processed based on the selected computing core graph, the amount of data handled during the processing is the smallest, thereby saving the data processing resources of the GPU.
Referring to Figure 8, which is a schematic structural diagram of a second model inference apparatus provided by an embodiment of the present disclosure. Compared with the embodiment shown in Figure 7, the model inference module 702 includes:
a data processing sub-module 702A, configured to, after receiving multiple pieces of data to be processed sent by the CPU, merge the multiple pieces of data to be processed into merged data, call the computing core graph, run the computing cores in sequence according to the running order represented by the computing core graph, process the merged data, and complete the inference process of the target neural network model, wherein the multiple pieces of data to be processed are all data to be processed by the target neural network model;
the result feedback module 703 includes:
a result feedback sub-module 703A, configured to extract, from the model inference result of the merged data, the model inference result corresponding to each piece of data to be processed, and feed back the model inference result corresponding to each piece of data to be processed to the CPU respectively.
As can be seen from the above, if there are multiple pieces of data to be processed by the target neural network model, the GPU can merge the pieces of data to be processed into one piece of merged data and then call the computing core graph to process the merged data, which is equivalent to processing all of the pieces of data to be processed in a unified manner. In this process, the GPU only needs to call the computing core graph once to complete the processing of the multiple pieces of data to be processed. Compared with calling the computing core graph once for each piece of data to be processed, the number of calls to the computing core graph in this embodiment is smaller, which can further improve the model inference efficiency of the GPU.
In an embodiment of the present disclosure, the computing core graph receiving module 701 is specifically configured to:
receive multiple computing core graphs corresponding to the target neural network model sent by the CPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process;
the data processing sub-module 702A is specifically configured to:
after receiving the multiple pieces of data to be processed sent by the CPU, merge the multiple pieces of data to be processed into merged data, and select a second computing core graph from the computing core graphs based on a second data amount, wherein the second data amount is the product of the maximum data amount among the pieces of data to be processed and the number of pieces of data to be processed, and the second computing core graph is the computing core graph whose corresponding calculation scale is greater than or equal to, and closest to, the second data amount;
call the second computing core graph, run the computing cores in sequence according to the running order represented by the second computing core graph, process the merged data, and complete the inference process of the target neural network model.
As can be seen from the above, multiple computing core graphs are pre-built in this embodiment of the present disclosure. During model inference, the GPU selects the second computing core graph whose corresponding calculation scale is greater than or equal to, and closest to, the second data amount. Based on the second computing core graph, the GPU is able to process the merged data while handling the smallest possible amount of data during the processing, thereby saving the data processing resources of the GPU.
In an embodiment of the present disclosure, the size of the pre-allocated storage space required by the GPU in the process of completing the inference of the target neural network model is not smaller than the sum of a third data amount, a fourth data amount, and a maximum required storage space;
wherein the third data amount is the data amount of the target neural network model, the fourth data amount is the sum of the data amounts of the operation results obtained after data processing based on the computing cores, and the maximum required storage space is the largest storage space required in the process of performing data processing based on the computing cores.
As can be seen from the above, making the size of the pre-allocated storage space greater than or equal to the sum of the third data amount, the fourth data amount, and the maximum required storage space enables the GPU to complete the model inference process normally based on the pre-allocated storage space.
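The sizing rule can be made concrete with a short allocation sketch. It only illustrates the inequality above: `model_bytes`, `per_core_result_bytes`, and `per_core_workspace_bytes` are assumed inputs describing the target neural network model and its computing cores, not values defined by the disclosure.

```cpp
#include <cuda_runtime.h>
#include <vector>
#include <numeric>
#include <algorithm>
#include <cstddef>

// Pre-allocate one block of GPU storage that is at least:
//   third data amount   (model parameters)
// + fourth data amount  (sum of all per-core operation results)
// + the maximum workspace any single core needs while it runs.
void* preallocate_inference_memory(size_t model_bytes,
                                   const std::vector<size_t>& per_core_result_bytes,
                                   const std::vector<size_t>& per_core_workspace_bytes) {
    const size_t results_total =
        std::accumulate(per_core_result_bytes.begin(), per_core_result_bytes.end(), size_t{0});
    const size_t max_workspace = per_core_workspace_bytes.empty()
        ? 0
        : *std::max_element(per_core_workspace_bytes.begin(), per_core_workspace_bytes.end());

    const size_t total = model_bytes + results_total + max_workspace;
    void* pool = nullptr;
    cudaMalloc(&pool, total);   // single allocation reused for the whole inference process
    return pool;
}
```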
Corresponding to the above model inference method applied to a CPU, an embodiment of the present disclosure further provides a model inference apparatus applied to a CPU.
Referring to Figure 9, which is a schematic structural diagram of a third model inference apparatus provided by an embodiment of the present disclosure. The apparatus includes the following modules 901-903.
A computing core graph sending module 901, configured to send a pre-built computing core graph to a graphics processing unit (GPU), wherein each node in the computing core graph corresponds to one of the computing cores included in the target neural network model, and the direction of each edge is used to indicate the running order of the computing cores corresponding to the nodes connected by that edge;
a data sending module 902, configured to, in the case where there is a target processing request, send data to be processed to the GPU, so that the GPU runs the computing cores in sequence according to the running order represented by the computing core graph, processes the data to be processed based on a preset storage space in the graphics memory, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU, wherein the target processing request is a request to process the data to be processed using the target neural network model;
a first result receiving module 903, configured to receive the model inference result fed back by the GPU.
As can be seen from the above, in the solution provided by this embodiment of the present disclosure, the CPU sends the computing core graph to the GPU, and the GPU can then run the computing cores in sequence according to the computing core graph and process the data to be processed, thereby completing the model inference process of the target neural network model. In this process, the CPU only needs to send the computing core graph to the GPU once, and the GPU can subsequently complete the model inference process based on the received computing core graph. Compared with the prior art, in which the CPU sends the individual computing cores to the GPU multiple times during model inference, the number of interactions between the CPU and the GPU in this embodiment is smaller, which reduces the influence of the CPU-GPU interaction on the GPU's model inference and thereby improves the model inference efficiency of the GPU.
In an embodiment of the present disclosure, the computing core graph sending module 901 is specifically configured to:
send pre-built computing core graphs to the GPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process.
As can be seen from the above, multiple computing core graphs are pre-built in this embodiment of the present disclosure. During model inference, the GPU can, based on the data amount of the data that actually needs to be processed, select from the computing core graphs the one whose calculation scale is greater than or equal to, and closest to, that data amount to perform data processing, thereby saving the data processing resources of the GPU.
In an embodiment of the present disclosure, the apparatus further includes:
a computing core sending module, configured to, in the case where it is determined that the calculation scale corresponding to the pre-built computing core graph is smaller than the data amount of the data to be processed, send the data to be processed to the GPU and send target computing cores to the GPU in a preset order, so that the GPU runs the target computing cores in sequence in the order in which they are received, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU;
wherein the calculation scale corresponding to the computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process, the preset order is the execution order of the target computing cores specified by the target neural network model, and the amount of data that the target computing cores are able to process is not smaller than the data amount of the data to be processed;
a second result receiving module, configured to receive the model inference result fed back by the GPU.
As can be seen from the above, the calculation scale corresponding to the computing core graph built in this embodiment of the present disclosure does not need to be excessively large. In the case where the calculation scale corresponding to the computing core graph is smaller than the data amount of the data to be processed, so that the GPU cannot complete the model inference of the target neural network model based on the computing core graph, this embodiment is not limited to realizing model inference based on the computing core graph only; instead, the CPU can send the target computing cores to the GPU one by one to ensure that the model inference process can still be carried out normally.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
An embodiment of the present disclosure provides an electronic device, including:
at least one CPU; and
a memory communicatively connected to the at least one CPU; wherein
the memory stores instructions executable by the at least one CPU, and the instructions are executed by the at least one CPU to enable the at least one CPU to execute the method steps of any one of the model inference methods applied to a GPU.
An embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the model inference method applied to a CPU and the model inference method applied to a GPU.
An embodiment of the present disclosure provides a computer program product, including a computer program which, when executed by a processor, implements the model inference method applied to a CPU and the model inference method applied to a GPU.
Figure 10 shows a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in Figure 10, the device 1000 includes a GPU 1001, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. The RAM 1003 may also store various programs and data required for the operation of the device 1000. The GPU 1001, the ROM 1002, and the RAM 1003 are connected to one another through a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Multiple components in the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard or a mouse; an output unit 1007, such as various types of displays or speakers; a storage unit 1008, such as a magnetic disk or an optical disc; and a communication unit 1009, such as a network card, a modem, or a wireless communication transceiver. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The GPU 1001 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. The GPU 1001 performs the various methods and processes described above, such as the model inference methods. For example, in some embodiments, the model inference methods may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the GPU 1001, one or more steps of the model inference methods described above may be performed. Alternatively, in other embodiments, the GPU 1001 may be configured to perform the model inference methods in any other suitable manner (for example, by means of firmware).
Figure 11 shows a schematic block diagram of an example electronic device 1100 that can be used to implement another embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in Figure 11, the device 1100 includes a CPU 1101, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a random access memory (RAM) 1103. The RAM 1103 may also store various programs and data required for the operation of the device 1100. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to one another through a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
Multiple components in the device 1100 are connected to the I/O interface 1105, including: an input unit 1106, such as a keyboard or a mouse; an output unit 1107, such as various types of displays or speakers; a storage unit 1108, such as a magnetic disk or an optical disc; and a communication unit 1109, such as a network card, a modem, or a wireless communication transceiver. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The CPU 1101 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. The CPU 1101 performs the various methods and processes described above, such as the model inference methods. For example, in some embodiments, the model inference methods may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the CPU 1101, one or more steps of the model inference methods described above may be performed. Alternatively, in other embodiments, the CPU 1101 may be configured to perform the model inference methods in any other suitable manner (for example, by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in a digital electronic circuit system, an integrated circuit system, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs, where the one or more computer programs are executable and/or interpretable on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input apparatus, and at least one output apparatus, and transmit data and instructions to the storage system, the at least one input apparatus, and the at least one output apparatus.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, so that when the program code is executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented. The program code may be executed entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display apparatus (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of apparatuses may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system including a back-end component (for example, as a data server), or a computing system including a middleware component (for example, an application server), or a computing system including a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system including any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication (for example, a communication network) in any form or medium. Examples of the communication network include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The relationship between the client and the server is generated by computer programs running on the corresponding computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above specific implementations do not constitute a limitation on the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (20)

  1. A model inference method, applied to a graphics processing unit (GPU), comprising:
    receiving a computing core graph corresponding to a target neural network model sent by a CPU, wherein each node in the computing core graph corresponds to one of the computing cores included in the target neural network model, and the direction of each edge is used to indicate a running order of the computing cores corresponding to the nodes connected by that edge;
    after receiving data to be processed sent by the CPU, running the computing cores in sequence according to the running order represented by the computing core graph, processing the data to be processed, and completing an inference process of the target neural network model;
    feeding back a model inference result to the CPU.
  2. The method according to claim 1, wherein the receiving a computing core graph corresponding to a target neural network model sent by a CPU comprises:
    receiving multiple computing core graphs corresponding to the target neural network model sent by the CPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents an amount of data that the computing cores included in that computing core graph are able to process;
    the running the computing cores in sequence according to the running order represented by the computing core graph and processing the data to be processed comprises:
    selecting, based on a first data amount of the data to be processed, a first computing core graph from the computing core graphs, wherein the first computing core graph is the computing core graph whose corresponding calculation scale is not smaller than, and closest to, the first data amount;
    running the computing cores in sequence according to the running order represented by the first computing core graph, and processing the data to be processed.
  3. The method according to claim 1, wherein the, after receiving data to be processed sent by the CPU, running the computing cores in sequence according to the running order represented by the computing core graph and processing the data to be processed comprises:
    after receiving multiple pieces of data to be processed sent by the CPU, merging the multiple pieces of data to be processed into merged data, calling the computing core graph, running the computing cores in sequence according to the running order represented by the computing core graph, and processing the merged data, wherein the multiple pieces of data to be processed are all data to be processed by the target neural network model;
    the feeding back a model inference result to the CPU comprises:
    extracting, from a model inference result of the merged data, a model inference result corresponding to each piece of data to be processed, and feeding back the model inference result corresponding to each piece of data to be processed to the CPU respectively.
  4. The method according to claim 3, wherein the receiving a computing core graph corresponding to a target neural network model sent by a CPU comprises:
    receiving multiple computing core graphs corresponding to the target neural network model sent by the CPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents an amount of data that the computing cores included in that computing core graph are able to process;
    the calling the computing core graph, running the computing cores in sequence according to the running order represented by the computing core graph, and processing the merged data comprises:
    selecting, based on a second data amount, a second computing core graph from the computing core graphs, wherein the second data amount is a product of a maximum data amount among the pieces of data to be processed and the number of pieces of data to be processed, and the second computing core graph is the computing core graph whose corresponding calculation scale is greater than or equal to, and closest to, the second data amount;
    calling the second computing core graph, running the computing cores in sequence according to the running order represented by the second computing core graph, and processing the merged data.
  5. The method according to any one of claims 1-4, wherein a size of a pre-allocated storage space required by the GPU in the process of completing the inference of the target neural network model is not smaller than a sum of a third data amount, a fourth data amount, and a maximum required storage space;
    wherein the third data amount is a data amount of the target neural network model, the fourth data amount is a sum of data amounts of operation results obtained after data processing based on the computing cores, and the maximum required storage space is the largest storage space required in the process of performing data processing based on the computing cores.
  6. A model inference method, applied to a CPU, comprising:
    sending a pre-built computing core graph to a graphics processing unit (GPU), wherein each node in the computing core graph corresponds to one of the computing cores included in a target neural network model, and the direction of each edge is used to indicate a running order of the computing cores corresponding to the nodes connected by that edge;
    in the case where there is a target processing request, sending data to be processed to the GPU, so that the GPU runs the computing cores in sequence according to the running order represented by the computing core graph, processes the data to be processed, completes an inference process of the target neural network model, and feeds back a model inference result to the CPU, wherein the target processing request is a request to process the data to be processed using the target neural network model;
    receiving the model inference result fed back by the GPU.
  7. The method according to claim 6, wherein the sending a pre-built computing core graph to a GPU comprises:
    sending pre-built computing core graphs to the GPU, wherein different computing core graphs correspond to different calculation scales, and the calculation scale corresponding to each computing core graph represents an amount of data that the computing cores included in that computing core graph are able to process.
  8. The method according to claim 6 or 7, further comprising, after the sending a pre-built computing core graph to a graphics processing unit (GPU):
    in the case where it is determined that the calculation scale corresponding to the pre-built computing core graph is smaller than a data amount of the data to be processed, sending the data to be processed to the GPU, and sending target computing cores to the GPU in a preset order, so that the GPU runs the target computing cores in sequence in the order in which they are received, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU;
    wherein the calculation scale corresponding to the computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process, the preset order is an execution order of the target computing cores specified by the target neural network model, and the amount of data that the target computing cores are able to process is not smaller than the data amount of the data to be processed;
    receiving the model inference result fed back by the GPU.
  9. A model inference apparatus, applied to a graphics processor GPU, comprising:
    a computing core graph receiving module, configured to receive a computing core graph, corresponding to a target neural network model, sent by a CPU, wherein nodes in the computing core graph respectively correspond to computing cores included in the target neural network model, and the direction of each edge represents the running order of the computing cores corresponding to the nodes connected by that edge;
    a model inference module, configured to, after data to be processed sent by the CPU is received, run the computing cores in sequence according to the running order represented by the computing core graph and process the data to be processed, to complete the inference process of the target neural network model; and
    a result feedback module, configured to feed back a model inference result to the CPU.
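The GPU-side behaviour recited in claim 9 is close to what the CUDA Graph API offers: capture an ordered sequence of kernels once, then replay the whole sequence with a single launch. The sketch below is a minimal illustration under that assumption, not the patented implementation; the kernels coreA/coreB merely stand in for two computing cores, cudaGraphInstantiate is shown with its CUDA 11.x signature, and error handling is omitted.

```cpp
// Minimal CUDA Graph sketch: each captured kernel is one "computing core" node,
// and the capture order on the stream fixes the running order the graph replays.
#include <cuda_runtime.h>

__global__ void coreA(float* buf, int n) {                 // stands in for one computing core
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}
__global__ void coreB(float* buf, int n) {                 // stands in for the next computing core
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const int n = 1024;
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    coreA<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);   // first node
    coreB<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);   // second node, runs after coreA
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);  // CUDA 11.x signature

    cudaGraphLaunch(graphExec, stream);                     // one launch replays both cores in order
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    return 0;
}
```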
  10. The apparatus according to claim 9, wherein the computing core graph receiving module is specifically configured to:
    receive a plurality of computing core graphs, corresponding to the target neural network model, sent by the CPU, wherein different computing core graphs correspond to different computing scales, and the computing scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process;
    and the model inference module is specifically configured to:
    after the data to be processed sent by the CPU is received, select a first computing core graph from the computing core graphs based on a first data amount of the data to be processed, wherein the first computing core graph is the computing core graph whose corresponding computing scale is not less than and closest to the first data amount; and
    run the computing cores in sequence according to the running order represented by the first computing core graph and process the data to be processed, to complete the inference process of the target neural network model.
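One way to read the selection rule in claim 10: among all pre-built graphs, pick the one whose computing scale is not less than the first data amount and closest to it. A minimal sketch, assuming a hypothetical GraphEntry record rather than any API defined by this application:

```cpp
// Pick the smallest pre-built graph that is still large enough for the request.
#include <cstddef>
#include <vector>

struct GraphEntry {
    size_t scale;        // amount of data the graph's computing cores can process
    void*  exec;         // instantiated graph handle, e.g. a cudaGraphExec_t
};

const GraphEntry* selectFirstGraph(const std::vector<GraphEntry>& graphs, size_t firstDataAmount) {
    const GraphEntry* best = nullptr;
    for (const auto& g : graphs) {
        if (g.scale >= firstDataAmount && (best == nullptr || g.scale < best->scale)) {
            best = &g;   // large enough, and closer to the request than the previous pick
        }
    }
    return best;         // nullptr: no graph is large enough (per-core fallback, cf. claims 8/16)
}
```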
  11. The apparatus according to claim 9, wherein the model inference module comprises:
    a data processing submodule, configured to, after a plurality of pieces of data to be processed sent by the CPU are received, merge the plurality of pieces of data to be processed into merged data, call the computing core graph, run the computing cores in sequence according to the running order represented by the computing core graph, and process the merged data, to complete the inference process of the target neural network model, wherein each of the plurality of pieces of data to be processed is data to be processed by the target neural network model;
    and the result feedback module comprises:
    a result feedback submodule, configured to extract, from the model inference result of the merged data, the model inference result corresponding to each piece of data to be processed, and feed back the model inference result corresponding to each piece of data to be processed to the CPU respectively.
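The merge-then-split flow in claim 11 can be sketched on the host side as below. Request, runBatched and the launchGraphOnBatch callback are hypothetical names, and padding every request to a common maxLen is an assumption of this sketch, not something the claim prescribes.

```cpp
// Merge several pending inputs into one padded batch, run one graph launch over it,
// then split the merged result back into per-request results.
#include <cstddef>
#include <cstring>
#include <vector>

struct Request { std::vector<float> input; std::vector<float> output; };

void runBatched(std::vector<Request>& reqs, size_t maxLen,
                void (*launchGraphOnBatch)(float* batch, size_t rows, size_t cols)) {
    std::vector<float> merged(reqs.size() * maxLen, 0.0f);        // pad each request to maxLen
    for (size_t r = 0; r < reqs.size(); ++r) {
        std::memcpy(&merged[r * maxLen], reqs[r].input.data(),
                    reqs[r].input.size() * sizeof(float));        // assumes input.size() <= maxLen
    }

    launchGraphOnBatch(merged.data(), reqs.size(), maxLen);       // one replay serves all requests

    for (size_t r = 0; r < reqs.size(); ++r) {                    // split results per request
        reqs[r].output.assign(merged.begin() + r * maxLen,
                              merged.begin() + (r + 1) * maxLen);
    }
}
```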
  12. The apparatus according to claim 11, wherein the computing core graph receiving module is specifically configured to:
    receive a plurality of computing core graphs, corresponding to the target neural network model, sent by the CPU, wherein different computing core graphs correspond to different computing scales, and the computing scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process;
    and the data processing submodule is specifically configured to:
    after the plurality of pieces of data to be processed sent by the CPU are received, merge the plurality of pieces of data to be processed into merged data, and select a second computing core graph from the computing core graphs based on a second data amount, wherein the second data amount is the product of the maximum data amount among the pieces of data to be processed and the number of pieces of data to be processed, and the second computing core graph is the computing core graph whose corresponding computing scale is greater than or equal to and closest to the second data amount; and
    call the second computing core graph, run the computing cores in sequence according to the running order represented by the second computing core graph, and process the merged data, to complete the inference process of the target neural network model.
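The "second data amount" in claim 12 is simply the product of the largest per-request data amount and the number of pending requests; a worked example with hypothetical figures:

```cpp
// Worked example: requests of sizes 800, 1024 and 512 give 1024 * 3 = 3072.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<size_t> pendingSizes = {800, 1024, 512};                 // per-request data amounts
    size_t maxPerRequest    = *std::max_element(pendingSizes.begin(), pendingSizes.end());
    size_t secondDataAmount = maxPerRequest * pendingSizes.size();       // 1024 * 3 = 3072
    // The second computing core graph is then the pre-built graph whose computing scale is
    // >= secondDataAmount and closest to it (say, one built for 4096 rather than 8192).
    std::printf("second data amount = %zu\n", secondDataAmount);
    return 0;
}
```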
  13. The apparatus according to any one of claims 9-12, wherein the size of the pre-allocated storage space required by the GPU in the process of completing the inference of the target neural network model is not less than the sum of a third data amount, a fourth data amount, and a maximum required storage space;
    wherein the third data amount is the data amount of the target neural network model, the fourth data amount is the sum of the data amounts of the operation results obtained after data processing based on the computing cores, and the maximum required storage space is the maximum storage space required in the process of performing data processing based on the computing cores.
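Claim 13's lower bound on the pre-allocated storage is the sum of three terms: the model's own data amount, the summed sizes of the per-core operation results, and the largest per-core working space. A minimal sketch with hypothetical byte counts:

```cpp
// reserved >= model weights + sum of per-core output sizes + largest per-core workspace
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const size_t modelBytes = 512ull << 20;                               // third data amount
    const std::vector<size_t> coreOutputBytes    = {64ull << 20, 32ull << 20, 16ull << 20};
    const std::vector<size_t> coreWorkspaceBytes = { 8ull << 20, 48ull << 20,  4ull << 20};

    const size_t fourthDataAmount =
        std::accumulate(coreOutputBytes.begin(), coreOutputBytes.end(), size_t{0});
    const size_t maxWorkspace =
        *std::max_element(coreWorkspaceBytes.begin(), coreWorkspaceBytes.end());

    const size_t reserved = modelBytes + fourthDataAmount + maxWorkspace; // lower bound
    std::printf("pre-allocate at least %zu MiB\n", reserved >> 20);
    return 0;
}
```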
  14. A model inference apparatus, applied to a CPU, comprising:
    a computing core graph sending module, configured to send a pre-constructed computing core graph to a graphics processor GPU, wherein nodes in the computing core graph respectively correspond to computing cores included in a target neural network model, and the direction of each edge represents the running order of the computing cores corresponding to the nodes connected by that edge;
    a data sending module, configured to, when there is a target processing request, send data to be processed to the GPU, so that the GPU runs the computing cores in sequence according to the running order represented by the computing core graph, processes the data to be processed based on a preset storage space in the video memory, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU, wherein the target processing request is a request to process the data to be processed by using the target neural network model; and
    a first result receiving module, configured to receive the model inference result fed back by the GPU.
  15. The apparatus according to claim 14, wherein the computing core graph sending module is specifically configured to:
    send a plurality of pre-constructed computing core graphs to the GPU, wherein different computing core graphs correspond to different computing scales, and the computing scale corresponding to each computing core graph represents the amount of data that the computing cores included in that computing core graph are able to process.
  16. The apparatus according to claim 14 or 15, further comprising:
    a computing core sending module, configured to, when it is determined that the computing scale corresponding to the pre-constructed computing core graph is smaller than the data amount of the data to be processed, send the data to be processed to the GPU and send target computing cores to the GPU in a preset order, so that the GPU runs the target computing cores one by one in the order in which they are received, processes the data to be processed, completes the inference process of the target neural network model, and feeds back a model inference result to the CPU;
    wherein the computing scale corresponding to the computing core graph represents the amount of data that the computing cores included in the computing core graph are able to process, the preset order is an execution order of the target computing cores specified by the target neural network model, and the amount of data that the target computing cores are able to process is not less than the data amount of the data to be processed; and
    a second result receiving module, configured to receive the model inference result fed back by the GPU.
  17. An electronic device, comprising:
    at least one GPU; and
    a memory communicatively connected to the at least one GPU; wherein
    the memory stores instructions executable by the at least one GPU, and the instructions are executed by the at least one GPU to enable the at least one GPU to perform the method according to any one of claims 1-5.
  18. An electronic device, comprising:
    at least one CPU; and
    a memory communicatively connected to the at least one CPU; wherein
    the memory stores instructions executable by the at least one CPU, and the instructions are executed by the at least one CPU to enable the at least one CPU to perform the method according to any one of claims 6-8.
  19. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-5 or 6-8.
  20. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5 or 6-8.
PCT/CN2022/115511 2022-04-26 2022-08-29 Model inference methods and apparatuses, devices, and storage medium WO2023206889A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210450393.0A CN114819084B (en) 2022-04-26 2022-04-26 Model reasoning method, device, equipment and storage medium
CN202210450393.0 2022-04-26

Publications (1)

Publication Number Publication Date
WO2023206889A1 true WO2023206889A1 (en) 2023-11-02

Family

ID=82507217

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115511 WO2023206889A1 (en) 2022-04-26 2022-08-29 Model inference methods and apparatuses, devices, and storage medium

Country Status (2)

Country Link
CN (1) CN114819084B (en)
WO (1) WO2023206889A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819084B (en) * 2022-04-26 2024-03-01 北京百度网讯科技有限公司 Model reasoning method, device, equipment and storage medium
CN115373861B (en) * 2022-10-26 2022-12-27 小米汽车科技有限公司 GPU resource scheduling method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112825154A (en) * 2019-11-20 2021-05-21 阿里巴巴集团控股有限公司 Method and device for optimizing online reasoning in deep learning and computer storage medium
CN111309479B (en) * 2020-02-14 2023-06-06 北京百度网讯科技有限公司 Method, device, equipment and medium for realizing task parallel processing
CN111582459B (en) * 2020-05-18 2023-10-20 Oppo广东移动通信有限公司 Method for executing operation, electronic equipment, device and storage medium
CN111860820A (en) * 2020-07-31 2020-10-30 北京灵汐科技有限公司 Neural network operator dividing method and device and dividing equipment
CN114327844A (en) * 2020-09-29 2022-04-12 华为技术有限公司 Memory allocation method, related device and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108292241A (en) * 2015-10-28 2018-07-17 谷歌有限责任公司 Processing computational graphs
US11176449B1 (en) * 2020-05-15 2021-11-16 Edgecortix Pte. Ltd. Neural network accelerator hardware-specific division of inference into groups of layers
WO2022037490A1 (en) * 2020-08-21 2022-02-24 北京灵汐科技有限公司 Computation method and apparatus for neural network, and computer device and storage medium
CN111899150A (en) * 2020-08-28 2020-11-06 Oppo广东移动通信有限公司 Data processing method and device, electronic equipment and storage medium
CN111814967A (en) * 2020-09-11 2020-10-23 鹏城实验室 Method, apparatus and storage medium for calculating inferential computation of neural network model
CN114819084A (en) * 2022-04-26 2022-07-29 北京百度网讯科技有限公司 Model reasoning method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114819084A (en) 2022-07-29
CN114819084B (en) 2024-03-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22939715

Country of ref document: EP

Kind code of ref document: A1