WO2021114757A1 - Optimization method and apparatus for computation graph, computer device, and storage medium - Google Patents

Info

Publication number
WO2021114757A1
WO2021114757A1 (application PCT/CN2020/113290; CN 2020113290 W)
Authority
WO
WIPO (PCT)
Prior art keywords
calculation
computing
node
current
check node
Application number
PCT/CN2020/113290
Other languages
French (fr)
Chinese (zh)
Inventor
周舒畅
王田
Original Assignee
北京迈格威科技有限公司
Application filed by 北京迈格威科技有限公司
Publication of WO2021114757A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources to service a request, the resource being the memory
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of computer technology, and in particular to a method, device, computer equipment, and storage medium for optimizing a calculation graph.
  • The optimization process of existing computing network models uses a unified optimization method: an optimization model is designed for a specific computing network model and its application environment according to the hardware index requirements proposed by the user, so that the computer device resources consumed when the optimization model is later compiled and run meet the performance index requirements proposed by the user.
  • However, such an optimization method applies only to the same computing network model and application environment; when the computing network model or the application environment changes, a corresponding optimization method must be redesigned. The adaptability of this optimization method is therefore extremely low, which in turn makes the operating efficiency of the computational network model extremely low.
  • In a first aspect, a method for optimizing a calculation graph includes: obtaining a calculation graph of a calculation network model, where the calculation graph includes multiple calculation nodes; inserting at least one check node into the calculation graph; obtaining a current performance margin through the current check node when execution reaches each check node; determining an optimization strategy according to the current performance margin; and optimizing, according to the optimization strategy, the computing nodes after the current check node.
  • In one embodiment, the current performance margin includes a current delay performance margin.
  • Determining the optimization strategy includes: if the current delay performance margin is sufficient, determining a storage optimization strategy as the optimization strategy, where the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation; if the current delay performance margin is not sufficient, determining a delay optimization strategy as the optimization strategy, where the delay optimization strategy is used to reduce the time consumed by the computing node during calculation.
  • In one embodiment, the current performance margin includes a current storage performance margin, and the optimization strategy is determined according to the current storage performance margin as follows: if the current storage performance margin is sufficient, the delay optimization strategy is determined as the optimization strategy, where the delay optimization strategy is used to reduce the time consumed by the computing node during calculation; if the current storage performance margin is not sufficient, the storage optimization strategy is determined as the optimization strategy, where the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation.
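  • The strategy selection described in the two embodiments above can be sketched as follows; this is a minimal illustration, and the function name, the string strategy labels, and the boolean margin test are assumptions of this sketch rather than the patent's implementation:

```python
def choose_strategy(margin_type, margin_is_sufficient):
    """Map a performance margin to an optimization strategy.

    margin_type: "delay" or "storage" -- which index the margin measures.
    margin_is_sufficient: whether the measured margin meets the threshold.
    Returns the strategy to apply to the computing nodes after the check node.
    """
    if margin_type == "delay":
        # Ample time budget -> spend latency to save memory; tight budget -> the reverse.
        return "storage_optimization" if margin_is_sufficient else "delay_optimization"
    if margin_type == "storage":
        # Ample memory -> spend memory to save time; tight memory -> the reverse.
        return "delay_optimization" if margin_is_sufficient else "storage_optimization"
    raise ValueError(f"unknown margin type: {margin_type}")

print(choose_strategy("delay", True))     # storage_optimization
print(choose_strategy("storage", False))  # storage_optimization
```

  • Note that each strategy spends the resource that is plentiful to save the one that is scarce, which is why the two branches mirror each other.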
  • In one embodiment, the storage optimization strategy includes: storing the data generated during calculation by the computing nodes after the check node in a storage space with high access latency, where the storage space with high access latency includes at least global memory and off-chip memory. And/or, the delay optimization strategy includes: storing the data generated during calculation by the computing nodes after the check node in a storage space with low access latency, where the storage space with low access latency includes at least cache space and on-chip storage.
  • In one embodiment, the delay optimization strategy further includes: obtaining the size of the data generated during calculation by the computing node after the check node; comparing that size with the size of a preset storage space; if the size of the data generated during calculation exceeds the preset storage space, splitting the computing node after the check node and storing the data generated during calculation by the split computing nodes in the storage space with low access latency; if it does not exceed the preset storage space, storing the data generated during calculation by the computing node after the check node in the storage space with low access latency.
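  • The splitting rule in this embodiment can be illustrated as below; the function name, the byte-size inputs, and modeling a split node as fixed-size chunks are assumptions of this sketch, not the patent's implementation:

```python
def place_data(node_data_bytes, preset_storage_bytes):
    """Decide where the data a node produces should live under the delay
    optimization strategy: split the node if its data would overflow the
    low-latency storage, otherwise keep it whole.

    Returns a list of (chunk_bytes, "low_latency") placements; a split node
    is modeled as several chunks no larger than the preset storage space.
    """
    if node_data_bytes <= preset_storage_bytes:
        # Data fits: keep the node whole and place its data in fast storage.
        return [(node_data_bytes, "low_latency")]
    # Data overflows: split into chunks that each fit the low-latency space.
    chunks = []
    remaining = node_data_bytes
    while remaining > 0:
        chunk = min(remaining, preset_storage_bytes)
        chunks.append((chunk, "low_latency"))
        remaining -= chunk
    return chunks
```

  • For example, a node producing 10 units of data against a 4-unit preset storage is split into three chunks, each small enough to stay in the low-latency space.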
  • In one embodiment, obtaining the current performance margin through the current check node includes: obtaining a first total target calculation time and a total actual calculation time of all computing nodes before the current check node, and determining the current delay performance margin according to the first total target calculation time and the total actual calculation time.
  • In one embodiment, obtaining the first total target calculation time of all computing nodes before the current check node includes: obtaining a second total target calculation time of all computing nodes on the path where the current check node is located, and determining the first total target calculation time according to the second total target calculation time and a preset ratio, where the preset ratio is the proportion of the total calculation time of all computing nodes before the current check node to the total calculation time of all computing nodes on the path where the check node is located.
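  • The arithmetic of this embodiment can be illustrated with a small sketch; the function name and the concrete numbers are hypothetical:

```python
def delay_margin(second_total_target_s, preset_ratio, total_actual_s):
    """Estimate the current delay performance margin at a check node.

    second_total_target_s: target time for all nodes on the check node's path.
    preset_ratio: fraction of that path's time attributed to the nodes
                  before the check node.
    total_actual_s: time actually spent on those nodes so far.
    Returns the margin (positive means execution is ahead of the target).
    """
    first_total_target_s = second_total_target_s * preset_ratio
    return first_total_target_s - total_actual_s

# Target for the whole path is 100 ms, and 40% of it lies before the check
# node, so the target up to here is 40 ms; 35 ms actually elapsed, leaving
# about 5 ms of headroom (a sufficient margin).
print(delay_margin(0.100, 0.4, 0.035))
```

  • A positive margin here corresponds to the "sufficient" case of the earlier embodiments, and a negative margin to the "not sufficient" case.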
  • In one embodiment, inserting at least one check node in the calculation graph includes: obtaining the proportion of calculation time consumed by each computing node on the longest path in the calculation graph; determining the insertion position of at least one check node on the longest path according to the calculation time proportions; and inserting at least one check node at the determined insertion position.
  • In one embodiment, obtaining the proportion of calculation time consumed by each computing node on the longest path includes: obtaining the calculation amount of each computing node on the longest path; obtaining the calculation time consumed by each computing node according to its calculation amount; and determining the proportion of calculation time consumed by each computing node on the longest path according to those calculation times.
  • In one embodiment, obtaining the proportion of calculation time consumed by each computing node on the longest path includes: constructing a time-consumption estimation model; using the model to obtain the calculation time consumed by each computing node on the longest path; and determining the proportion of calculation time consumed by each computing node according to those calculation times.
  • In one embodiment, determining the insertion position of at least one check node on the longest path according to the calculation time proportions includes: dividing the longest path into a preset number of sub-paths according to the calculation time proportions, and selecting at least one of the sub-paths as the insertion position for a check node.
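  • One way to read this embodiment is as an even division of the longest path by cumulative calculation time; the function below is an illustrative sketch under that assumption (boundaries fall at the earliest node whose cumulative time reaches each target), not the patent's algorithm:

```python
def check_node_positions(node_times, num_subpaths):
    """Divide the longest path into num_subpaths sub-paths of roughly equal
    total calculation time, and return the node indices after which a check
    node would be inserted (one boundary between consecutive sub-paths)."""
    total = sum(node_times)
    target = total / num_subpaths
    boundaries = []
    acc = 0.0
    for i, t in enumerate(node_times[:-1]):  # never insert after the last node
        acc += t
        # A boundary is placed once the cumulative time reaches the next
        # multiple of the per-sub-path target.
        if acc >= target * (len(boundaries) + 1):
            boundaries.append(i)
            if len(boundaries) == num_subpaths - 1:
                break
    return boundaries

# Four nodes consuming 10%, 40%, 30%, 20% of the path's time, two sub-paths:
# the boundary falls after the second node (index 1).
print(check_node_positions([1, 4, 3, 2], 2))  # [1]
```

  • Dividing by time rather than by node count keeps the check nodes roughly evenly spaced in wall-clock terms, so each measured margin covers a comparable amount of work.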
  • In one embodiment, inserting at least one check node in the calculation graph includes: obtaining a start calculation node and an end calculation node that are separated by at least one calculation node in the calculation graph, and inserting at least one check node at an intermediate position between the start calculation node and the end calculation node.
  • obtaining the calculation graph of the calculation network model includes: loading the topology structure and parameters of the calculation network model; compiling the topology structure and parameters of the calculation network model to obtain the calculation graph of the calculation network model.
  • an optimization device for a calculation graph includes:
  • the first obtaining module is used to obtain a calculation graph of a calculation network model; the calculation graph includes a plurality of calculation nodes;
  • the insertion module is used to insert at least one check node in the calculation graph;
  • the second obtaining module is used to obtain the current performance margin through the current check node when running to each check node;
  • the determining module is used to determine the optimization strategy according to the current performance margin
  • the optimization module is used to optimize the computing nodes after the current check node according to the optimization strategy.
  • In a third aspect, a computer device includes a memory and a processor, the memory storing a computer program; when the processor executes the computer program, it implements the calculation graph optimization method described in any embodiment of the first aspect.
  • a computer-readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the method for optimizing the calculation graph described in any one of the embodiments of the first aspect.
  • According to the method, device, computer equipment, and storage medium for optimizing calculation graphs provided in this application, a calculation graph of a calculation network model including multiple calculation nodes is obtained, at least one check node is inserted in the calculation graph, and when execution reaches each check node, the current performance margin is obtained through the current check node; an optimization strategy is then determined according to the current performance margin, and the resources to be consumed by the computing nodes after the current check node are optimized according to the optimization strategy.
  • By inserting check nodes, the above optimization method obtains the current performance margin of the computer equipment whenever execution reaches a check node, and then selects an optimization strategy that matches the actual operation of the computer equipment to optimize the resources consumed by the computing nodes after the check node. In this way, while the computer equipment runs the calculation graph, the resource usage of each computing node can be dynamically adjusted to meet the user's performance index requirements for the calculation graph and to improve the resource utilization of the computer equipment.
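  • The overall loop described above (run the graph, measure the margin at each check node, re-optimize the remaining nodes) can be sketched as follows; the Node class, the callback signatures, and the string tagging are all assumptions of this illustration, not the patent's implementation:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    is_check_node: bool = False

def run_graph(nodes, measure_margin, optimize):
    """Execute nodes in order; at each check node, measure the current
    performance margin and re-optimize the nodes that have not yet run."""
    executed = []
    for i, node in enumerate(nodes):
        if node.is_check_node:
            margin = measure_margin(executed)
            nodes[i + 1:] = optimize(nodes[i + 1:], margin)
        else:
            executed.append(node.name)
    return executed

def measure(executed):
    # Stand-in measurement: pretend the delay margin is always sufficient.
    return {"delay_sufficient": True}

def optimize(rest, margin):
    # Tag the remaining nodes with the strategy chosen from the margin.
    strategy = "storage" if margin["delay_sufficient"] else "delay"
    return [Node(f"{n.name}[{strategy}]") for n in rest]

ops = [Node("conv1"), Node("check", is_check_node=True), Node("conv2"), Node("fc")]
print(run_graph(ops, measure, optimize))  # ['conv1', 'conv2[storage]', 'fc[storage]']
```

  • The key property of the loop is that only the nodes after the check node are rewritten, so already-executed work is never revisited.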
  • FIG. 1 is a schematic diagram of the internal structure of a computer device provided by an embodiment
  • FIG. 2 is a flowchart of a method for optimizing a calculation graph according to an embodiment
  • FIG. 2A is a flowchart of a method for optimizing a calculation graph provided by an embodiment;
  • FIG. 3 is a flowchart of an implementation manner of S103 in the embodiment of FIG. 2;
  • FIG. 4 is a flowchart of an implementation manner of S201 in the embodiment of FIG. 3;
  • FIG. 5 is a flowchart of an implementation manner of S102 in the embodiment of FIG. 2;
  • FIG. 6 is a flowchart of an implementation manner of S401 in the embodiment of FIG. 5;
  • FIG. 7 is a flowchart of another implementation manner of S401 in the embodiment of FIG. 5;
  • FIG. 8 is a flowchart of another implementation manner of S402 in the embodiment of FIG. 5;
  • FIG. 8A is a schematic structural diagram of a calculation graph provided by an embodiment
  • FIG. 8B is a schematic structural diagram of a calculation graph provided by an embodiment
  • FIG. 9 is a flowchart of another implementation manner of S102 in the embodiment of FIG. 2;
  • FIG. 9A is a schematic structural diagram of a calculation graph provided by an embodiment
  • FIG. 10 is a flowchart of an implementation manner of S101 in the embodiment of FIG. 2;
  • FIG. 11 is a flowchart of a method for optimizing a calculation graph provided by an embodiment
  • FIG. 12 is a schematic structural diagram of an optimization device for a calculation graph provided by an embodiment
  • FIG. 13 schematically shows a block diagram of a computing processing device for executing the method according to the present invention;
  • FIG. 14 schematically shows a storage unit for holding or carrying program codes for implementing the method according to the present invention.
  • the method for optimizing the calculation graph provided in this application can be applied to the computer device as shown in FIG. 1.
  • the computer device may be a server or a terminal, and its internal structure diagram may be as shown in FIG. 1.
  • the computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize an optimization method of the calculation graph.
  • the display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen
  • the input device of the computer equipment can be a touch layer covering the display screen, a button, a trackball, or a touchpad set on the housing of the computer equipment, or an external keyboard, touchpad, or mouse.
  • FIG. 1 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied.
  • a specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • FIG. 2 is a flowchart of a method for optimizing a calculation graph provided by an embodiment.
  • the execution body of the method is the computer device in FIG. 1, and the method involves the computer device running the calculation graph of a calculation network model.
  • the specific process of optimizing the calculation graph is shown in FIG. 2, and the method specifically includes the following steps:
  • the computing network model may be constructed by computer equipment in advance according to actual application requirements, and it may specifically be a computing network model with various functional applications, such as a neural network model, a machine learning network model, and an intelligent algorithm network model.
  • Computational graphs are a kind of "language" for describing calculation methods; a computational graph is composed of multiple computing nodes, and computing nodes with dependencies are connected to each other.
  • the computing node may include code for executing a certain computing function, so that when the computer device runs to the computing node, it can execute the corresponding computing task in the computing network model.
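  • A computation graph of this kind can be illustrated with a minimal sketch, where each node holds a function and the names of the nodes it depends on; the dictionary representation and the `run` helper below are assumptions of this illustration, not the patent's data structure:

```python
# graph: {node_name: (function, [names of dependency nodes])}
def run(graph, inputs):
    """Evaluate every node of a small computation graph, following the
    dependency edges between nodes (memoized depth-first evaluation)."""
    values = dict(inputs)
    def evaluate(name):
        if name not in values:
            fn, deps = graph[name]
            values[name] = fn(*[evaluate(d) for d in deps])
        return values[name]
    return {name: evaluate(name) for name in graph}

g = {
    "sum": (lambda a, b: a + b, ["x", "y"]),   # depends on the inputs x, y
    "sq":  (lambda s: s * s,    ["sum"]),      # depends on the node "sum"
}
print(run(g, {"x": 2, "y": 3})["sq"])  # 25
```

  • Each entry plays the role of a computing node: when execution reaches it, the attached function is run on the outputs of its dependencies, just as the description above says a node's code executes its computing task.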
  • the computer device can compile a preset calculation network model through a compiler to generate a compiled calculation graph.
  • the computer device may also directly obtain, through other methods, the calculation graph of the calculation network model after compilation, which is not limited in this embodiment.
  • the computer device may also construct the computational network model according to actual application requirements in advance, and then compile the computational network model based on the constructed computational network model for later use.
  • the computer device can also directly obtain a pre-compiled computing network model, and then compile it based on the obtained computing network model for later use during runtime, which is not limited in this embodiment.
  • the check node may include code for executing a certain calculation or test function, so that when the computer device runs to the check node, the corresponding calculation or test task can be executed; the check node may be pre-configured by the computer device.
  • after the computer device obtains the calculation graph of the calculation network model, if the calculation graph needs to be optimized while it is later run, at least one check node may be inserted into the calculation graph.
  • through the check node, the computer device can detect its resource consumption at the current moment and dynamically adjust the resource utilization of subsequent computing nodes according to that consumption, so that while the calculation graph is executed, the resources consumed can always meet the performance indexes proposed by the user or reach the optimum, making full use of the resources of the computer equipment.
  • the current performance margin represents the margin between the computer device resources actually consumed when the computer device runs to the current check node and the computer device resources indicated by the performance index expected by the user.
  • the current performance margin may be the performance margin of a delay performance index, the performance margin of a storage performance index, or the performance margin of another type of performance index consumed by the computer equipment during calculation.
  • the computer device can obtain the current performance margin by executing the code on the check node, so that it can then, according to the current performance margin, optimize the computing nodes after the check node in different ways, allowing the optimized computing nodes to make full use of the resources of the computer device when they are executed.
  • S104 Determine an optimization strategy according to the current performance margin.
  • the optimization strategy is used to optimize the resources consumed by the computing nodes after the check node, so that the resources consumed by the subsequent computing nodes when executed can meet user needs or match the performance indicators of the computing device.
  • when the computer device obtains the current performance margin through the current check node, it can determine whether the current performance margin is sufficient, and then select different optimization strategies according to the judgment result to dynamically optimize the computing nodes after the check node in the calculation graph.
  • for example, if the current performance margin indicates that the margin of the storage performance index is not sufficient, an optimization strategy that reduces memory usage can be used during compilation and running to reduce the storage performance consumption of the computer equipment, so that all performance indexes of the computer equipment can meet user needs;
  • if the current performance margin indicates that the margin of the delay performance index is not sufficient, an optimization strategy that reduces memory access operations with high access latency can be used during compilation and running to reduce the delay performance consumption of the computer equipment, so that all performance indexes of the computer equipment can meet user needs;
  • similarly, if the current performance margin indicates that the margin of the storage performance index is sufficient, the optimization strategy that reduces memory access operations with high access latency can be used during compilation and running, so as to reduce the delay performance consumption of the computer equipment during calculation.
  • the parameters or variables on the computing node after the current check node can be optimized according to the optimization strategy.
  • for example, the computer device can change the storage method of the parameters or variables on a computing node, thereby changing the time the computing node takes to read or write data, and in turn the computing time when the computer device runs to that node, so as to improve the delay performance of the computer device and complete the optimization of the computing node.
  • a computer device can also split a computing node, so that the resources consumed by one computing node are divided among multiple computing nodes, reducing the resource-consumption burden when the computer device runs to each computing node and completing the optimization of the computing node.
  • in the method for optimizing a calculation graph provided in this embodiment, a calculation graph of a calculation network model including multiple calculation nodes is obtained, at least one check node is inserted in the calculation graph, and when execution reaches each check node, the current performance margin is obtained through the current check node; an optimization strategy is then determined according to the current performance margin, and the resources to be consumed by the computing nodes after the current check node are optimized according to the optimization strategy.
  • by inserting check nodes, the optimization method obtains the current performance margin of the computer equipment whenever execution reaches a check node, and then selects an optimization strategy matching the actual operation of the computer equipment to optimize the resources consumed by the computing nodes after the check node, so that while the computer equipment runs the calculation graph, the resource usage of each computing node can be dynamically adjusted to meet the user's performance index requirements for the calculation graph and to improve the resource utilization of the computer equipment.
  • in one embodiment, the foregoing current performance margin includes a performance margin representing a delay performance index, that is, the current delay performance margin.
  • accordingly, this application provides an implementation of the foregoing S104, including: if the current delay performance margin is sufficient, determining the storage optimization strategy as the optimization strategy, where the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation.
  • This embodiment relates to an application scenario in which the computer device finds the current delay performance margin sufficient, indicating that the computing time available to the computer device is ample and can also meet the computing requirements of later computing nodes.
  • the computer device therefore does not need to pay attention to the computing time consumed by the computing nodes, and can instead focus on the memory they occupy during calculation, optimizing the memory resources of the computer device and preventing too much memory from being occupied, which would affect the computing performance of the computer device and in turn the speed at which the calculation graph is executed.
  • the aforementioned storage optimization strategy may specifically include: storing data generated by the computing node after the check node during calculation in a storage space with high access latency; the storage space with high access latency includes at least global memory and off-chip memory.
  • the data generated by the calculation node during calculation may include intermediate results and temporary variables required in the calculation.
  • when the computer device optimizes the computing nodes after the check node according to the storage optimization strategy, it can specifically store the data generated by those computing nodes in a storage space with high access latency, for example, the global memory of a GPU or the off-chip storage of a TPU, to reduce the memory occupancy of the computer equipment and thereby increase its computing speed.
  • if the current delay performance margin is not sufficient, the delay optimization strategy is determined as the optimization strategy; the delay optimization strategy is used to reduce the computing time spent by the computing node during calculation.
  • This embodiment relates to an application scenario in which the computer device finds the current delay performance margin insufficient, indicating that the computing time available to the computer equipment is tight and may not meet the computing requirements of later computing nodes.
  • the computer equipment therefore needs to focus on the computing time consumed by computing nodes during calculation, and can ignore the memory they occupy, optimizing the delay performance of the computer device and preventing computing nodes from consuming too much time during calculation, which would affect the speed at which the calculation graph is executed.
  • the above-mentioned delay optimization strategy may specifically include: storing data generated by the computing node after the check node during calculation in a low-access-latency storage space; the low-access-latency storage space includes at least cache space and on-chip storage.
  • when the computer device optimizes the computing nodes after the check node according to the delay optimization strategy, it can specifically store the data generated by those computing nodes in a storage space with low access latency, for example, the memory or cache of the computer equipment, to reduce the time the computing nodes spend accessing storage during calculation, thereby increasing the computing speed of the computing nodes and of the computer equipment.
  • the foregoing current performance margin includes a performance margin representing a storage performance index, that is, the current storage performance margin.
  • accordingly, this application provides an implementation of the foregoing S104, including: if the current storage performance margin is sufficient, determining the delay optimization strategy as the optimization strategy, where the delay optimization strategy is used to reduce the time consumed by the computing node during calculation.
  • This embodiment relates to an application scenario in which the computer device finds the current storage performance margin sufficient, indicating that the memory resources of the computer device are ample and can also meet the computing needs of later computing nodes.
  • the computer device therefore needs to focus on the computing time consumed by the computing nodes during calculation, and can ignore the memory they occupy, optimizing the delay performance of the computer device and preventing computing nodes from spending too much time during calculation, which would affect the speed at which the calculation graph is executed.
  • if the current storage performance margin is not sufficient, the storage optimization strategy is determined as the optimization strategy; the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation.
  • This embodiment relates to an application scenario in which the computer device finds the current storage performance margin insufficient, indicating that the memory resources of the computer device are tight and may not meet the computing needs of later computing nodes.
  • the computer device therefore needs to focus on the memory occupied by the computing nodes during calculation, and can ignore the time they consume, optimizing the storage performance of the computer device and preventing computing nodes from occupying too much memory during calculation, which would affect the speed at which the calculation graph is executed.
  • the delay optimization strategy may further include:
  • This embodiment is applicable to an application scenario when the memory or cache on a computer device cannot meet the memory or cache required by the computing node for calculation.
  • when the optimization strategy determined by the computer device is the delay optimization strategy, the size of the data generated during calculation by the computing node after the current check node can be obtained first, so as to estimate, based on that size, whether the memory or cache on the computer device meets the computing requirements.
  • Step S1042: Compare the size of the data generated by the computing node during calculation with the size of the preset storage space. If the size of the data generated by the computing node during calculation exceeds the size of the preset storage space, step S1043 is executed; if it does not exceed the size of the preset storage space, step S1044 is executed.
  • When the computer device obtains the size of the data generated by the computing node during calculation, it can further compare that size with the size of the preset storage space to obtain a comparison result, and then select different delay optimization strategies according to the comparison result to optimize the computing nodes after the current check node.
  • the aforementioned preset storage space may be a storage space with low access latency, such as the memory and/or cache space of a computer device.
  • The above comparison result falls into two cases: the size of the data generated by the computing node during calculation exceeds the size of the preset storage space, which indicates that the existing storage space on the computer device cannot meet the computing requirements of the computing node; or the size of the data does not exceed the size of the preset storage space, which indicates that the existing storage space on the computer device is still relatively abundant and can meet the computing requirements of the computing node.
  • S1043: Split the computing node after the current check node, and store the data generated during calculation by the split computing nodes in a storage space with low access latency.
  • This embodiment relates to an application scenario in which the above-mentioned comparison result is that the size of the data generated by the computing node during calculation exceeds the size of the preset storage space.
  • In this scenario, the computer device can split the computing node after the check node, and then store the data generated during calculation by the split computing nodes in a storage space with low access latency, that is, the memory and/or cache on the computer device. Because the computing node has been split, the existing storage space on the computer device can accommodate the storage space required by each of the split computing nodes during calculation.
  • This embodiment relates to an application scenario where the result of the above comparison is that the size of the data generated by the computing node during calculation does not exceed the size of the preset storage space.
  • In this scenario, the computer device can directly store the data generated during calculation by the computing nodes after the check node in a storage space with low access latency. This step is the same as that described above for the delay optimization strategy; for details, please refer to the foregoing description, which will not be repeated here.
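The branching of steps S1042–S1044 can be sketched as follows. This is an illustrative simplification in which data sizes are plain integers and a node is split into the smallest number of near-equal parts that each fit the low-latency space; the splitting rule itself is an assumption, since the description does not fix one.

```python
def delay_optimize(data_size: int, fast_space: int) -> list[int]:
    """Return the data-size parts to place in low-access-latency storage.

    If the node's data fits the preset storage space, it is stored as-is
    (S1044); otherwise the node is split so that every part fits (S1043).
    """
    if data_size <= fast_space:
        return [data_size]                      # store directly
    parts = -(-data_size // fast_space)         # ceiling division
    base, extra = divmod(data_size, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]
```

For example, a node producing 10 units of data against a 4-unit fast space is split into parts of sizes 4, 3, and 3, each of which fits.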
  • Fig. 3 is a flowchart of an implementation manner of S103 in the embodiment of Fig. 2. As shown in Fig. 3, the “obtain current performance margin through the current check node” in S103 includes:
  • The first total target calculation consumption duration indicates the cumulative calculation duration that the user expects all the computing nodes before the current check node to take while performing their calculations.
  • The total actual calculation consumption duration refers to the actual calculation time accumulated by all the computing nodes before the current check node, measured when the computer device runs to the current check node.
  • When the computer device needs to obtain the current performance margin, it can first obtain the first total target calculation consumption duration and the total actual calculation consumption duration of all computing nodes before the current check node, and then determine the current performance margin from these two durations.
  • S202 Determine the current performance margin according to the first total target calculation consumption time period and the total actual calculation consumption time period.
  • When the computer device obtains the first total target calculation consumption duration and the total actual calculation consumption duration, it can directly compute their difference to obtain the current performance margin; optionally, it can also weight the two durations separately before computing the difference to obtain the current performance margin.
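The two variants just described (a plain difference, or a weighted difference) might be sketched like this; the weight parameters and their default values are illustrative assumptions.

```python
def current_margin(target_total: float, actual_total: float,
                   w_target: float = 1.0, w_actual: float = 1.0) -> float:
    """Current delay performance margin from the first total target duration
    and the total actual duration. With unit weights this is a plain
    difference; otherwise it is a weighted difference."""
    return w_target * target_total - w_actual * actual_total
```

A positive margin means execution is ahead of the user's target; a negative margin means it is behind.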
  • a method of "obtaining the first total target calculation consumption time length of all computing nodes before the current check node" in S201 may specifically include:
  • the second total target calculation consumption time indicates the accumulated calculation consumption time expected by the user during calculation of all the computing nodes on the path where the current check node is located.
  • When the computer device needs to obtain the first total target calculation consumption duration of all computing nodes before the current check node, it can first obtain the second total target calculation consumption duration of all computing nodes on the path where the current check node is located, and then determine the first total target calculation consumption duration according to the second.
  • S302: Determine the first total target calculation consumption duration according to the second total target calculation consumption duration and a preset ratio; the preset ratio is the proportion of the total calculation consumption duration of all computing nodes before the current check node in the total calculation consumption duration of all computing nodes on the path where the check node is located.
  • the preset ratio can be obtained in advance by a computer device, and can be specifically obtained by a variety of methods.
  • For example, the computer device can pre-calculate the calculation amount of each computing node in the calculation graph, estimate the calculation consumption duration of each computing node based on its calculation amount, and finally determine the preset ratio from the total calculation consumption duration of the computing nodes before the current check node and the total calculation consumption duration of all the computing nodes.
  • Alternatively, the computer device can use an existing consumption time estimation model to estimate the calculation consumption duration of each computing node, and then correspondingly determine the preset ratio from the total calculation consumption duration of the computing nodes before the current check node and the total calculation consumption duration of all the computing nodes.
  • the preset ratio may also be determined by other methods, which is not limited in this embodiment.
  • Specifically, the second total target calculation consumption duration may be multiplied by the preset ratio to obtain the first total target calculation consumption duration. For example, if the second total target calculation consumption duration is 10 hours and the preset ratio is 1/2, the corresponding first total target calculation consumption duration is 5 hours.
  • Optionally, the computer device may also weight the second total target calculation consumption duration before performing the multiplication with the preset ratio to obtain the first total target calculation consumption duration.
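The 10-hour example above can be reproduced with a one-line sketch; the optional `weight` parameter is an assumption standing in for the weighted variant mentioned in this paragraph.

```python
def first_total_target(second_total: float, preset_ratio: float,
                       weight: float = 1.0) -> float:
    # First total target duration = (weighted) second total target * preset ratio.
    return weight * second_total * preset_ratio


print(first_total_target(10.0, 0.5))   # the 10-hour, ratio-1/2 example
```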
  • Fig. 5 is a flowchart of an implementation manner of S102 in the embodiment of Fig. 2. As shown in Fig. 5, “insert at least one check node in the calculation graph” in S102 includes:
  • The computer device may first determine the longest path according to the layout of the computing nodes in the calculation graph, then obtain the calculation consumption duration of each computing node on the longest path, and then compute from these durations the proportion of computing time consumed by each computing node on the longest path.
  • S402 Determine the insertion position of at least one check node on the longest path according to the proportion of the calculated consumption time.
  • In order to balance the computing time consumed on the longest path, when inserting a check node the computer device can determine the insertion position of at least one check node on the longest path according to the proportion of computing time consumed, so that the total computing time of all computing nodes before the inserted check node and that of all computing nodes after it are as equal as possible.
  • When the computer device has determined the insertion position of at least one check node on the longest path, it can insert at least one check node at that position, so as to later optimize the computing nodes after the check node.
  • this application provides a specific way for a computer device to obtain the proportion of computing time consumed.
  • One implementation of the above S401, "obtain the proportion of computing time consumed by each computing node on the longest path in the calculation graph", may specifically include:
  • When the computer device compiles the computing network model to obtain the corresponding calculation graph, it can obtain the calculation amount of each computing node in the calculation graph according to the calculation steps included in each computing node. Therefore, in this embodiment, the computer device can first determine the longest path in the calculation graph, then determine the computing nodes included on the longest path, and then obtain the calculation amount of each computing node on the longest path from information such as the calculation steps each of them includes.
  • When the computer device obtains the calculation amount of each computing node on the longest path, it can further estimate the calculation consumption duration of each computing node according to its calculation amount, thereby obtaining the calculation consumption duration of each computing node on the longest path.
  • The greater the calculation amount of a computing node, the longer its estimated calculation consumption duration; the smaller the calculation amount, the shorter the estimated duration.
  • S503 Determine the proportion of the computing time consumed by each computing node on the longest path according to the computing time consumed by each computing node on the longest path.
  • When the computer device obtains the calculation consumption duration of each computing node on the longest path, it can compute the ratio among these durations to obtain the proportion of the calculation consumption duration of each computing node on the longest path.
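Step S503, computing each node's share of the total consumption time, is a simple normalization. A minimal sketch, assuming the durations are given as a list of numbers:

```python
def time_proportions(durations: list[float]) -> list[float]:
    """Proportion of the total calculation time consumed by each
    computing node on the longest path."""
    total = sum(durations)
    return [d / total for d in durations]
```

With per-node durations in the ratio 1:5:6 (as in the worked example later in this description), the proportions are 1/12, 5/12, and 6/12.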
  • the present application provides another specific way for the computer device to obtain the proportion of computing time consumed.
  • Another implementation of the above S401, "obtain the proportion of computing time consumed by each computing node on the longest path in the calculation graph", may specifically include:
  • When the computer device needs to obtain the proportion of computing time consumed by each computing node on the longest path, it can first construct a consumption time estimation model that analyzes the calculation steps included in each computing node and estimates its calculation consumption.
  • Optionally, the consumption time estimation model may be a pre-trained estimation model; such models belong to the prior art and are not described in detail here.
  • S602. Use the consumption time estimation model to obtain the calculation consumption time of each computing node on the longest path.
  • Specifically, the consumption time estimation model can estimate the calculation consumption duration of each computing node on the longest path by analyzing the calculation steps of each computing node on that path.
  • S603 Determine the proportion of the computing time consumed by each computing node on the longest path according to the computing time consumed by each computing node on the longest path.
  • Step S603 in this embodiment is the same in content as step S503.
  • step S402 determines the insertion position of at least one check node on the longest path according to the proportion of the calculated consumption time.
  • the preset number represents the number of pre-inserted check nodes on the longest path
  • the preset number can be determined by the computer device in advance according to the length of the longest path or actual application requirements.
  • When the computer device determines the preset number of check nodes to be inserted, it can further divide the longest path into a preset number of sub-paths by analyzing the proportion of computing time consumed by the computing nodes on the longest path, so as to balance the computing time consumed by the computing nodes on each sub-path.
  • At least one sub-path can then be selected from the multiple sub-paths as the insertion position of the check node, where the total computing time of all computing nodes before the selected sub-path and the total computing time of all computing nodes after it are as equal as possible, so as to balance the computing time of the computing nodes on the longest path.
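The placement rule described above — equalize the cumulative computing time before and after each inserted check node — can be sketched as a greedy scan over the per-node durations. The index convention and the even-split target are illustrative assumptions.

```python
def checkpoint_positions(durations: list[float], num_checks: int) -> list[int]:
    """Indices of the nodes after which to insert check nodes, so that the
    longest path is divided into num_checks + 1 sub-paths of roughly
    equal total computing time."""
    total = sum(durations)
    positions, cumulative, k = [], 0.0, 1
    for i, d in enumerate(durations):
        cumulative += d
        # Insert the k-th check node once k/(num_checks+1) of the work is done.
        while k <= num_checks and cumulative >= total * k / (num_checks + 1):
            positions.append(i)
            k += 1
    return positions


print(checkpoint_positions([1, 5, 6], 1))   # one check node after the second node
```

With durations in the ratio 1:5:6 and one check node, the check node lands after the second computing node, matching the placement of check node a in the worked example later in this description.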
  • Fig. 9 is a flowchart of another implementation manner of S102 in the embodiment of Fig. 2. As shown in Fig. 9, "insert at least one check node in the calculation graph" in S102 includes:
  • the beginning and end computing nodes on the path can be obtained, so that check nodes can be inserted between the beginning and end computing nodes later.
  • the number of cross-computing nodes may be one or multiple, which is not limited in this embodiment.
  • At least one check node can be inserted between the start computing node and the end computing node. The method of inserting at least one check node in the calculation graph described in the embodiment of FIG. 9 is illustrated by way of example with the calculation graph shown in FIG.
  • FIG. 10 is a flowchart of an implementation manner of S101 in the embodiment of FIG. 2. As shown in FIG. 10, the "acquisition of the calculation diagram of the calculation network model" in the above S101 includes:
  • the compiler of the computer equipment can obtain the calculation graph of the calculation network model by loading the topology structure and parameters of the calculation network model to compile.
  • When the compiler of the computer device loads the topological structure and parameters of the computing network model, it can compile them to obtain the calculation graph of the computing network model, so that the computing resources consumed by the calculation graph can be optimized later while the calculation graph is running.
  • this application also provides a calculation graph optimization method, as shown in FIG. 11, the method includes:
  • step S1009 Determine whether the current delay performance margin is sufficient. If the current delay performance margin is sufficient, perform step S1010; if the current delay performance margin is insufficient, perform step S1011.
  • the storage optimization strategy includes: storing data generated during the calculation of the computing nodes after the current check node in a storage space with high access latency.
  • Step S1011: Obtain the size of the data generated during calculation by the computing node after the current check node, and compare it with the size of the preset storage space. If the size of the data generated by the computing node during calculation exceeds the size of the preset storage space, step S1012 is executed; if it does not exceed the size of the preset storage space, step S1013 is executed.
  • the steps of the optimization method are illustrated by way of example.
  • The calculation graph shown in FIG. 8B is taken as an example for description. Assume that computing node 1, computing node 2, and computing node 3 are on the longest path in FIG. 8B, that their total preset calculation consumption duration is T, and that the memory usage is M.
  • Assume the ratio of the calculation durations of computing node 1, computing node 2, and computing node 3, estimated by the computer device from the calculation amounts or from the consumption time estimation model, is 1:5:6, and set the preset time threshold to th. Then, when the computer device runs to check node a, if the actual calculation duration of computing node 1 and computing node 2 before check node a is Tr, the current delay performance margin is T*6/12-Tr.
  • If T*6/12-Tr≥th, the storage optimization strategy is used to optimize the computing nodes after the check node; specifically, the intermediate results or temporary variables required for calculation in computing node 3 are stored in the storage space with high access latency (off-chip GPU storage space or TPU storage space). If T*6/12-Tr<th, the corresponding delay optimization strategy is used to optimize the computing nodes after the check node; specifically, the intermediate results or temporary variables required for calculation in computing node 3 are stored in the storage space with low access latency (memory or cache).
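The worked example can be checked with a short script. T, Tr, and th are the symbols from the example, and the 6/12 factor is the share of the 1:5:6 duration ratio that lies before check node a; the function name is an illustrative assumption.

```python
def decide_at_check_node_a(T: float, Tr: float, th: float) -> str:
    """Strategy choice at check node a in the FIG. 8B example."""
    margin = T * 6 / 12 - Tr   # target time before check node a, minus actual
    if margin >= th:
        return "storage_optimization"   # time to spare: save memory
    return "delay_optimization"         # behind schedule: save time


print(decide_at_check_node_a(T=12.0, Tr=4.0, th=1.0))   # margin 2.0
print(decide_at_check_node_a(T=12.0, Tr=5.5, th=1.0))   # margin 0.5
```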
  • An optimization apparatus for a calculation graph is provided, including: a first acquisition module 11, an insertion module 12, a second acquisition module 13, a determination module 14, and an optimization module 15, wherein:
  • the first obtaining module 11 is used to obtain the calculation graph of the calculation network model;
  • the calculation graph includes a plurality of calculation nodes;
  • the insertion module 12 is used to insert at least one check node in the calculation graph;
  • the second acquisition module 13 is used to obtain, when running to each check node, the current performance margin through the current check node;
  • the determination module 14 is used to determine the optimization strategy according to the current performance margin; and the optimization module 15 is used to optimize the computing nodes after the current check node according to the optimization strategy.
  • Each module in the optimization device of the above calculation graph can be implemented in whole or in part by software, hardware and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device including a memory and a processor, and a computer program is stored in the memory.
  • When the processor executes the computer program, the following steps are implemented: obtaining a calculation graph of a computing network model, the calculation graph including a plurality of computing nodes; inserting at least one check node in the calculation graph; when running to each check node, obtaining the current delay performance margin through the current check node; determining the optimization strategy according to the current delay performance margin; and optimizing the computing nodes after the current check node according to the optimization strategy.
  • the implementation principle and technical effect of a computer device provided by the foregoing embodiment are similar to those of the foregoing method embodiment, and will not be repeated here.
  • a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed, the following steps are also implemented: obtaining a calculation graph of a computing network model, the calculation graph including a plurality of computing nodes; inserting at least one check node in the calculation graph; when running to each check node, obtaining the current performance margin through the current check node; determining the optimization strategy according to the current performance margin; and optimizing the computing nodes after the current check node according to the optimization strategy.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.
  • the various component embodiments of the present invention may be implemented by hardware, or by software modules running on one or more processors, or by a combination of them.
  • a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the computing processing device according to the embodiments of the present invention.
  • the present invention can also be implemented as a device or device program (for example, a computer program and a computer program product) for executing part or all of the methods described herein.
  • Such a program for realizing the present invention may be stored on a computer-readable medium, or may have the form of one or more signals.
  • FIG. 13 shows a computing processing device that can implement the method of the present invention.
  • the computing processing device may be a computer device, which traditionally includes a processor 1010 and a computer program product in the form of a memory 1020 or a computer readable medium.
  • the memory 1020 has a storage space 1030 for executing program codes 1031 of any method steps in the above methods.
  • the storage space 1030 for program codes may include various program codes 1031 respectively used to implement various steps in the above method. These program codes can be read from or written into one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards, or floppy disks.
  • Such a computer program product is usually a portable or fixed storage unit as described with reference to FIG. 14.
  • the storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 1020 in the computing processing device of FIG. 13.
  • The program code can, for example, be compressed in an appropriate form.
  • The storage unit includes computer-readable code 1031', that is, code that can be read by a processor such as the processor 1010. When run by a computing processing device, this code causes the computing processing device to execute the various steps of the method described above.

Abstract

An optimization method and apparatus for a computation graph, a computer device, and a storage medium. The method comprises: obtaining a computation graph of a computation network model; then inserting at least one check node in the computation graph; when running to each check node, obtaining the current performance margin by means of the current check node; then determining an optimization policy according to the current performance margin; and according to the optimization policy, optimizing a resource that needs to be consumed by a computation node after the current check node. According to the optimization method, by inserting the check nodes, obtaining the current performance margin of a computer device when the computer device runs to each check node, and according to the current performance margin, selecting the optimization policy meeting the actual running condition of the computer device to optimize the resource that needs to be consumed by the computation node, the resource use condition of each computation node during computation can be dynamically adjusted in the process that the computer device runs to the computation node, thereby improving the utilization rate of the resource on the computer device.

Description

Optimization method and apparatus for computation graph, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 9, 2019, with application number 201911249112.X and invention title "Optimization Method and Apparatus for Computation Graph, Computer Device, and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to an optimization method and apparatus for a computation graph, a computer device, and a storage medium.
Background
With the development of computer network technology and the advent of the big data era, the computing network models applied in various technical fields have become increasingly complex. Highly complex computing network models, such as neural network models, challenge the hardware capabilities of computer devices. Therefore, how to optimize computing network models has become a problem of growing concern to researchers.
The optimization process of an existing computing network model uses a unified optimization method: according to the hardware index requirements proposed by the user, an optimization model is designed for the specific computing network model in combination with the application environment, so that the computer device resources consumed when the optimization model is later compiled and run can meet the performance index requirements proposed by the user.
However, the above optimization method can only be applied to one computing network model and application environment. When the computing network model or the application environment changes, the corresponding optimization method needs to be redesigned. Therefore, the adaptability of the above optimization method is extremely low, which in turn makes the operation of the computing network model extremely inefficient.
Summary of the Invention
Based on this, in view of the above technical problems, it is necessary to provide an optimization method and apparatus for a computation graph, a computer device, and a storage medium that can effectively improve adaptability and execution efficiency.
In a first aspect, a method for optimizing a computation graph is provided, the method including:
obtaining a calculation graph of a computing network model, the calculation graph including a plurality of computing nodes;
inserting at least one check node in the calculation graph;
when running to each check node, obtaining the current performance margin through the current check node;
determining an optimization strategy according to the current delay performance margin; and
optimizing the computing nodes after the current check node according to the optimization strategy.
In one of the embodiments, the current performance margin includes the current delay performance margin, and determining the optimization strategy according to the current performance margin includes: if the current delay performance margin is sufficient, determining the storage optimization strategy as the optimization strategy, the storage optimization strategy being used to reduce the memory occupied by the computing nodes during calculation; if the current delay performance margin is insufficient, determining the delay optimization strategy as the optimization strategy, the delay optimization strategy being used to reduce the calculation time consumed by the computing nodes during calculation.
In one of the embodiments, the current performance margin includes the current storage performance margin, and determining the optimization strategy according to the current performance margin includes: if the current storage performance margin is sufficient, determining the delay optimization strategy as the optimization strategy, the delay optimization strategy being used to reduce the time consumed by the computing nodes during calculation; if the current storage performance margin is insufficient, determining the storage optimization strategy as the optimization strategy, the storage optimization strategy being used to reduce the memory occupied by the computing nodes during calculation.
In one of the embodiments, the storage optimization strategy includes: storing the data generated during calculation by the computing nodes after the check node in a storage space with high access latency, the storage space with high access latency including at least global memory and off-chip memory; and/or, the delay optimization strategy includes: storing the data generated during calculation by the computing nodes after the check node in a storage space with low access latency, the storage space with low access latency including at least cache space and on-chip storage.
In one of the embodiments, the delay optimization strategy further includes: obtaining the size of the data generated during calculation by the computing nodes after the check node; comparing the size of the data generated by the computing node during calculation with the size of the preset storage space; if the size of the data generated by the computing node during calculation exceeds the size of the preset storage space, splitting the computing node after the check node and storing the data generated during calculation by the split computing nodes in the storage space with low access latency; and if the size of the data generated by the computing node during calculation does not exceed the size of the preset storage space, storing the data generated during calculation by the computing nodes after the check node in the storage space with low access latency.
在其中一个实施例中,通过当前的检查节点获取当前性能裕量,包括:获取当前的检查节点之前的所有计算节点的第一总目标计算消耗时长和总实际计算消耗时长;根据第一总目标计算消耗时长和总实际计算消耗时长,确定当前延时性能裕量。In one of the embodiments, obtaining the current performance margin through the current check node includes: obtaining the first total target calculation consumption time and the total actual calculation consumption time of all computing nodes before the current check node; and determining the current delay performance margin according to the first total target calculation consumption time and the total actual calculation consumption time.
在其中一个实施例中,获取当前的检查节点之前的所有计算节点的第一总目标计算消耗时长,包括:获取当前的检查节点所在路径上的所有计算节点的第二总目标计算消耗时长;根据第二总目标计算消耗时长和预设比例,确定第一总目标计算消耗时长;预设比例为当前的检查节点之前的所有计算节点的总计算消耗时长在检查节点所在路径上的所有计算节点的总计算消耗时长中所占比例。In one of the embodiments, obtaining the first total target calculation consumption time of all computing nodes before the current check node includes: obtaining the second total target calculation consumption time of all computing nodes on the path where the current check node is located; and determining the first total target calculation consumption time according to the second total target calculation consumption time and a preset ratio; the preset ratio is the proportion of the total calculation consumption time of all computing nodes before the current check node in the total calculation consumption time of all computing nodes on the path where the check node is located.
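Combining the two embodiments above, the delay performance margin at a check node can be sketched as follows. The names and the sign convention (positive margin means sufficient) are illustrative assumptions, not fixed by the specification:

```python
def current_delay_margin(second_total_target: float,
                         preset_ratio: float,
                         total_actual: float) -> float:
    """Compute the current delay performance margin at a check node.

    The first total target time for all nodes before the check node is
    the path's total target time scaled by the preset ratio (the share
    of the path's total time attributed to those nodes); the margin is
    that target minus the time actually consumed so far.
    """
    first_total_target = second_total_target * preset_ratio
    return first_total_target - total_actual
```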
在其中一个实施例中,在计算图中插入至少一个检查节点,包括:获取计算图中最长路径上各计算节点的计算消耗时长比例;根据计算消耗时长比例,在最长路径上确定至少一个检查节点的插入位置;在至少一个检查节点的插入位置处,插入至少一个所述检查节点。In one of the embodiments, inserting at least one check node in the calculation graph includes: obtaining the calculation consumption time proportion of each computing node on the longest path in the calculation graph; determining the insertion position of at least one check node on the longest path according to the calculation consumption time proportion; and inserting at least one check node at the insertion position of the at least one check node.
在其中一个实施例中,获取计算图中最长路径上各计算节点的计算消耗时长比例,包括:获取最长路径上各计算节点的计算量;根据各计算节点的计算量获取最长路径上各计算节点的计算消耗时长;根据最长路径上各计算节点的计算消耗时长,确定最长路径上各计算节点的计算消耗时长比例。In one of the embodiments, obtaining the calculation consumption time proportion of each computing node on the longest path in the calculation graph includes: obtaining the calculation amount of each computing node on the longest path; obtaining the calculation consumption time of each computing node on the longest path according to the calculation amount of each computing node; and determining the calculation consumption time proportion of each computing node on the longest path according to the calculation consumption time of each computing node on the longest path.
在其中一个实施例中,获取计算图中最长路径上各计算节点的计算消耗时长比例,包括:构建消耗时长预估模型;采用消耗时长预估模型,获取最长路径上各计算节点的计算消耗时长;根据最长路径上各计算节点的计算消耗时长,确定最长路径上各计算节点的计算消耗时长比例。In one of the embodiments, obtaining the calculation consumption time proportion of each computing node on the longest path in the calculation graph includes: constructing a consumption time estimation model; obtaining the calculation consumption time of each computing node on the longest path by using the consumption time estimation model; and determining the calculation consumption time proportion of each computing node on the longest path according to the calculation consumption time of each computing node on the longest path.
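As a stand-in for such a consumption-time estimation model, one can estimate each node's time from its operation count and a device throughput figure, then normalize. Both the linear cost model and the names below are assumptions made for illustration:

```python
def time_proportions(node_ops, ops_per_second):
    """Estimate per-node calculation time from operation counts, then
    return each node's proportion of the longest path's total time."""
    times = [ops / ops_per_second for ops in node_ops]  # crude cost model
    total = sum(times)
    return [t / total for t in times]
```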
在其中一个实施例中,根据计算消耗时长比例,在最长路径上确定至少一个检查节点的插入位置,包括:根据计算消耗时长比例,将最长路径均分成预设数量的多个子路径;在多个子路径中选择至少一个子路径作为插入检查点的插入位置。In one of the embodiments, determining the insertion position of at least one check node on the longest path according to the calculation consumption time proportion includes: dividing the longest path into a preset number of sub-paths according to the calculation consumption time proportion; and selecting at least one of the sub-paths as the insertion position for inserting the check node.
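The sub-path division can be sketched as cutting the path into a preset number of pieces of roughly equal accumulated time, with the boundaries as candidate insertion positions. This greedy sketch is an illustration, not the claimed algorithm:

```python
def insertion_positions(node_times, num_subpaths):
    """Divide the longest path into num_subpaths pieces of roughly equal
    accumulated calculation time; return the indices of the nodes after
    which a check node could be inserted (the sub-path boundaries)."""
    total = sum(node_times)
    target = total / num_subpaths
    positions, acc, boundary = [], 0.0, target
    for i, t in enumerate(node_times):
        acc += t
        # place a boundary once the accumulated time reaches the next target
        if acc >= boundary and len(positions) < num_subpaths - 1:
            positions.append(i)
            boundary += target
    return positions
```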
在其中一个实施例中,在计算图中插入至少一个检查节点,包括:获取计算图中间隔至少一个计算节点的始端计算节点和末端计算节点;在始端计算节点和末端计算节点的中间位置插入至少一个检查节点。In one of the embodiments, inserting at least one check node in the calculation graph includes: obtaining a start computing node and an end computing node that are separated by at least one computing node in the calculation graph; and inserting at least one check node at the middle position between the start computing node and the end computing node.
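For this midpoint variant, a minimal sketch follows; representing nodes by integer indices and the "separated by at least one node" check are illustrative assumptions:

```python
def midpoint_position(start_index: int, end_index: int) -> int:
    """Return the position between a start computing node and an end
    computing node (separated by at least one computing node) at which
    a check node can be inserted."""
    if end_index - start_index < 2:
        raise ValueError("nodes must be separated by at least one computing node")
    return (start_index + end_index) // 2
```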
在其中一个实施例中,获取计算网络模型的计算图,包括:加载计算网络模型的拓扑结构和参数;对计算网络模型的拓扑结构和参数进行编译,得到计算网络模型的计算图。In one of the embodiments, obtaining the calculation graph of the calculation network model includes: loading the topology structure and parameters of the calculation network model; compiling the topology structure and parameters of the calculation network model to obtain the calculation graph of the calculation network model.
第二方面,一种计算图的优化装置,所述装置包括:In a second aspect, an optimization device for a calculation graph, the device includes:
第一获取模块,用于获取计算网络模型的计算图;计算图中包括多个计算节点;The first obtaining module is used to obtain a calculation graph of a calculation network model; the calculation graph includes a plurality of calculation nodes;
插入模块,用于在计算图中插入至少一个检查节点;Insert module, used to insert at least one check node in the calculation graph;
第二获取模块,用于当运行至每一个检查节点时,通过当前的检查节点获取当前性能裕量;The second obtaining module is used to obtain the current performance margin through the current check node when running to each check node;
确定模块,用于根据当前性能裕量,确定优化策略;The determining module is used to determine the optimization strategy according to the current performance margin;
优化模块,用于根据优化策略对当前的检查节点之后的计算节点进行优化。The optimization module is used to optimize the computing nodes after the current check node according to the optimization strategy.
第三方面,一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现第一方面任一实施例所述的计算图的优化方法。In a third aspect, a computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the calculation graph optimization method described in any embodiment of the first aspect when the processor executes the computer program.
第四方面,一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现第一方面任一实施例所述的计算图的优化方法。In a fourth aspect, a computer-readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the method for optimizing the calculation graph described in any one of the embodiments of the first aspect.
本申请提供的一种计算图的优化方法、装置、计算机设备和存储介质，通过获取包括多个计算节点的计算网络模型的计算图，再在计算图中插入至少一个检查节点，并当运行至每一个检查节点时，通过当前的检查节点获取当前性能裕量，然后根据当前性能裕量，确定优化策略，以及根据该优化策略对当前的检查节点之后的计算节点所需消耗资源进行优化。上述优化方法通过插入检查节点获取计算机设备在运行到各检查节点时的当前性能裕量，然后根据当前性能裕量，选择符合计算机设备实际运行情况的优化策略对检查节点之后的计算节点所需消耗资源进行优化，使计算机设备运行至上述计算节点的过程中，可以动态调整计算图中各计算节点计算时的资源使用情况，以满足用户针对该计算图提出的性能指标需求，以及提高计算机设备上的资源利用率。The method, apparatus, computer device, and storage medium for optimizing a calculation graph provided in this application obtain a calculation graph of a calculation network model including multiple computing nodes, insert at least one check node in the calculation graph, obtain the current performance margin through the current check node when running to each check node, then determine an optimization strategy according to the current performance margin, and optimize the resources consumed by the computing nodes after the current check node according to the optimization strategy. The above optimization method obtains, by inserting check nodes, the current performance margin of the computer device when it runs to each check node, and then, according to the current performance margin, selects an optimization strategy matching the actual operating condition of the computer device to optimize the resources consumed by the computing nodes after the check node, so that while the computer device runs to the above computing nodes, the resource usage of each computing node in the calculation graph during calculation can be dynamically adjusted, to meet the performance index requirements raised by the user for the calculation graph and to improve the resource utilization of the computer device.
上述说明仅是本发明技术方案的概述,为了能够更清楚了解本发明的技术手段,而可依照说明书的内容予以实施,并且为了让本发明的上述和其它目的、特征和优点能够更明显易懂,以下特举本发明的具体实施方式。The above description is only an overview of the technical solution of the present invention. In order to understand the technical means of the present invention more clearly, it can be implemented in accordance with the content of the specification, and in order to make the above and other objectives, features and advantages of the present invention more obvious and understandable. In the following, specific embodiments of the present invention are specifically cited.
附图说明Description of the drawings
为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍，显而易见地，下面描述中的附图是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly describe the technical solutions in the embodiments of the present invention or the prior art, the following briefly introduces the drawings needed in the description of the embodiments or the prior art. Obviously, the drawings in the following description show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
图1为一个实施例提供的一种计算机设备的内部结构示意图;FIG. 1 is a schematic diagram of the internal structure of a computer device provided by an embodiment;
图2为一个实施例提供的一种计算图的优化方法的流程图;FIG. 2 is a flowchart of a method for optimizing a calculation graph according to an embodiment;
图2A为一个实施例提供的一种计算图的优化方法的流程图;2A is a flowchart of a method for optimizing a calculation graph provided by an embodiment;
图3为图2实施例中S103的一种实现方式的流程图;FIG. 3 is a flowchart of an implementation manner of S103 in the embodiment of FIG. 2;
图4为图3实施例中S201的一种实现方式的流程图;FIG. 4 is a flowchart of an implementation manner of S201 in the embodiment of FIG. 3;
图5为图2实施例中S102的一种实现方式的流程图;FIG. 5 is a flowchart of an implementation manner of S102 in the embodiment of FIG. 2;
图6为图5实施例中S401的一种实现方式的流程图;FIG. 6 is a flowchart of an implementation manner of S401 in the embodiment of FIG. 5;
图7为图5实施例中S401的另一种实现方式的流程图;FIG. 7 is a flowchart of another implementation manner of S401 in the embodiment of FIG. 5;
图8为图5实施例中S402的另一种实现方式的流程图;FIG. 8 is a flowchart of another implementation manner of S402 in the embodiment of FIG. 5;
图8A为一个实施例提供的一种计算图的结构示意图;FIG. 8A is a schematic structural diagram of a calculation graph provided by an embodiment;
图8B为一个实施例提供的一种计算图的结构示意图;FIG. 8B is a schematic structural diagram of a calculation graph provided by an embodiment;
图9为图2实施例中S102的另一种实现方式的流程图;FIG. 9 is a flowchart of another implementation manner of S102 in the embodiment of FIG. 2;
图9A为一个实施例提供的一种计算图的结构示意图;FIG. 9A is a schematic structural diagram of a calculation graph provided by an embodiment;
图10为图2实施例中S101的一种实现方式的流程图;Fig. 10 is a flowchart of an implementation manner of S101 in the embodiment of Fig. 2;
图11为一个实施例提供的一种计算图的优化方法的流程图;FIG. 11 is a flowchart of a method for optimizing a calculation graph provided by an embodiment;
图12为一个实施例提供的一种计算图的优化装置的结构示意图;FIG. 12 is a schematic structural diagram of an optimization device for a calculation graph provided by an embodiment; FIG.
图13示意性地示出了用于执行根据本发明的方法的计算处理设备的框图;Fig. 13 schematically shows a block diagram of a computing processing device for executing the method according to the present invention;
图14示意性地示出了用于保持或者携带实现根据本发明的方法的程序代码的存储单元。Fig. 14 schematically shows a storage unit for holding or carrying program codes for implementing the method according to the present invention.
具体实施例Specific embodiment
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处描述的具体实施例仅用以解释本申请,并不用于限定本申请。显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。In order to make the purpose, technical solutions, and advantages of this application clearer and clearer, the following further describes the application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application. Obviously, the described embodiments are part of the embodiments of the present invention, rather than all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
本申请提供的计算图的优化方法，可以应用于如图1所示的计算机设备中，该计算机设备可以是服务器，也可以是终端，其内部结构图可以如图1所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口、显示屏和输入装置。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种计算图的优化方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏，该计算机设备的输入装置可以是显示屏上覆盖的触摸层，也可以是计算机设备外壳上设置的按键、轨迹球或触控板，还可以是外接的键盘、触控板或鼠标等。The method for optimizing a calculation graph provided in this application can be applied to the computer device shown in FIG. 1. The computer device may be a server or a terminal, and its internal structure diagram may be as shown in FIG. 1. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for optimizing a calculation graph. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device can be a touch layer covering the display screen, a button, a trackball, or a touchpad set on the housing of the computer device, or an external keyboard, touchpad, or mouse.
本领域技术人员可以理解，图1中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。Those skilled in the art can understand that the structure shown in FIG. 1 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. The specific computer device may include more or fewer components than shown in the figure, or combine some components, or have a different arrangement of components.
下面将通过实施例并结合附图具体地对本申请的技术方案以及本申请的技术方案如何解决上述技术问题进行详细说明。下面这几个具体的实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例中不再赘述。Hereinafter, the technical solution of the present application and how the technical solution of the present application solves the above-mentioned technical problems will be described in detail through the embodiments and the accompanying drawings. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
图2为一个实施例提供的一种计算图的优化方法的流程图，该方法的执行主体为图1中的计算机设备，该方法涉及的是计算机设备在运行计算网络模型的计算图时对该计算图进行优化的具体过程。如图2所示，该方法具体包括以下步骤：Fig. 2 is a flowchart of a method for optimizing a calculation graph provided by an embodiment. The execution body of the method is the computer device in Fig. 1, and the method relates to the specific process in which the computer device optimizes the calculation graph while running the calculation graph of the calculation network model. As shown in Fig. 2, the method specifically includes the following steps:
S101、获取计算网络模型的计算图;计算图中包括多个计算节点。S101. Obtain a calculation graph of a calculation network model; the calculation graph includes a plurality of calculation nodes.
其中,计算网络模型可以由计算机设备预先根据实际应用需求构建,其具体可以是具备各种功能应用的计算网络模型,例如,神经网络模型、机器学习网络模型、智能算法网络模型等。计算图是一种描述计算方法的“语言”,具体由多个计算节点组成,多个存在依赖关系的计算节点之间相互连接。计算节点可以包括执行某种计算功能的代码,用于使计算机设备运行至计算节点时,可以执行计算网络模型中对应的计算任务。Among them, the computing network model may be constructed by computer equipment in advance according to actual application requirements, and it may specifically be a computing network model with various functional applications, such as a neural network model, a machine learning network model, and an intelligent algorithm network model. Computational graphs are a kind of "language" for describing calculation methods, which are specifically composed of multiple computing nodes, and multiple computing nodes with dependencies are connected to each other. The computing node may include code for executing a certain computing function, so that when the computer device runs to the computing node, it can execute the corresponding computing task in the computing network model.
本实施例中，计算机设备可以通过编译器编译预设的计算网络模型，生成编译后的计算图。可选地，计算机设备也可以通过其它方法直接获取到经过编译后的计算网络模型的计算图，对此本实施例不做限定。可选地，当编译器编译计算网络模型之前，计算机设备还可以预先根据实际应用需求构建该计算网络模型，然后再基于构建好的计算网络模型进行编译，以便之后运行时使用。可选地，计算机设备也可以直接获取预编译的计算网络模型，再基于获取到的计算网络模型进行编译，以便之后运行时使用，对此本实施例不做限定。In this embodiment, the computer device can compile a preset calculation network model through a compiler to generate a compiled calculation graph. Optionally, the computer device may also directly obtain the calculation graph of the compiled calculation network model through other methods, which is not limited in this embodiment. Optionally, before the compiler compiles the calculation network model, the computer device may also construct the calculation network model in advance according to actual application requirements, and then compile the constructed calculation network model for later use at runtime. Optionally, the computer device may also directly obtain a pre-compiled calculation network model, and then compile based on the obtained calculation network model for later use at runtime, which is not limited in this embodiment.
S102、在计算图中插入至少一个检查节点。S102. Insert at least one check node in the calculation graph.
其中，检查节点可以包括执行某种计算或测试功能的代码，用于使计算机设备运行至检查节点时，可以执行相应的计算或测试任务，该检查节点可以由计算设备预先配置。本实施例中，当计算机设备获取到计算网络模型的计算图，以及需要在之后运行该计算图的过程中优化该计算图时，可以进一步的在该计算图中插入至少一个检查节点，以使计算图被运行至插入的检查节点时，计算机设备可以检测计算机设备在当前时刻的资源消耗情况，从而根据当前资源消耗情况，动态调整之后计算节点的资源利用方式，使计算图在被执行的过程中，消耗的资源情况始终能够满足用户提出的性能指标或达到最优，充分利用了计算机设备上的资源。A check node may include code for executing a certain calculation or test function, so that when the computer device runs to the check node, the corresponding calculation or test task can be executed; the check node may be pre-configured by the computing device. In this embodiment, when the computer device obtains the calculation graph of the calculation network model and needs to optimize the calculation graph while running it later, at least one check node may be further inserted into the calculation graph, so that when the calculation graph is run to an inserted check node, the computer device can detect its resource consumption at the current moment, and thus dynamically adjust the resource utilization mode of the subsequent computing nodes according to the current resource consumption, so that throughout the execution of the calculation graph the consumed resources can always meet the performance indicators proposed by the user or reach the optimum, making full use of the resources on the computer device.
S103、当运行至每一个检查节点时,通过当前的检查节点获取当前性能裕量。S103: When running to each check node, obtain the current performance margin through the current check node.
其中，当前性能裕量表示计算设备运行至当前的检查节点时实际消耗的计算设备资源与用户期望性能指标所示的计算设备资源之间的裕量，该当前性能裕量可以是表示延时性能指标的性能裕量，也可以是表示存储性能指标的性能裕量，或是表示计算机设备计算时所消耗的其它类型性能指标的性能裕量。具体的，上述表示延时性能指标的性能裕量指计算设备运行至当前的检查节点时实际消耗的计算消耗时长和用户期望的计算消耗时长之间的裕量；上述表示存储指标的性能裕量指计算设备运行至当前的检查节点时实际消耗的内存大小和用户期望消耗的内存大小之间的裕量。在实际应用中，若实际消耗的计算设备资源大于或等于用户期望性能指标所示的计算设备资源时，表示当前性能裕量不充足，若实际消耗的计算设备资源小于用户期望性能指标所示的计算设备资源时，表示当前性能裕量充足。本实施例中，当计算机设备运行至每一个当前的检查节点时，计算机设备可以通过执行该检查节点上的代码获取到计算设备的当前性能裕量，以便之后计算机设备根据该当前性能裕量，对检查点之后的计算节点进行不同方法的优化，从而使优化后的计算节点在被执行时可以充分利用计算设备的资源。The current performance margin represents the margin between the computing device resources actually consumed when the computing device runs to the current check node and the computing device resources indicated by the performance index expected by the user. The current performance margin may be a performance margin representing a delay performance index, a performance margin representing a storage performance index, or a performance margin representing another type of performance index consumed by the computer device during calculation. Specifically, the performance margin representing the delay performance index refers to the margin between the calculation time actually consumed when the computing device runs to the current check node and the calculation time expected by the user; the performance margin representing the storage index refers to the margin between the memory actually consumed when the computing device runs to the current check node and the memory the user expects to consume. In practical applications, if the actually consumed computing device resources are greater than or equal to the computing device resources indicated by the user's expected performance index, the current performance margin is insufficient; if the actually consumed computing device resources are less than the computing device resources indicated by the user's expected performance index, the current performance margin is sufficient. In this embodiment, when the computer device runs to each current check node, the computer device can obtain the current performance margin of the computing device by executing the code on the check node, so that the computer device can then, according to the current performance margin, optimize the computing nodes after the checkpoint in different ways, so that the optimized computing nodes can make full use of the resources of the computing device when they are executed.
S104、根据当前性能裕量,确定优化策略。S104: Determine an optimization strategy according to the current performance margin.
其中，优化策略用于对检查节点之后的计算节点所需消耗的资源进行优化，使之后的计算节点在被执行时所消耗的资源能够满足用户需求或与计算设备的性能指标匹配。本实施例中，当计算机设备通过当前的检查节点获取当前性能裕量时，可以通过判断当前性能裕量是否充足，从而根据判断结果选择不同的优化策略，以动态优化计算图中检查节点之后的计算节点。例如，若表示延时性能指标的当前性能裕量充足，则可以采用减少内存的优化策略进行编译和运行，以降低计算机设备存储时的性能消耗，使计算机设备各方面的性能指标均能够满足用户需求；若表示延时性能指标的当前延时性能裕量不充足，则可以采用减少访问延时高的访存操作的优化策略进行编译和运行，以降低计算机设备计算时的延时性能消耗，使计算机设备各方面的性能指标均能够满足用户需求。相应的，若表示存储性能指标的当前性能裕量充足，则可以采用减少访问延时高的访存操作的优化策略进行编译和运行，以降低计算机设备计算时的延时性能消耗；若表示存储性能指标的当前性能裕量不充足，则可以采用减少内存的优化策略进行编译和运行，以降低计算机设备存储时的性能消耗。The optimization strategy is used to optimize the resources consumed by the computing nodes after the check node, so that the resources consumed by the subsequent computing nodes when executed can meet user needs or match the performance indicators of the computing device. In this embodiment, when the computer device obtains the current performance margin through the current check node, it can judge whether the current performance margin is sufficient and select different optimization strategies according to the judgment result, so as to dynamically optimize the computing nodes after the check node in the calculation graph. For example, if the current performance margin representing the delay performance index is sufficient, an optimization strategy that reduces memory can be used for compiling and running, so as to reduce the performance consumption of the computer device during storage, so that the performance indicators of all aspects of the computer device can meet user needs; if the current delay performance margin representing the delay performance index is insufficient, an optimization strategy that reduces memory access operations with high access latency can be used for compiling and running, so as to reduce the delay performance consumption of the computer device during calculation, so that the performance indicators of all aspects of the computer device can meet user needs. Correspondingly, if the current performance margin representing the storage performance index is sufficient, an optimization strategy that reduces memory access operations with high access latency can be used for compiling and running, so as to reduce the delay performance consumption of the computer device during calculation; if the current performance margin representing the storage performance index is insufficient, an optimization strategy that reduces memory can be used for compiling and running, so as to reduce the performance consumption of the computer device during storage.
S105、根据优化策略对当前的检查节点之后的计算节点进行优化。S105: Optimize computing nodes after the current check node according to the optimization strategy.
本实施例中，当计算机设备根据当前性能裕量确定优化策略后，即可根据该优化策略对当前的检查节点之后的计算节点上的参数或变量进行优化，例如，计算机设备可以改变计算节点上参数或变量的存储方式，从而改变该计算节点读取或写入数据时的时间长度，进而改变计算机设备运行至该计算节点时的计算时间，以改善计算设备的延时性能，完成对该计算节点的优化。又例如，计算机设备还可以拆分计算节点，使一个计算节点所消耗的资源分成多个计算节点所消耗的资源，以减轻计算设备运行至各计算节点时的资源消耗负担，完成对该计算节点的优化。In this embodiment, after the computer device determines the optimization strategy according to the current performance margin, the parameters or variables on the computing nodes after the current check node can be optimized according to the optimization strategy. For example, the computer device can change the storage mode of the parameters or variables on a computing node, thereby changing the length of time the computing node takes to read or write data, and in turn changing the computing time when the computing device runs to the computing node, so as to improve the delay performance of the computing device and complete the optimization of the computing node. For another example, the computer device can also split a computing node, so that the resources consumed by one computing node are divided among multiple computing nodes, so as to reduce the resource consumption burden when the computing device runs to each computing node and complete the optimization of the computing node.
本实施例提供的计算图的优化方法，通过获取包括多个计算节点的计算网络模型的计算图，再在计算图中插入至少一个检查节点，并当运行至每一个检查节点时，通过当前的检查节点获取当前性能裕量，然后根据当前性能裕量，确定优化策略，以及根据该优化策略对当前的检查节点之后的计算节点所需消耗资源进行优化。上述优化方法通过插入检查节点获取计算机设备在运行到各检查节点时的当前性能裕量，然后根据当前性能裕量，选择符合计算机设备实际运行情况的优化策略对检查节点之后的计算节点所需消耗资源进行优化，使计算机设备运行至上述计算节点的过程中，可以动态调整计算图中各计算节点计算时的资源使用情况，以满足用户针对该计算图提出的性能指标需求，以及提高计算机设备上的资源利用率。The method for optimizing a calculation graph provided in this embodiment obtains a calculation graph of a calculation network model including multiple computing nodes, inserts at least one check node in the calculation graph, obtains the current performance margin through the current check node when running to each check node, then determines an optimization strategy according to the current performance margin, and optimizes the resources consumed by the computing nodes after the current check node according to the optimization strategy. The above optimization method obtains, by inserting check nodes, the current performance margin of the computer device when it runs to each check node, and then, according to the current performance margin, selects an optimization strategy matching the actual operating condition of the computer device to optimize the resources consumed by the computing nodes after the check node, so that while the computer device runs to the above computing nodes, the resource usage of each computing node in the calculation graph during calculation can be dynamically adjusted, to meet the performance index requirements raised by the user for the calculation graph and to improve the resource utilization of the computer device.
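Putting steps S101–S105 together, the runtime loop might look like the following sketch. The callbacks, the fixed check-node spacing, and the binary sufficient/insufficient decision are all illustrative assumptions, not the claimed implementation:

```python
def run_with_check_nodes(nodes, check_every, get_margin, optimize):
    """Execute the graph's computing nodes in order; at each inserted
    check node, read the current performance margin (S103), pick a
    strategy (S104), and re-optimize the remaining nodes (S105)."""
    applied = []
    for i, node in enumerate(nodes):
        node()  # S101/S102 assumed done: graph obtained, check nodes inserted
        is_check_node = (i + 1) % check_every == 0 and i + 1 < len(nodes)
        if is_check_node:
            margin = get_margin()  # S103: query the current performance margin
            strategy = "delay_optimization" if margin > 0 else "storage_optimization"
            optimize(nodes[i + 1:], strategy)  # S105: optimize remaining nodes
            applied.append(strategy)
    return applied
```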
在一个实施例中，上述当前性能裕量包括表示延时性能指标的性能裕量，即当前延时性能裕量，在该应用场景下，本申请提供了上述S104的一种实现方式，该方法包括：若当前延时性能裕量充足，则将存储优化策略确定为优化策略；存储优化策略用于减少计算节点在计算时占用的内存。In an embodiment, the foregoing current performance margin includes a performance margin representing a delay performance index, that is, the current delay performance margin. In this application scenario, this application provides an implementation manner of the foregoing S104, and the method includes: if the current delay performance margin is sufficient, determining the storage optimization strategy as the optimization strategy; the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation.
本实施例涉及计算机设备获取到当前延时性能裕量充足的应用场景，说明计算机设备此时所需的计算消耗时长比较充裕，还能够满足后期计算节点的计算需求，在该种应用下，计算机设备可以不关注计算节点在计算时的计算消耗时长问题，可以重点关注计算节点在计算时占用的内存情况，以优化计算设备上的内存资源，避免计算机设备内存被占用过大，从而影响计算机设备的计算性能，进而影响计算图被执行的计算速度。This embodiment relates to an application scenario in which the computer device obtains a sufficient current delay performance margin, indicating that the calculation time available to the computer device at this time is relatively ample and can also meet the calculation requirements of later computing nodes. In this application, the computer device does not need to pay attention to the calculation time consumed by the computing nodes during calculation, and can instead focus on the memory occupied by the computing nodes during calculation, so as to optimize the memory resources on the computing device and prevent the memory of the computer device from being excessively occupied, which would affect the computing performance of the computer device and in turn the calculation speed at which the calculation graph is executed.
可选地,上述存储优化策略具体可以包括:将检查节点之后的计算节点在计算时产生的数据存储至高访问延迟的存储空间;高访问延迟的存储空间至少包括全局内存和片外存储器。Optionally, the aforementioned storage optimization strategy may specifically include: storing data generated by the computing node after the check node during calculation in a storage space with high access latency; the storage space with high access latency includes at least global memory and off-chip memory.
上述计算节点在计算时产生的数据可以包括计算中所需的中间结果和临时变量。当计算机设备根据存储优化策略对检查节点之后的计算节点进行优化时，具体可以将检查节点之后的计算节点在计算时产生的数据存储至高访问延迟的存储空间，例如，GPU的全局内存或TPU的片外存储等，以减少计算机设备内存的占用率，从而提高计算机设备的计算速度。The data generated by the above computing nodes during calculation may include intermediate results and temporary variables required in the calculation. When the computer device optimizes the computing nodes after the check node according to the storage optimization strategy, it can specifically store the data generated during calculation by the computing nodes after the check node in a storage space with high access latency, for example, the global memory of a GPU or the off-chip storage of a TPU, so as to reduce the memory occupancy of the computer device and thereby increase the computing speed of the computer device.
可选地,基于上述实施例,若当前延时性能裕量不充足,则将延时优化策略确定为优化策略;延时优化策略用于减少计算节点在计算时的计算消耗时长。Optionally, based on the foregoing embodiment, if the current delay performance margin is insufficient, the delay optimization strategy is determined as the optimization strategy; the delay optimization strategy is used to reduce the computational time spent by the computing node during calculation.
本实施例涉及计算机设备获取到当前延时性能裕量不充足的应用场景，说明计算机设备此时所需的计算消耗时长比较紧张，可能不能够满足后期计算节点的计算需求，在该种应用下，计算机设备需要重点关注计算节点在计算时的计算消耗时长问题，可以不关注计算节点在计算时占用的内存问题，以优化计算设备的延时性能，避免计算节点在计算时消耗的时长过长，从而影响计算图被执行的计算速度。This embodiment relates to an application scenario in which the computer device obtains an insufficient current delay performance margin, indicating that the calculation time available to the computer device at this time is relatively tight and may not be able to meet the calculation requirements of later computing nodes. In this application, the computer device needs to focus on the calculation time consumed by the computing nodes during calculation, and may not pay attention to the memory occupied by the computing nodes during calculation, so as to optimize the delay performance of the computing device and prevent the computing nodes from consuming too much time during calculation, which would affect the calculation speed at which the calculation graph is executed.
可选地,上述延时优化策略具体可以包括:将检查节点之后的计算节点在计算时产生的数据存储至低访问延迟的存储空间;低访问延迟的存储空间至少包括缓存空间和片内存储。Optionally, the above-mentioned delay optimization strategy may specifically include: storing data generated by the computing node after the check node during calculation in a low-access-latency storage space; the low-access-latency storage space includes at least cache space and on-chip storage.
本实施例中,当计算机设备根据延时优化策略对检查节点之后的计算节点进行优化时,具体可以将检查节点之后的计算节点在计算时产生的数据存储至低访问延迟的存储空间,例如,计算机设备的内存或缓存等,以减少计算节点在计算时访问存储空间的时间,从而提高计算节点的计算速度,进而提高计算机设备的计算速度。In this embodiment, when the computer device optimizes the computing node after the check node according to the delay optimization strategy, it can specifically store the data generated by the computing node after the check node in the storage space with low access latency, for example, The memory or cache of the computer equipment, etc., to reduce the time for the computing node to access the storage space during calculation, thereby increasing the computing speed of the computing node, and then increasing the computing speed of the computer equipment.
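The mapping between the two strategies and the two classes of storage space can be sketched as follows. This is a minimal illustration only: the tier names and the `tier_for_strategy` helper are hypothetical, not part of the patent.

```python
# Hypothetical storage tiers, ordered from low to high access latency;
# the names are illustrative, not from the patent.
LOW_LATENCY_TIERS = ("cache", "on_chip")
HIGH_LATENCY_TIERS = ("gpu_global_memory", "tpu_off_chip")

def tier_for_strategy(strategy: str) -> str:
    """Map an optimization strategy to the storage tier that receives the
    intermediate results and temporary variables of the computing nodes
    after the check node."""
    if strategy == "delay":
        # Delay optimization: keep data close to cut access time.
        return LOW_LATENCY_TIERS[0]
    if strategy == "storage":
        # Storage optimization: push data off-chip to free device memory.
        return HIGH_LATENCY_TIERS[0]
    raise ValueError(f"unknown strategy: {strategy!r}")
```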
In one embodiment, the above current performance margin includes a performance margin representing a storage performance indicator, i.e., a current storage performance margin. For this application scenario, the present application provides an implementation of the above S104, and the method includes: if the current storage performance margin is sufficient, determining the delay optimization strategy as the optimization strategy; the delay optimization strategy is used to reduce the time consumed by the computing nodes during computation.
This embodiment relates to an application scenario in which the computer device determines that the current storage performance margin is sufficient, indicating that the memory resources of the computer device are relatively abundant and can satisfy the computation requirements of subsequent computing nodes. In this scenario, the computer device needs to focus on the computation time consumed by the computing nodes and may disregard the memory they occupy, so as to optimize the delay performance of the device and prevent the computing nodes from consuming too much time during computation, which would reduce the speed at which the computation graph is executed.
Optionally, based on the foregoing embodiment, if the current storage performance margin is insufficient, the storage optimization strategy is determined as the optimization strategy; the storage optimization strategy is used to reduce the memory occupied by the computing nodes during computation.
This embodiment relates to an application scenario in which the computer device determines that the current storage performance margin is insufficient, indicating that the memory resources of the computer device are tight and may not satisfy the computation requirements of subsequent computing nodes. In this scenario, the computer device needs to focus on the memory occupied by the computing nodes during computation and may disregard the time they consume, so as to optimize the storage performance of the device and prevent the computing nodes from occupying too much memory during computation, which would reduce the speed at which the computation graph is executed.
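The four cases described across these embodiments form a symmetric decision rule, which can be sketched as follows (the function name and string labels are illustrative assumptions):

```python
def choose_strategy(margin_kind: str, sufficient: bool) -> str:
    """Strategy selection as described across the embodiments:

    - delay margin sufficient     -> storage optimization (time to spare,
                                     so trade speed for memory)
    - delay margin insufficient   -> delay optimization
    - storage margin sufficient   -> delay optimization (memory to spare,
                                     so trade memory for speed)
    - storage margin insufficient -> storage optimization
    """
    if margin_kind == "delay":
        return "storage" if sufficient else "delay"
    if margin_kind == "storage":
        return "delay" if sufficient else "storage"
    raise ValueError(f"unknown margin kind: {margin_kind!r}")
```

In both cases the device economizes the resource whose margin is scarce and spends the one whose margin is ample.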
Optionally, in practical applications, based on the above delay optimization strategy, as shown in FIG. 2A, the delay optimization strategy may further include:
S1041. Obtain the size of the data generated during computation by the computing nodes after the current check node.
This embodiment applies to the scenario in which the memory or cache on the computer device cannot satisfy the memory or cache required by the computing nodes for computation. In this scenario, when the optimization strategy determined by the computer device is the delay optimization strategy, the device may first obtain the size of the data generated during computation by the computing nodes after the current check node, so as to subsequently estimate, based on that size, whether the memory or cache on the computer device satisfies the computation requirements.
S1042. Compare the size of the data generated by the computing nodes during computation with the size of a preset storage space; if the size of the data exceeds the size of the preset storage space, perform step S1043; if it does not, perform step S1044.
In this embodiment, after the computer device obtains the size of the data generated by the computing nodes during computation, it may further compare that size with the size of the preset storage space to obtain a comparison result, and then select different delay optimization strategies according to the comparison result to optimize the computing nodes after the current check node. The preset storage space may be a low-access-latency storage space, for example, the memory and/or cache space of the computer device. The comparison result is one of two cases: the size of the data generated during computation exceeds the size of the preset storage space, indicating that the existing storage space on the computer device cannot satisfy the computation requirements of the computing nodes; or the size of the data does not exceed the size of the preset storage space, indicating that the existing storage space on the computer device is relatively ample and can satisfy the computation requirements of the computing nodes.
S1043. Split the computing nodes after the current check node, and store the data generated during computation by the split computing nodes in a low-access-latency storage space.
This embodiment relates to the scenario in which the comparison result is that the size of the data generated by the computing nodes during computation exceeds the size of the preset storage space. In this scenario, the computer device may split the computing nodes after the check node and store the data generated during computation by the split computing nodes in a low-access-latency storage space, i.e., in the memory and/or cache of the computer device. Because the computing nodes have been split, the existing storage space on the computer device can satisfy the preset storage space required by each split computing node during computation.
S1044. Store the data generated during computation by the computing nodes after the current check node in a low-access-latency storage space.
This embodiment relates to the scenario in which the comparison result is that the size of the data generated by the computing nodes during computation does not exceed the size of the preset storage space. In this scenario, the computer device may directly store the data generated during computation by the computing nodes after the check node in a low-access-latency storage space. This step is the same as the step described above for the delay optimization strategy; for details, refer to the foregoing description, which is not repeated here.
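Steps S1041 to S1044 can be sketched as a small placement planner. The even-split rule used when the data does not fit is an illustrative assumption; the patent only requires that each split node fit the preset storage space.

```python
def plan_low_latency_placement(data_size: int, preset_space: int) -> list:
    """Return the chunk sizes to place in low-access-latency storage.

    If the node's intermediate data fits the preset space, place it as one
    chunk (S1044); otherwise split the node into pieces no larger than the
    preset space (S1043). Splitting as evenly as possible is an assumption.
    """
    if data_size <= preset_space:
        return [data_size]
    # Number of pieces needed so that each piece fits the preset space.
    n = -(-data_size // preset_space)  # ceiling division
    base, rem = divmod(data_size, n)
    return [base + (1 if i < rem else 0) for i in range(n)]
```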
FIG. 3 is a flowchart of an implementation of S103 in the embodiment of FIG. 2. As shown in FIG. 3, "obtaining the current performance margin through the current check node" in S103 includes:
S201. Obtain a first total target computation time and a total actual computation time of all computing nodes before the current check node.
Here, the first total target computation time represents the cumulative computation time the user expects all computing nodes before the current check node to consume during computation. The total actual computation time represents the actual cumulative computation time consumed by all computing nodes before the current check node when the computer device has run to that check node. When the computer device needs to obtain the current performance margin, it may first obtain the first total target computation time and the total actual computation time of all computing nodes before the current check node, so as to subsequently determine the current performance margin from them.
S202. Determine the current performance margin according to the first total target computation time and the total actual computation time.
After the computer device obtains the first total target computation time and the total actual computation time, it may directly compute the difference between them to obtain the current performance margin; optionally, the first total target computation time and the total actual computation time may also each be weighted before the difference is computed to obtain the current performance margin.
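S202 can be sketched in a few lines; the weight parameters model the optional weighted variant, with weights of 1.0 reducing to the plain difference described first.

```python
def current_margin(target_total: float, actual_total: float,
                   w_target: float = 1.0, w_actual: float = 1.0) -> float:
    """S202 sketch: margin = (weighted) first total target computation time
    minus (weighted) total actual computation time. A positive margin means
    execution is ahead of the user's target."""
    return w_target * target_total - w_actual * actual_total
```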
Optionally, as shown in FIG. 4, a method of "obtaining the first total target computation time of all computing nodes before the current check node" in S201 may specifically include:
S301. Obtain a second total target computation time of all computing nodes on the path where the current check node is located.
Here, the second total target computation time represents the cumulative computation time the user expects all computing nodes on the path where the current check node is located to consume during computation. When the computer device needs to obtain the first total target computation time of all computing nodes before the current check node, it may first obtain the second total target computation time of all computing nodes on the path where the current check node is located, and then determine the first total target computation time from it.
S302. Determine the first total target computation time according to the second total target computation time and a preset ratio; the preset ratio is the proportion of the total computation time of all computing nodes before the current check node to the total computation time of all computing nodes on the path where the check node is located.
Here, the preset ratio may be obtained in advance by the computer device, and may be obtained in various ways. For example, the computer device may pre-compute the computation amount of each computing node in the computation graph, estimate the computation time of each computing node based on its computation amount, and finally determine the preset ratio from the total computation time of the computing nodes before the current check node and the total computation time of all computing nodes. As another example, the computer device may use an existing computation-time estimation model to estimate the computation time of each computing node, and correspondingly determine the preset ratio from the total computation time of the computing nodes before the current check node and the total computation time of all computing nodes. Optionally, the preset ratio may also be determined in other ways, which is not limited in this embodiment. In this embodiment, after the computer device obtains the preset computation time of all computing nodes on the path where the current check node is located, i.e., the second total target computation time, and the preset ratio, it may multiply the second total target computation time by the preset ratio to obtain the first total target computation time.
For example, if the second total target computation time is 10 hours and the preset ratio is 1/2, the corresponding first total target computation time is 5 hours. Optionally, the computer device may also weight the second total target computation time and the preset ratio before multiplying them to obtain the first total target computation time.
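The S302 scaling, including the 10-hour example above, can be sketched as:

```python
from fractions import Fraction

def first_total_target(second_total: float, preset_ratio: Fraction) -> float:
    """S302: scale the whole-path target time (second total target) by the
    preset ratio to obtain the target time for the computing nodes before
    the check node (first total target)."""
    return second_total * float(preset_ratio)
```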
FIG. 5 is a flowchart of an implementation of S102 in the embodiment of FIG. 2. As shown in FIG. 5, "inserting at least one check node into the computation graph" in S102 includes:
S401. Obtain the computation time ratio of the computing nodes on the longest path in the computation graph.
In this embodiment, the computer device may first determine the longest path according to the layout of the computing nodes in the computation graph, then obtain the computation time of each computing node on that longest path, and then compute the ratio of those computation times to obtain the computation time ratio of the computing nodes on the longest path.
S402. Determine the insertion position of at least one check node on the longest path according to the computation time ratio.
In this embodiment, in order to balance the computation time of the computing nodes on the longest path, when inserting a check node the computer device may determine the insertion position of at least one check node on the longest path according to the computation time ratio, such that the total computation time of all computing nodes before the check node inserted at that position equals the total computation time of all computing nodes after it. Of course, the two totals need not be exactly equal; it suffices that the difference between the total computation time of all computing nodes before the check node and that of all computing nodes after it falls within a preset time range.
S403. Insert at least one check node at the insertion position of the at least one check node.
Once the computer device has determined the insertion position of at least one check node on the longest path, it can insert at least one check node at that position, so that the computing nodes after the check node can subsequently be optimized through it.
Optionally, based on the foregoing embodiment, the present application provides a specific way for the computer device to obtain the computation time ratio. As shown in FIG. 6, a method for the above S401, "obtaining the computation time ratio of the computing nodes on the longest path in the computation graph", may specifically include:
S501. Obtain the computation amount of each computing node on the longest path.
When the computer device obtains the corresponding computation graph by compiling the computational network model, it can obtain the computation amount of each computing node in the computation graph according to information such as the computation steps contained in each computing node. Therefore, in this embodiment, the computer device may first determine the longest path in the computation graph, then determine the computing nodes contained on that path, and then obtain the computation amount of each computing node on the longest path according to information such as the computation steps each of them contains.
S502. Obtain the computation time of each computing node on the longest path according to the computation amount of each computing node.
After the computer device obtains the computation amount of each computing node on the longest path, it may further estimate the computation time of each computing node according to the magnitude of its computation amount, thereby obtaining the computation time of each computing node on the longest path. The larger the computation amount of a computing node, the longer its estimated computation time; the smaller the computation amount, the shorter its estimated computation time.
S503. Determine the computation time ratio of the computing nodes on the longest path according to the computation time of each computing node on the longest path.
After the computer device obtains the computation time of each computing node on the longest path, it can compute the ratio of those computation times to obtain the computation time ratio of the computing nodes on the longest path.
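Reducing per-node times to a ratio (S503) can be sketched as follows, assuming the times are expressed in whole time units:

```python
from functools import reduce
from math import gcd

def computation_time_ratio(durations: list) -> list:
    """S503: reduce per-node computation times (in whole time units) to
    their smallest integer ratio, e.g. [2, 10, 12] hours -> 1:5:6."""
    g = reduce(gcd, durations)
    return [d // g for d in durations]
```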
Optionally, the present application provides another specific way for the computer device to obtain the computation time ratio. As shown in FIG. 7, another method for the above S401, "obtaining the computation time ratio of the computing nodes on the longest path in the computation graph", may specifically include:
S601. Construct a computation-time estimation model.
When the computer device needs to obtain the computation time ratio of the computing nodes on the longest path, it may first construct a computation-time estimation model for analyzing information such as the computation steps contained in each computing node to estimate its computation amount. It should be noted that the above computation-time estimation model may be a pre-trained estimation model, which belongs to the prior art and is not described in detail here.
S602. Use the computation-time estimation model to obtain the computation time of each computing node on the longest path.
Once the computer device has constructed the computation-time estimation model, it can use the model to estimate the computation time of each computing node on the longest path by analyzing information such as the computation steps of each computing node on that path.
S603. Determine the computation time ratio of the computing nodes on the longest path according to the computation time of each computing node on the longest path.
Step S603 of this embodiment has the same content as step S503 above; for details, refer to the foregoing description, which is not repeated here.
In one embodiment, a specific implementation of the above step S402 is also provided. As shown in FIG. 8, the above S402, "determining the insertion position of at least one check node on the longest path according to the computation time ratio", includes:
S701. Divide the longest path evenly into a preset number of sub-paths according to the computation time ratio.
Here, the preset number represents the number of check nodes to be pre-inserted on the longest path, and may be determined in advance by the computer device according to the length of the longest path or actual application requirements. In this embodiment, once the computer device has determined the preset number of check nodes to insert, it may further divide the longest path evenly into the preset number of sub-paths by analyzing the computation time ratio of the computing nodes on the path, so that the computation time of the computing nodes on each sub-path is balanced.
S702. Select at least one of the sub-paths as the insertion position for inserting a check node.
Once the computer device has divided the longest path evenly into the preset number of sub-paths, it can select at least one of the sub-paths as the insertion position for a check node, such that the total computation time of all computing nodes before the selected sub-path is as equal as possible to that of all computing nodes after it, so as to balance the computation time of the computing nodes on the longest path. The method of determining the insertion position of a check node described in the embodiment of FIG. 8 is illustrated by an example. Consider the computation graph shown in FIG. 8A (the computation graph before any check node is inserted), whose longest path contains computing node 1, computing node 2, and computing node 3, with a known computation time ratio of 1:5:6. By analyzing this ratio, the longest path is evenly divided into two parts, so that the combined computation time of computing node 1 and computing node 2 (6 hours) equals the computation time of computing node 3 (6 hours); check node a can then be inserted between computing node 2 and computing node 3 (as shown in FIG. 8B).
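The division in S701 can be sketched as a greedy partition of the path by cumulative computation time. The greedy boundary rule is an illustrative assumption; the patent only requires the sub-path totals to be balanced.

```python
def subpath_cut_points(durations: list, k: int) -> list:
    """S701 sketch: cut the longest path into k sub-paths of roughly equal
    total computation time. A cut after index i means a check node is
    inserted between node i and node i+1 (0-based indices)."""
    total = sum(durations)
    cuts, acc, boundary = [], 0.0, total / k
    for i, d in enumerate(durations[:-1]):
        acc += d
        if acc >= boundary and len(cuts) < k - 1:
            cuts.append(i)
            boundary += total / k
    return cuts
```

With the text's 1:5:6 example and k = 2, this yields a single cut after the second node, i.e. check node a between computing node 2 and computing node 3.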
FIG. 9 is a flowchart of another implementation of S102 in the embodiment of FIG. 2. As shown in FIG. 9, "inserting at least one check node into the computation graph" in S102 includes:
S801. Obtain a start computing node and an end computing node in the computation graph that are separated by at least one computing node.
When a path spanning computing nodes exists in the computation graph, the start computing node and the end computing node on that path can be obtained, so that a check node can subsequently be inserted between them. It should be noted that the number of spanned computing nodes may be one or more, which is not limited in this embodiment.
S802. Insert at least one check node at an intermediate position between the start computing node and the end computing node.
Once the computer device has determined the start computing node and the end computing node, it can insert at least one check node at an intermediate position between them. The method of inserting at least one check node into the computation graph described in the embodiment of FIG. 9 is illustrated by an example. Consider the computation graph shown in FIG. 8A (the computation graph before any check node is inserted), whose longest path contains computing node 1, computing node 2, and computing node 3, where the connection between computing node 1 and computing node 3 spans computing node 2. Computing node 1 is then the start computing node and computing node 3 the end computing node, and correspondingly a check node b is inserted at an intermediate position between computing node 1 and computing node 3 (as shown in FIG. 9A).
FIG. 10 is a flowchart of an implementation of S101 in the embodiment of FIG. 2. As shown in FIG. 10, "obtaining the computation graph of the computational network model" in S101 includes:
S901. Load the topology and parameters of the computational network model.
In practical applications, the compiler of the computer device can obtain the computation graph of the computational network model by loading and compiling the topology and parameters of the computational network model.
S902. Compile the topology and parameters of the computational network model to obtain the computation graph of the computational network model.
When the compiler of the computer device loads the topology and parameters of the computational network model, it can compile them to obtain the computation graph of the computational network model, so that the computation resources consumed by the computation graph can be optimized while it is subsequently run.
In summary, the present application further provides a computation graph optimization method. As shown in FIG. 11, the method includes:
S1001. Load the topology and parameters of the computational network model.
S1002. Compile the topology and parameters of the computational network model to obtain the computation graph of the model.
S1003. Obtain the compute-consumption-time ratio of each computing node on the longest path in the computation graph.
S1004. Determine the insertion position of at least one check node on the longest path according to the compute-consumption-time ratio.
S1005. Insert at least one check node at the insertion position of the at least one check node.
S1006. Obtain a start computing node and an end computing node in the computation graph that are separated by at least one computing node.
S1007. Insert at least one check node at an intermediate position between the start computing node and the end computing node.
S1008. When execution reaches each check node, obtain the current delay performance margin through the current check node.
S1009. Determine whether the current delay performance margin is sufficient. If it is sufficient, perform step S1010; if it is insufficient, perform step S1011.
S1010. Select the storage optimization strategy to optimize the computing nodes after the current check node. The storage optimization strategy includes storing the data generated during computation by the computing nodes after the current check node in a high-access-latency storage space.
S1011. Obtain the size of the data generated during computation by the computing nodes after the current check node, and compare it with the size of the preset storage space. If the data size exceeds the size of the preset storage space, perform step S1012; if it does not, perform step S1013.
S1012. Split the computing nodes after the current check node, and store the data generated during computation by the split computing nodes in a low-access-latency storage space.
S1013. Store the data generated during computation by the computing nodes after the current check node in a low-access-latency storage space.
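The position-selection logic of steps S1003–S1005 can be sketched as follows. This is a minimal illustration only: the function name and the even-time-split heuristic are assumptions for exposition, not an API prescribed by the application.

```python
# Illustrative sketch of steps S1003-S1005: choosing check-node insertion
# positions on the longest path from per-node compute-time ratios.
# All names here are hypothetical.

def insertion_positions(time_ratios, num_checks):
    """Split the longest path into num_checks + 1 segments of roughly
    equal estimated compute time; return the node indices after which
    a check node would be inserted."""
    total = sum(time_ratios)
    target = total / (num_checks + 1)
    positions, acc, boundary = [], 0.0, target
    for i, r in enumerate(time_ratios[:-1]):  # never insert after the last node
        acc += r
        if acc >= boundary and len(positions) < num_checks:
            positions.append(i)  # insert a check node after node i
            boundary += target
    return positions

# With ratios 1:5:6 (as in the worked example below, FIG. 8B) and one
# check node, half of the estimated time (6/12) has elapsed after the
# second node (index 1).
print(insertion_positions([1, 5, 6], 1))  # -> [1]
```

A denser path simply yields more candidate boundaries; the same loop applies unchanged.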
To illustrate the steps of the optimization method across all of the above embodiments, take the computation graph shown in FIG. 8B as an example. Suppose the total preset compute-consumption time of computing node 1, computing node 2, and computing node 3 on the longest path in FIG. 8B is T and the memory footprint is M. The compute-consumption-time ratio of computing node 1, computing node 2, and computing node 3, estimated by the computer device from the computation amount or from a time-estimation model, is 1:5:6, and the preset time threshold is th. When the computer device reaches check node a, let the actual compute-consumption time of computing node 1 and computing node 2 before check node a be Tr. The current delay performance margin is then T*6/12 - Tr. If T*6/12 - Tr > th, the storage optimization strategy is applied to the computing nodes after the check node: specifically, the intermediate results or temporary variables required by the computation in computing node 3 are stored in high-access-latency storage space (off-chip GPU storage or TPU storage). If T*6/12 - Tr ≤ th, the delay optimization strategy is applied instead: the intermediate results or temporary variables required by the computation in computing node 3 are stored in low-access-latency storage space (memory or cache). Before storing, the device may first determine whether the memory or cache space required by the computation of computing node 3 exceeds the existing memory or cache space M; if it does, another delay optimization strategy is adopted, for example splitting computing node 3 and storing the intermediate results or temporary variables required by the split computing nodes in low-access-latency storage space (memory or cache).
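The arithmetic of this worked example can be sketched as follows. The function name, argument names, and return convention are hypothetical, introduced only to make the margin computation and the strategy branch concrete.

```python
# Sketch of the strategy selection in the worked example: total preset
# time T, ratio 1:5:6, a check node after node 2, and threshold th.
# Names are illustrative only.

HIGH_LATENCY, LOW_LATENCY = "off-chip", "on-chip"

def choose_strategy(T, elapsed_ratio, Tr, th, data_size, capacity):
    """Return (strategy, storage tier, split?) for the nodes after the
    current check node, given the actual elapsed time Tr."""
    margin = T * elapsed_ratio - Tr  # current delay performance margin
    if margin > th:
        # Margin is ample: trade latency for memory (storage optimization).
        return ("store", HIGH_LATENCY, False)
    # Margin is tight: trade memory for latency (delay optimization);
    # split the node first if its data would not fit in fast storage.
    split = data_size > capacity
    return ("delay", LOW_LATENCY, split)

# Ratio 1:5:6 means nodes 1 and 2 account for 6/12 of the preset time T.
print(choose_strategy(T=12.0, elapsed_ratio=6/12, Tr=4.0, th=1.0,
                      data_size=8, capacity=16))
# -> ('store', 'off-chip', False)
```

Running the same call with Tr = 6.0 and data_size = 32 exercises the other branch: the margin is no longer above th, so the node is split and its data kept in low-latency storage.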
It should be understood that, although the steps in the flowcharts of FIGS. 2-11 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-11 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose execution order is likewise not necessarily sequential.
In one embodiment, as shown in FIG. 12, a computation graph optimization apparatus is provided, including: a first acquisition module 11, an insertion module 12, a second acquisition module 13, a determination module 14, and an optimization module 15. The first acquisition module 11 is configured to obtain the computation graph of a computational network model, the computation graph including a plurality of computing nodes; the insertion module 12 is configured to insert at least one check node into the computation graph; the second acquisition module 13 is configured to obtain, when execution reaches each check node, the current performance margin through the current check node; the determination module 14 is configured to determine an optimization strategy according to the current performance margin; and the optimization module 15 is configured to optimize the computing nodes after the current check node according to the optimization strategy. For the specific limitations of the computation graph optimization apparatus, reference may be made to the limitations of the computation graph optimization method above, which are not repeated here. Each module of the apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules. In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program. When executing the computer program, the processor implements the following steps: obtaining a computation graph of a computational network model, the computation graph including a plurality of computing nodes; inserting at least one check node into the computation graph; when execution reaches each check node, obtaining the current delay performance margin through the current check node; determining an optimization strategy according to the current delay performance margin; and optimizing the computing nodes after the current check node according to the optimization strategy. The implementation principle and technical effect of this computer device are similar to those of the method embodiments above and are not repeated here.
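As a non-limiting sketch, the five modules of FIG. 12 could be composed as below. The class and method names are hypothetical and do not form part of the claimed apparatus; the sketch only shows how the modules' responsibilities chain together.

```python
# Minimal sketch of the five-module apparatus of FIG. 12 wired together.
# Each module is passed in as a callable; names are hypothetical.

class GraphOptimizer:
    def __init__(self, acquire, insert, measure, decide, optimize):
        self.acquire, self.insert = acquire, insert  # modules 11, 12
        self.measure, self.decide = measure, decide  # modules 13, 14
        self.optimize = optimize                     # module 15

    def run(self, model):
        graph = self.acquire(model)        # first acquisition module
        checks = self.insert(graph)        # insertion module
        for node in checks:                # at each check node...
            margin = self.measure(node)    # second acquisition module
            strategy = self.decide(margin) # determination module
            self.optimize(node, strategy)  # optimization module
        return graph
```

In use, `acquire` would wrap the load-and-compile steps, `measure` the margin computation at a check node, and `optimize` the chosen storage or delay optimization.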
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the following steps: obtaining a computation graph of a computational network model, the computation graph including a plurality of computing nodes; inserting at least one check node into the computation graph; when execution reaches each check node, obtaining the current performance margin through the current check node; determining an optimization strategy according to the current performance margin; and optimizing the computing nodes after the current check node according to the optimization strategy. The implementation principle and technical effect of this computer-readable storage medium are similar to those of the method embodiments above and are not repeated here.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the computing/processing device according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals; such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

For example, FIG. 13 shows a computing/processing device that may implement the method of the present invention. The device may be a computer device and conventionally includes a processor 1010 and a computer program product or computer-readable medium in the form of a memory 1020. The memory 1020 has a storage space 1030 for program code 1031 for performing any of the method steps above; for example, the storage space 1030 may include individual program codes 1031 respectively implementing the various steps of the above method. These program codes may be read from, or written to, one or more computer program products, which include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 14. The storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 1020 in the computing/processing device of FIG. 13, and the program code may, for example, be compressed in an appropriate form. Typically, the storage unit includes computer-readable code 1031', i.e., code readable by a processor such as the processor 1010, which, when run by the computing/processing device, causes the device to perform the steps of the method described above.

References herein to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. In addition, note that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment. Numerous specific details are set forth in the description provided here; it will be understood, however, that embodiments of the present invention may be practiced without these specific details, and in some instances well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim, and the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer; in a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names. The technical features of the above embodiments may be combined arbitrarily; for conciseness of description, not all possible combinations of the technical features of the above embodiments have been described, but as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification. The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be understood as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this invention patent shall be subject to the appended claims.

Claims (16)

  1. A method for optimizing a computation graph, wherein the method comprises:
    obtaining a computation graph of a computational network model, the computation graph comprising a plurality of computing nodes;
    inserting at least one check node into the computation graph;
    when execution reaches each check node, obtaining a current performance margin through the current check node;
    determining an optimization strategy according to the current performance margin; and
    optimizing computing nodes after the current check node according to the optimization strategy.
  2. The method according to claim 1, wherein the current performance margin comprises a current delay performance margin, and the determining an optimization strategy according to the current performance margin comprises:
    if the current delay performance margin is sufficient, determining a storage optimization strategy as the optimization strategy, the storage optimization strategy being used to reduce the memory occupied by the computing nodes during computation;
    if the current delay performance margin is insufficient, determining a delay optimization strategy as the optimization strategy, the delay optimization strategy being used to reduce the time consumed by the computing nodes during computation.
  3. The method according to claim 1, wherein the current performance margin comprises a current storage performance margin, and the determining an optimization strategy according to the current performance margin comprises:
    if the current storage performance margin is sufficient, determining a delay optimization strategy as the optimization strategy, the delay optimization strategy being used to reduce the time consumed by the computing nodes during computation;
    if the current storage performance margin is insufficient, determining a storage optimization strategy as the optimization strategy, the storage optimization strategy being used to reduce the memory occupied by the computing nodes during computation.
  4. The method according to claim 2 or 3, wherein the storage optimization strategy comprises:
    storing data generated during computation by the computing nodes after the check node in a high-access-latency storage space, the high-access-latency storage space comprising at least a global memory and an off-chip memory;
    and/or, the delay optimization strategy comprises: storing data generated during computation by the computing nodes after the check node in a low-access-latency storage space, the low-access-latency storage space comprising at least a cache space and an on-chip memory.
  5. The method according to claim 4, wherein the delay optimization strategy further comprises:
    obtaining the size of the data generated during computation by the computing nodes after the current check node;
    comparing the size of the data generated during computation by the computing nodes with the size of a preset storage space;
    if the size of the data generated during computation by the computing nodes exceeds the size of the preset storage space, splitting the computing nodes after the current check node, and storing the data generated during computation by the split computing nodes in the low-access-latency storage space;
    if the size of the data generated during computation by the computing nodes does not exceed the size of the preset storage space, storing the data generated during computation by the computing nodes after the current check node in the low-access-latency storage space.
  6. The method according to claim 2, wherein the obtaining a current performance margin through the current check node comprises:
    obtaining a first total target compute-consumption time and a total actual compute-consumption time of all computing nodes before the current check node;
    determining the current performance margin according to the first total target compute-consumption time and the total actual compute-consumption time.
  7. The method according to claim 6, wherein the obtaining a first total target compute-consumption time of all computing nodes before the current check node comprises:
    obtaining a second total target compute-consumption time of all computing nodes on the path on which the current check node is located;
    determining the first total target compute-consumption time according to the second total target compute-consumption time and a preset ratio, the preset ratio being the proportion of the total compute-consumption time of all computing nodes before the current check node in the total compute-consumption time of all computing nodes on the path on which the check node is located.
  8. The method according to claim 1, wherein the inserting at least one check node into the computation graph comprises:
    obtaining a compute-consumption-time ratio of computing nodes on the longest path in the computation graph;
    determining an insertion position of at least one check node on the longest path according to the compute-consumption-time ratio;
    inserting the at least one check node at the insertion position of the at least one check node.
  9. The method according to claim 8, wherein the obtaining a compute-consumption-time ratio of computing nodes on the longest path in the computation graph comprises:
    obtaining the computation amount of each computing node on the longest path;
    obtaining the compute-consumption time of each computing node on the longest path according to the computation amount of each computing node;
    determining the compute-consumption-time ratio of the computing nodes on the longest path according to the compute-consumption time of each computing node on the longest path.
  10. The method according to claim 8, wherein the obtaining a compute-consumption-time ratio of each computing node on the longest path in the computation graph comprises:
    constructing a consumption-time estimation model;
    obtaining the compute-consumption time of each computing node on the longest path by using the consumption-time estimation model;
    determining the compute-consumption-time ratio of the computing nodes on the longest path according to the compute-consumption time of each computing node on the longest path.
  11. The method according to claim 9 or 10, wherein the determining an insertion position of at least one check node on the longest path according to the compute-consumption-time ratio comprises:
    dividing the longest path evenly into a preset number of sub-paths according to the compute-consumption-time ratio;
    selecting at least one of the sub-paths as an insertion position for inserting the check node.
  12. The method according to claim 1 or 8, wherein the inserting at least one check node into the computation graph comprises:
    obtaining a start computing node and an end computing node in the computation graph that are separated by at least one computing node;
    inserting at least one check node at an intermediate position between the start computing node and the end computing node.
  13. The method according to claim 1, wherein the obtaining a computation graph of a computational network model comprises:
    loading the topology and parameters of the computational network model;
    compiling the topology and parameters of the computational network model to obtain the computation graph of the computational network model.
  14. An apparatus for optimizing a computation graph, wherein the apparatus comprises:
    a first acquisition module, configured to obtain a computation graph of a computational network model, the computation graph comprising a plurality of computing nodes;
    an insertion module, configured to insert at least one check node into the computation graph;
    a second acquisition module, configured to obtain, when execution reaches each check node, a current performance margin through the current check node;
    a determination module, configured to determine an optimization strategy according to the current performance margin;
    an optimization module, configured to optimize computing nodes after the current check node according to the optimization strategy.
  15. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 13.
  16. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
PCT/CN2020/113290 2019-12-09 2020-09-03 Optimization method and apparatus for computation graph, computer device, and storage medium WO2021114757A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911249112.XA CN111158901B (en) 2019-12-09 2019-12-09 Optimization method, optimization device, computer equipment and storage medium for calculation graph
CN201911249112.X 2019-12-09

Publications (1)

Publication Number Publication Date
WO2021114757A1 (en)

Family

ID=70555798

Country Status (2)

Country Link
CN (1) CN111158901B (en)
WO (1) WO2021114757A1 (en)



Also Published As

Publication number Publication date
CN111158901B (en) 2023-09-08
CN111158901A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
US9031826B2 (en) Method and apparatus for simulating operation in a data processing system
CN103999056B (en) Method, apparatus and system for managing workload memory allocation
US11500959B2 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
CN109669772B (en) Parallel execution method and equipment of computational graph
JP6265033B2 (en) Process migration method, computer system operating to perform process migration, intermediate computing resource in such a system, and computing-resource selection method prior to partitioning for the process migration method
Li et al. Sculptor: Flexible approximation with selective dynamic loop perforation
WO2021114757A1 (en) Optimization method and apparatus for computation graph, computer device, and storage medium
Pinto et al. Refactoring for energy efficiency: A reflection on the state of the art
Huang et al. Novel heuristic speculative execution strategies in heterogeneous distributed environments
US20130318540A1 (en) Data flow graph processing device, data flow graph processing method, and data flow graph processing program
Yu et al. System-wide trade-off modeling of performance, power, and resilience on petascale systems
US20190220257A1 (en) Method and apparatus for detecting inter-instruction data dependency
Deng et al. Cost-driven autonomous mobility
US8661424B2 (en) Auto-generation of concurrent code for multi-core applications
CN117291260A (en) Deep learning framework adaptation method, apparatus, device, storage medium, and product
KR20150101870A (en) Method and apparatus for avoiding bank conflict in memory
Kersten et al. A Hoare logic for energy consumption analysis
JP2009075965A (en) Software development method and software development device
CN114021733B (en) Model training optimization method, device, computer equipment and storage medium
JP5687603B2 (en) Program conversion apparatus, program conversion method, and conversion program
CN102063308B (en) Method for controlling processing flow of seismic prospecting data
US11068250B2 (en) Crowdsourced API resource consumption information for integrated development environments
CN105718223B (en) Method and apparatus for managing workload memory allocation
McKean et al. Use of model‐based architecture attributes to construct a component‐level trade space
Toldin et al. Soundness proof for a Hoare logic for energy consumption analysis

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20898036

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20898036

Country of ref document: EP

Kind code of ref document: A1