WO2021114757A1 - Optimization method and apparatus for computation graph, computer device, and storage medium - Google Patents

Info

Publication number
WO2021114757A1
WO2021114757A1 (application PCT/CN2020/113290; CN 2020113290 W)
Authority
WO
WIPO (PCT)
Prior art keywords
calculation
computing
node
current
check node
Application number
PCT/CN2020/113290
Other languages
French (fr)
Chinese (zh)
Inventor
周舒畅
王田
Original Assignee
北京迈格威科技有限公司
Application filed by 北京迈格威科技有限公司
Publication of WO2021114757A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources to service a request, the resource being the memory
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of computer technology, and in particular to a method, device, computer equipment, and storage medium for optimizing a calculation graph.
  • The optimization process of existing computing network models uses a unified optimization method: an optimization model is designed for a specific computing network model and its application environment according to the hardware index requirements proposed by the user, so that the computer device resources consumed when the optimization model is later compiled and run meet the performance index requirements proposed by the user.
  • However, such an optimization method applies only to the same computing network model and application environment; when the computing network model or the application environment changes, a corresponding optimization method must be redesigned. The adaptability of this optimization method is therefore extremely low, which in turn makes the operating efficiency of the computational network model extremely low.
  • In a first aspect, a method for optimizing a calculation graph includes: obtaining a calculation graph of a calculation network model, where the calculation graph includes multiple calculation nodes; inserting at least one check node into the calculation graph; obtaining a current performance margin through the current check node when execution reaches each check node; determining an optimization strategy according to the current performance margin; and optimizing, according to the optimization strategy, the computing nodes after the current check node.
  • In one embodiment, the current performance margin includes a current delay performance margin.
  • Determining the optimization strategy includes: if the current delay performance margin is sufficient, determining a storage optimization strategy as the optimization strategy, where the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation; if the current delay performance margin is not sufficient, determining a delay optimization strategy as the optimization strategy, where the delay optimization strategy is used to reduce the time consumed by the computing node during calculation.
  • In one embodiment, the current performance margin includes a current storage performance margin, and the optimization strategy is determined according to the current storage performance margin as follows: if the current storage performance margin is sufficient, the delay optimization strategy is determined as the optimization strategy, where the delay optimization strategy is used to reduce the time consumed by the computing node during calculation; if the current storage performance margin is not sufficient, the storage optimization strategy is determined as the optimization strategy, where the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation.
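  • The strategy selection described in the two embodiments above can be sketched as follows; this is a minimal illustration, and the function name, the string strategy labels, and the boolean margin test are assumptions of this sketch rather than the patent's implementation:

```python
def choose_strategy(margin_type, margin_is_sufficient):
    """Map a performance margin to an optimization strategy.

    margin_type: "delay" or "storage" -- which index the margin measures.
    margin_is_sufficient: whether the measured margin meets the threshold.
    Returns the strategy to apply to the computing nodes after the check node.
    """
    if margin_type == "delay":
        # Ample time budget -> spend latency to save memory; tight budget -> the reverse.
        return "storage_optimization" if margin_is_sufficient else "delay_optimization"
    if margin_type == "storage":
        # Ample memory -> spend memory to save time; tight memory -> the reverse.
        return "delay_optimization" if margin_is_sufficient else "storage_optimization"
    raise ValueError(f"unknown margin type: {margin_type}")

print(choose_strategy("delay", True))     # storage_optimization
print(choose_strategy("storage", False))  # storage_optimization
```

  • Note that each strategy spends the resource that is plentiful to save the one that is scarce, which is why the two branches mirror each other.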
  • In one embodiment, the storage optimization strategy includes: storing the data generated during calculation by the computing nodes after the check node in a storage space with high access latency, where the storage space with high access latency includes at least global memory and off-chip memory. And/or, the delay optimization strategy includes: storing the data generated during calculation by the computing nodes after the check node in a storage space with low access latency, where the storage space with low access latency includes at least cache space and on-chip storage.
  • In one embodiment, the delay optimization strategy further includes: obtaining the size of the data generated during calculation by the computing node after the check node; comparing that size with the size of a preset storage space; if the size of the data generated during calculation exceeds the preset storage space, splitting the computing node after the check node and storing the data generated during calculation by the split computing nodes in the storage space with low access latency; if it does not exceed the preset storage space, storing the data generated during calculation by the computing node after the check node in the storage space with low access latency.
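  • The splitting rule in this embodiment can be illustrated as below; the function name, the byte-size inputs, and modeling a split node as fixed-size chunks are assumptions of this sketch, not the patent's implementation:

```python
def place_data(node_data_bytes, preset_storage_bytes):
    """Decide where the data a node produces should live under the delay
    optimization strategy: split the node if its data would overflow the
    low-latency storage, otherwise keep it whole.

    Returns a list of (chunk_bytes, "low_latency") placements; a split node
    is modeled as several chunks no larger than the preset storage space.
    """
    if node_data_bytes <= preset_storage_bytes:
        # Data fits: keep the node whole and place its data in fast storage.
        return [(node_data_bytes, "low_latency")]
    # Data overflows: split into chunks that each fit the low-latency space.
    chunks = []
    remaining = node_data_bytes
    while remaining > 0:
        chunk = min(remaining, preset_storage_bytes)
        chunks.append((chunk, "low_latency"))
        remaining -= chunk
    return chunks
```

  • For example, a node producing 10 units of data against a 4-unit preset storage is split into three chunks, each small enough to stay in the low-latency space.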
  • In one embodiment, obtaining the current performance margin through the current check node includes: obtaining a first total target calculation time and a total actual calculation time of all computing nodes before the current check node, and determining the current delay performance margin according to the first total target calculation time and the total actual calculation time.
  • In one embodiment, obtaining the first total target calculation time of all computing nodes before the current check node includes: obtaining a second total target calculation time of all computing nodes on the path where the current check node is located, and determining the first total target calculation time according to the second total target calculation time and a preset ratio, where the preset ratio is the proportion of the total calculation time of all computing nodes before the current check node to the total calculation time of all computing nodes on the path where the check node is located.
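  • The arithmetic of this embodiment can be illustrated with a small sketch; the function name and the concrete numbers are hypothetical:

```python
def delay_margin(second_total_target_s, preset_ratio, total_actual_s):
    """Estimate the current delay performance margin at a check node.

    second_total_target_s: target time for all nodes on the check node's path.
    preset_ratio: fraction of that path's time attributed to the nodes
                  before the check node.
    total_actual_s: time actually spent on those nodes so far.
    Returns the margin (positive means execution is ahead of the target).
    """
    first_total_target_s = second_total_target_s * preset_ratio
    return first_total_target_s - total_actual_s

# Target for the whole path is 100 ms, and 40% of it lies before the check
# node, so the target up to here is 40 ms; 35 ms actually elapsed, leaving
# about 5 ms of headroom (a sufficient margin).
print(delay_margin(0.100, 0.4, 0.035))
```

  • A positive margin here corresponds to the "sufficient" case of the earlier embodiments, and a negative margin to the "not sufficient" case.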
  • In one embodiment, inserting at least one check node in the calculation graph includes: obtaining the proportion of calculation time consumed by each computing node on the longest path in the calculation graph; determining the insertion position of at least one check node on the longest path according to the calculation time proportions; and inserting at least one check node at the determined insertion position.
  • In one embodiment, obtaining the proportion of calculation time consumed by each computing node on the longest path includes: obtaining the calculation amount of each computing node on the longest path; obtaining the calculation time consumed by each computing node according to its calculation amount; and determining the proportion of calculation time consumed by each computing node on the longest path according to those calculation times.
  • In one embodiment, obtaining the proportion of calculation time consumed by each computing node on the longest path includes: constructing a time-consumption estimation model; using the model to obtain the calculation time consumed by each computing node on the longest path; and determining the proportion of calculation time consumed by each computing node according to those calculation times.
  • In one embodiment, determining the insertion position of at least one check node on the longest path according to the calculation time proportions includes: dividing the longest path into a preset number of sub-paths according to the calculation time proportions, and selecting at least one of the sub-paths as the insertion position for a check node.
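  • One way to read this embodiment is as an even division of the longest path by cumulative calculation time; the function below is an illustrative sketch under that assumption (boundaries fall at the earliest node whose cumulative time reaches each target), not the patent's algorithm:

```python
def check_node_positions(node_times, num_subpaths):
    """Divide the longest path into num_subpaths sub-paths of roughly equal
    total calculation time, and return the node indices after which a check
    node would be inserted (one boundary between consecutive sub-paths)."""
    total = sum(node_times)
    target = total / num_subpaths
    boundaries = []
    acc = 0.0
    for i, t in enumerate(node_times[:-1]):  # never insert after the last node
        acc += t
        # A boundary is placed once the cumulative time reaches the next
        # multiple of the per-sub-path target.
        if acc >= target * (len(boundaries) + 1):
            boundaries.append(i)
            if len(boundaries) == num_subpaths - 1:
                break
    return boundaries

# Four nodes consuming 10%, 40%, 30%, 20% of the path's time, two sub-paths:
# the boundary falls after the second node (index 1).
print(check_node_positions([1, 4, 3, 2], 2))  # [1]
```

  • Dividing by time rather than by node count keeps the check nodes roughly evenly spaced in wall-clock terms, so each measured margin covers a comparable amount of work.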
  • In one embodiment, inserting at least one check node in the calculation graph includes: obtaining a start calculation node and an end calculation node that are separated by at least one calculation node in the calculation graph, and inserting at least one check node at an intermediate position between the start calculation node and the end calculation node.
  • obtaining the calculation graph of the calculation network model includes: loading the topology structure and parameters of the calculation network model; compiling the topology structure and parameters of the calculation network model to obtain the calculation graph of the calculation network model.
  • an optimization device for a calculation graph includes:
  • the first obtaining module is used to obtain a calculation graph of a calculation network model; the calculation graph includes a plurality of calculation nodes;
  • the insertion module is used to insert at least one check node in the calculation graph;
  • the second obtaining module is used to obtain the current performance margin through the current check node when running to each check node;
  • the determining module is used to determine the optimization strategy according to the current performance margin
  • the optimization module is used to optimize the computing nodes after the current check node according to the optimization strategy.
  • In a third aspect, a computer device includes a memory and a processor, the memory storing a computer program; when the processor executes the computer program, it implements the calculation graph optimization method described in any embodiment of the first aspect.
  • a computer-readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the method for optimizing the calculation graph described in any one of the embodiments of the first aspect.
  • According to the method, device, computer equipment, and storage medium for optimizing calculation graphs provided in this application, a calculation graph of a calculation network model including multiple calculation nodes is obtained, at least one check node is inserted in the calculation graph, and when execution reaches each check node, the current performance margin is obtained through the current check node; an optimization strategy is then determined according to the current performance margin, and the resources to be consumed by the computing nodes after the current check node are optimized according to the optimization strategy.
  • By inserting check nodes, the above optimization method obtains the current performance margin of the computer equipment whenever execution reaches a check node, and then selects an optimization strategy that matches the actual operation of the computer equipment to optimize the resources consumed by the computing nodes after the check node. In this way, while the computer equipment runs the calculation graph, the resource usage of each computing node can be dynamically adjusted to meet the user's performance index requirements for the calculation graph and to improve the resource utilization of the computer equipment.
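  • The overall loop described above (run the graph, measure the margin at each check node, re-optimize the remaining nodes) can be sketched as follows; the Node class, the callback signatures, and the string tagging are all assumptions of this illustration, not the patent's implementation:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    is_check_node: bool = False

def run_graph(nodes, measure_margin, optimize):
    """Execute nodes in order; at each check node, measure the current
    performance margin and re-optimize the nodes that have not yet run."""
    executed = []
    for i, node in enumerate(nodes):
        if node.is_check_node:
            margin = measure_margin(executed)
            nodes[i + 1:] = optimize(nodes[i + 1:], margin)
        else:
            executed.append(node.name)
    return executed

def measure(executed):
    # Stand-in measurement: pretend the delay margin is always sufficient.
    return {"delay_sufficient": True}

def optimize(rest, margin):
    # Tag the remaining nodes with the strategy chosen from the margin.
    strategy = "storage" if margin["delay_sufficient"] else "delay"
    return [Node(f"{n.name}[{strategy}]") for n in rest]

ops = [Node("conv1"), Node("check", is_check_node=True), Node("conv2"), Node("fc")]
print(run_graph(ops, measure, optimize))  # ['conv1', 'conv2[storage]', 'fc[storage]']
```

  • The key property of the loop is that only the nodes after the check node are rewritten, so already-executed work is never revisited.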
  • FIG. 1 is a schematic diagram of the internal structure of a computer device provided by an embodiment
  • FIG. 2 is a flowchart of a method for optimizing a calculation graph according to an embodiment
  • FIG. 2A is a flowchart of a method for optimizing a calculation graph provided by an embodiment;
  • FIG. 3 is a flowchart of an implementation manner of S103 in the embodiment of FIG. 2;
  • FIG. 4 is a flowchart of an implementation manner of S201 in the embodiment of FIG. 3;
  • FIG. 5 is a flowchart of an implementation manner of S102 in the embodiment of FIG. 2;
  • FIG. 6 is a flowchart of an implementation manner of S401 in the embodiment of FIG. 5;
  • FIG. 7 is a flowchart of another implementation manner of S401 in the embodiment of FIG. 5;
  • FIG. 8 is a flowchart of another implementation manner of S402 in the embodiment of FIG. 5;
  • FIG. 8A is a schematic structural diagram of a calculation graph provided by an embodiment
  • FIG. 8B is a schematic structural diagram of a calculation graph provided by an embodiment
  • FIG. 9 is a flowchart of another implementation manner of S102 in the embodiment of FIG. 2;
  • FIG. 9A is a schematic structural diagram of a calculation graph provided by an embodiment
  • FIG. 10 is a flowchart of an implementation manner of S101 in the embodiment of FIG. 2;
  • FIG. 11 is a flowchart of a method for optimizing a calculation graph provided by an embodiment
  • FIG. 12 is a schematic structural diagram of an optimization device for a calculation graph provided by an embodiment
  • FIG. 13 schematically shows a block diagram of a computing processing device for executing the method according to the present invention;
  • FIG. 14 schematically shows a storage unit for holding or carrying program codes for implementing the method according to the present invention.
  • the method for optimizing the calculation graph provided in this application can be applied to the computer device as shown in FIG. 1.
  • the computer device may be a server or a terminal, and its internal structure diagram may be as shown in FIG. 1.
  • the computer equipment includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize an optimization method of the calculation graph.
  • the display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen
  • the input device of the computer equipment can be a touch layer covering the display screen, a button, a trackball, or a touchpad set on the housing of the computer equipment, or an external keyboard, touchpad, or mouse.
  • FIG. 1 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied.
  • a specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • FIG. 2 is a flowchart of a method for optimizing a calculation graph provided by an embodiment.
  • the execution body of the method is the computer device in FIG. 1, and the method involves the computer device running the calculation graph of a calculation network model.
  • the specific process of optimizing the calculation graph is shown in FIG. 2, and the method specifically includes the following steps:
  • the computing network model may be constructed by computer equipment in advance according to actual application requirements, and it may specifically be a computing network model with various functional applications, such as a neural network model, a machine learning network model, and an intelligent algorithm network model.
  • Computational graphs are a kind of "language" for describing calculation methods; a computational graph is composed of multiple computing nodes, and computing nodes with dependencies are connected to each other.
  • the computing node may include code for executing a certain computing function, so that when the computer device runs to the computing node, it can execute the corresponding computing task in the computing network model.
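  • A computation graph of this kind can be illustrated with a minimal sketch, where each node holds a function and the names of the nodes it depends on; the dictionary representation and the `run` helper below are assumptions of this illustration, not the patent's data structure:

```python
# graph: {node_name: (function, [names of dependency nodes])}
def run(graph, inputs):
    """Evaluate every node of a small computation graph, following the
    dependency edges between nodes (memoized depth-first evaluation)."""
    values = dict(inputs)
    def evaluate(name):
        if name not in values:
            fn, deps = graph[name]
            values[name] = fn(*[evaluate(d) for d in deps])
        return values[name]
    return {name: evaluate(name) for name in graph}

g = {
    "sum": (lambda a, b: a + b, ["x", "y"]),   # depends on the inputs x, y
    "sq":  (lambda s: s * s,    ["sum"]),      # depends on the node "sum"
}
print(run(g, {"x": 2, "y": 3})["sq"])  # 25
```

  • Each entry plays the role of a computing node: when execution reaches it, the attached function is run on the outputs of its dependencies, just as the description above says a node's code executes its computing task.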
  • the computer device can compile a preset calculation network model through a compiler to generate a compiled calculation graph.
  • the computer device may also directly obtain, through other methods, the calculation graph of the calculation network model after compilation, which is not limited in this embodiment.
  • the computer device may also construct the computational network model according to actual application requirements in advance, and then compile the computational network model based on the constructed computational network model for later use.
  • the computer device can also directly obtain a pre-compiled computing network model, and then compile it based on the obtained computing network model for later use during runtime, which is not limited in this embodiment.
  • the check node may include code for executing a certain calculation or test function, so that when the computer device runs to the check node, the corresponding calculation or test task can be executed; the check node may be pre-configured by the computer device.
  • after the computer device obtains the calculation graph of the calculation network model, if the calculation graph needs to be optimized while it is later run, at least one check node may be inserted into the calculation graph.
  • through the check node, the computer device can detect its resource consumption at the current moment and dynamically adjust the resource utilization of subsequent computing nodes according to that consumption, so that while the calculation graph is executed, the resources consumed can always meet the performance indexes proposed by the user or reach the optimum, making full use of the resources of the computer equipment.
  • the current performance margin represents the margin between the computer device resources actually consumed when the computer device runs to the current check node and the computer device resources indicated by the performance index expected by the user.
  • the current performance margin may be the performance margin of a delay performance index, the performance margin of a storage performance index, or the performance margin of another type of performance index consumed by the computer equipment during calculation.
  • the computer device can obtain the current performance margin by executing the code on the check node, so that it can then, according to the current performance margin, optimize the computing nodes after the check node in different ways, allowing the optimized computing nodes to make full use of the resources of the computer device when they are executed.
  • S104 Determine an optimization strategy according to the current performance margin.
  • the optimization strategy is used to optimize the resources consumed by the computing nodes after the check node, so that the resources consumed by the subsequent computing nodes when executed can meet user needs or match the performance indicators of the computing device.
  • when the computer device obtains the current performance margin through the current check node, it can determine whether the current performance margin is sufficient, and then select different optimization strategies according to the judgment result to dynamically optimize the computing nodes after the check node in the calculation graph.
  • for example, if the current performance margin indicates that the margin of the storage performance index is not sufficient, an optimization strategy that reduces memory usage can be used during compilation and running to reduce the storage performance consumption of the computer equipment, so that all performance indexes of the computer equipment can meet user needs;
  • if the current performance margin indicates that the margin of the delay performance index is not sufficient, an optimization strategy that reduces memory access operations with high access latency can be used during compilation and running to reduce the delay performance consumption of the computer equipment, so that all performance indexes of the computer equipment can meet user needs;
  • similarly, if the current performance margin indicates that the margin of the storage performance index is sufficient, the optimization strategy that reduces memory access operations with high access latency can be used during compilation and running, so as to reduce the delay performance consumption of the computer equipment during calculation.
  • the parameters or variables on the computing node after the current check node can be optimized according to the optimization strategy.
  • for example, the computer device can change the storage method of the parameters or variables on a computing node, thereby changing the time the computing node takes to read or write data, and in turn the computing time when the computer device runs to that node, so as to improve the delay performance of the computer device and complete the optimization of the computing node.
  • a computer device can also split a computing node, so that the resources consumed by one computing node are divided among multiple computing nodes, reducing the resource-consumption burden when the computer device runs to each computing node and completing the optimization of the computing node.
  • in the method for optimizing a calculation graph provided in this embodiment, a calculation graph of a calculation network model including multiple calculation nodes is obtained, at least one check node is inserted in the calculation graph, and when execution reaches each check node, the current performance margin is obtained through the current check node; an optimization strategy is then determined according to the current performance margin, and the resources to be consumed by the computing nodes after the current check node are optimized according to the optimization strategy.
  • by inserting check nodes, the optimization method obtains the current performance margin of the computer equipment whenever execution reaches a check node, and then selects an optimization strategy matching the actual operation of the computer equipment to optimize the resources consumed by the computing nodes after the check node, so that while the computer equipment runs the calculation graph, the resource usage of each computing node can be dynamically adjusted to meet the user's performance index requirements for the calculation graph and to improve the resource utilization of the computer equipment.
  • in one embodiment, the foregoing current performance margin includes a performance margin representing a delay performance index, that is, the current delay performance margin.
  • accordingly, this application provides an implementation of the foregoing S104, including: if the current delay performance margin is sufficient, determining the storage optimization strategy as the optimization strategy, where the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation.
  • This embodiment relates to an application scenario in which the computer device finds the current delay performance margin sufficient, indicating that the computing time available to the computer device is ample and can also meet the computing requirements of later computing nodes.
  • the computer device therefore does not need to pay attention to the computing time consumed by the computing nodes, and can instead focus on the memory they occupy during calculation, optimizing the memory resources of the computer device and preventing too much memory from being occupied, which would affect the computing performance of the computer device and in turn the speed at which the calculation graph is executed.
  • the aforementioned storage optimization strategy may specifically include: storing data generated by the computing node after the check node during calculation in a storage space with high access latency; the storage space with high access latency includes at least global memory and off-chip memory.
  • the data generated by the calculation node during calculation may include intermediate results and temporary variables required in the calculation.
  • when the computer device optimizes the computing nodes after the check node according to the storage optimization strategy, it can specifically store the data generated by those computing nodes in a storage space with high access latency, for example, the global memory of a GPU or the off-chip storage of a TPU, to reduce the memory occupancy of the computer equipment and thereby increase its computing speed.
  • if the current delay performance margin is not sufficient, the delay optimization strategy is determined as the optimization strategy; the delay optimization strategy is used to reduce the computing time spent by the computing node during calculation.
  • This embodiment relates to an application scenario in which the computer device finds the current delay performance margin insufficient, indicating that the computing time available to the computer equipment is tight and may not meet the computing requirements of later computing nodes.
  • the computer equipment therefore needs to focus on the computing time consumed by computing nodes during calculation, and can ignore the memory they occupy, optimizing the delay performance of the computer device and preventing computing nodes from consuming too much time during calculation, which would affect the speed at which the calculation graph is executed.
  • the above-mentioned delay optimization strategy may specifically include: storing data generated by the computing node after the check node during calculation in a low-access-latency storage space; the low-access-latency storage space includes at least cache space and on-chip storage.
  • when the computer device optimizes the computing nodes after the check node according to the delay optimization strategy, it can specifically store the data generated by those computing nodes in a storage space with low access latency, for example, the memory or cache of the computer equipment, to reduce the time the computing nodes spend accessing storage during calculation, thereby increasing the computing speed of the computing nodes and of the computer equipment.
  • the foregoing current performance margin includes a performance margin representing a storage performance index, that is, the current storage performance margin.
  • accordingly, this application provides an implementation of the foregoing S104, including: if the current storage performance margin is sufficient, determining the delay optimization strategy as the optimization strategy, where the delay optimization strategy is used to reduce the time consumed by the computing node during calculation.
  • This embodiment relates to an application scenario in which the computer device finds the current storage performance margin sufficient, indicating that the memory resources of the computer device are ample and can also meet the computing needs of later computing nodes.
  • the computer device therefore needs to focus on the computing time consumed by the computing nodes during calculation, and can ignore the memory they occupy, optimizing the delay performance of the computer device and preventing computing nodes from spending too much time during calculation, which would affect the speed at which the calculation graph is executed.
  • if the current storage performance margin is not sufficient, the storage optimization strategy is determined as the optimization strategy; the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation.
  • This embodiment relates to an application scenario in which the computer device finds the current storage performance margin insufficient, indicating that the memory resources of the computer device are tight and may not meet the computing needs of later computing nodes.
  • the computer device therefore needs to focus on the memory occupied by the computing nodes during calculation, and can ignore the time they consume, optimizing the storage performance of the computer device and preventing computing nodes from occupying too much memory during calculation, which would affect the speed at which the calculation graph is executed.
  • the delay optimization strategy may further include:
  • This embodiment is applicable to an application scenario when the memory or cache on a computer device cannot meet the memory or cache required by the computing node for calculation.
  • when the optimization strategy determined by the computer device is the delay optimization strategy, the size of the data generated during calculation by the computing node after the current check node can be obtained first, so as to estimate, based on that size, whether the memory or cache on the computer device meets the computing requirements.
  • Step S1042: Compare the size of the data generated by the computing node during calculation with the size of the preset storage space. If the size of the data generated by the computing node during calculation exceeds the size of the preset storage space, step S1043 is executed; if it does not exceed the size of the preset storage space, step S1044 is executed.
  • When the computer device obtains the size of the data generated by the computing node during calculation, it can further compare that size with the size of the preset storage space to obtain a comparison result, and then select different delay optimization strategies according to the comparison result to optimize the computing nodes after the current check node.
  • the aforementioned preset storage space may be a storage space with low access latency, such as the memory and/or cache space of a computer device.
  • The above comparison result falls into two cases: the size of the data generated by the computing node during calculation exceeds the size of the preset storage space, which indicates that the existing storage space on the computer device cannot meet the computing requirements of the computing node; or the size of the data does not exceed the size of the preset storage space, which indicates that the existing storage space on the computer device is still relatively abundant and can meet the computing requirements of the computing node.
  • S1043: Split the computing node after the current check node, and store the data generated during calculation by the split computing nodes in a storage space with low access latency.
  • This embodiment relates to an application scenario in which the above-mentioned comparison result is that the size of the data generated by the computing node during calculation exceeds the size of the preset storage space.
  • In this scenario, the computer device can split the computing node after the check node, and then store the data generated during calculation by the split computing nodes in a storage space with low access latency, that is, the memory and/or cache on the computer device. Because the computing node has been split, the existing storage space on the computer device can accommodate the storage space required by each of the split computing nodes during calculation.
  • This embodiment relates to an application scenario where the result of the above comparison is that the size of the data generated by the computing node during calculation does not exceed the size of the preset storage space.
  • In this scenario, the computer device can directly store the data generated during calculation by the computing nodes after the check node in a storage space with low access latency. This step is the same as that described above for the delay optimization strategy; for details, please refer to the foregoing description, which will not be repeated here.
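The branching of steps S1042–S1044 can be sketched as follows. This is an illustrative simplification in which data sizes are plain integers and a node is split into the smallest number of near-equal parts that each fit the low-latency space; the splitting rule itself is an assumption, since the description does not fix one.

```python
def delay_optimize(data_size: int, fast_space: int) -> list[int]:
    """Return the data-size parts to place in low-access-latency storage.

    If the node's data fits the preset storage space, it is stored as-is
    (S1044); otherwise the node is split so that every part fits (S1043).
    """
    if data_size <= fast_space:
        return [data_size]                      # store directly
    parts = -(-data_size // fast_space)         # ceiling division
    base, extra = divmod(data_size, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]
```

For example, a node producing 10 units of data against a 4-unit fast space is split into parts of sizes 4, 3, and 3, each of which fits.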
  • Fig. 3 is a flowchart of an implementation manner of S103 in the embodiment of Fig. 2. As shown in Fig. 3, the “obtain current performance margin through the current check node” in S103 includes:
  • The first total target calculation consumption duration indicates the cumulative calculation duration that the user expects all the computing nodes before the current check node to take while performing their calculations.
  • The total actual calculation consumption duration refers to the actual calculation time accumulated by all the computing nodes before the current check node, measured when the computer device runs to the current check node.
  • When the computer device needs to obtain the current performance margin, it can first obtain the first total target calculation consumption duration and the total actual calculation consumption duration of all computing nodes before the current check node, and then determine the current performance margin from these two durations.
  • S202 Determine the current performance margin according to the first total target calculation consumption time period and the total actual calculation consumption time period.
  • When the computer device obtains the first total target calculation consumption duration and the total actual calculation consumption duration, it can directly compute their difference to obtain the current performance margin; optionally, it can also weight the two durations separately before computing the difference to obtain the current performance margin.
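The two variants just described (a plain difference, or a weighted difference) might be sketched like this; the weight parameters and their default values are illustrative assumptions.

```python
def current_margin(target_total: float, actual_total: float,
                   w_target: float = 1.0, w_actual: float = 1.0) -> float:
    """Current delay performance margin from the first total target duration
    and the total actual duration. With unit weights this is a plain
    difference; otherwise it is a weighted difference."""
    return w_target * target_total - w_actual * actual_total
```

A positive margin means execution is ahead of the user's target; a negative margin means it is behind.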
  • a method of "obtaining the first total target calculation consumption time length of all computing nodes before the current check node" in S201 may specifically include:
  • the second total target calculation consumption time indicates the accumulated calculation consumption time expected by the user during calculation of all the computing nodes on the path where the current check node is located.
  • When the computer device needs to obtain the first total target calculation consumption duration of all computing nodes before the current check node, it can first obtain the second total target calculation consumption duration of all computing nodes on the path where the current check node is located, and then determine the first total target calculation consumption duration according to the second.
  • S302: Determine the first total target calculation consumption duration according to the second total target calculation consumption duration and a preset ratio; the preset ratio is the proportion of the total calculation consumption duration of all computing nodes before the current check node in the total calculation consumption duration of all computing nodes on the path where the check node is located.
  • the preset ratio can be obtained in advance by a computer device, and can be specifically obtained by a variety of methods.
  • For example, the computer device can pre-calculate the calculation amount of each computing node in the calculation graph, estimate the calculation consumption duration of each computing node based on its calculation amount, and finally determine the preset ratio from the total calculation consumption duration of the computing nodes before the current check node and the total calculation consumption duration of all the computing nodes.
  • Alternatively, the computer device can use an existing consumption time estimation model to estimate the calculation consumption duration of each computing node, and then correspondingly determine the preset ratio from the total calculation consumption duration of the computing nodes before the current check node and the total calculation consumption duration of all the computing nodes.
  • the preset ratio may also be determined by other methods, which is not limited in this embodiment.
  • Specifically, the second total target calculation consumption duration may be multiplied by the preset ratio to obtain the first total target calculation consumption duration. For example, if the second total target calculation consumption duration is 10 hours and the preset ratio is 1/2, the corresponding first total target calculation consumption duration is 5 hours.
  • Optionally, the computer device may also weight the second total target calculation consumption duration before performing the multiplication with the preset ratio to obtain the first total target calculation consumption duration.
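The 10-hour example above can be reproduced with a one-line sketch; the optional `weight` parameter is an assumption standing in for the weighted variant mentioned in this paragraph.

```python
def first_total_target(second_total: float, preset_ratio: float,
                       weight: float = 1.0) -> float:
    # First total target duration = (weighted) second total target * preset ratio.
    return weight * second_total * preset_ratio


print(first_total_target(10.0, 0.5))   # the 10-hour, ratio-1/2 example
```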
  • Fig. 5 is a flowchart of an implementation manner of S102 in the embodiment of Fig. 2. As shown in Fig. 5, “insert at least one check node in the calculation graph” in S102 includes:
  • The computer device may first determine the longest path according to the layout of the computing nodes in the calculation graph, then obtain the calculation consumption duration of each computing node on the longest path, and then compute from these durations the proportion of computing time consumed by each computing node on the longest path.
  • S402 Determine the insertion position of at least one check node on the longest path according to the proportion of the calculated consumption time.
  • In order to balance the computing time consumed on the longest path, when inserting a check node the computer device can determine the insertion position of at least one check node on the longest path according to the proportion of computing time consumed, so that the total computing time of all computing nodes before the inserted check node and that of all computing nodes after it are as equal as possible.
  • When the computer device has determined the insertion position of at least one check node on the longest path, it can insert at least one check node at that position, so as to later optimize the computing nodes after the check node.
  • this application provides a specific way for a computer device to obtain the proportion of computing time consumed.
  • One implementation of the above S401, "obtain the proportion of computing time consumed by each computing node on the longest path in the calculation graph", may specifically include:
  • When the computer device compiles the computing network model to obtain the corresponding calculation graph, it can obtain the calculation amount of each computing node in the calculation graph according to the calculation steps included in each computing node. Therefore, in this embodiment, the computer device can first determine the longest path in the calculation graph, then determine the computing nodes included on the longest path, and then obtain the calculation amount of each computing node on the longest path from information such as the calculation steps each of them includes.
  • When the computer device obtains the calculation amount of each computing node on the longest path, it can further estimate the calculation consumption duration of each computing node according to its calculation amount, thereby obtaining the calculation consumption duration of each computing node on the longest path.
  • The greater the calculation amount of a computing node, the longer its estimated calculation consumption duration; the smaller the calculation amount, the shorter the estimated duration.
  • S503 Determine the proportion of the computing time consumed by each computing node on the longest path according to the computing time consumed by each computing node on the longest path.
  • When the computer device obtains the calculation consumption duration of each computing node on the longest path, it can compute the ratio among these durations to obtain the proportion of the calculation consumption duration of each computing node on the longest path.
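Step S503, computing each node's share of the total consumption time, is a simple normalization. A minimal sketch, assuming the durations are given as a list of numbers:

```python
def time_proportions(durations: list[float]) -> list[float]:
    """Proportion of the total calculation time consumed by each
    computing node on the longest path."""
    total = sum(durations)
    return [d / total for d in durations]
```

With per-node durations in the ratio 1:5:6 (as in the worked example later in this description), the proportions are 1/12, 5/12, and 6/12.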
  • the present application provides another specific way for the computer device to obtain the proportion of computing time consumed.
  • Another implementation of the above S401, "obtain the proportion of computing time consumed by each computing node on the longest path in the calculation graph", may specifically include:
  • When the computer device needs to obtain the proportion of computing time consumed by each computing node on the longest path, it can first construct a consumption time estimation model that analyzes the calculation steps included in each computing node and estimates its calculation consumption.
  • Optionally, the consumption time estimation model may be a pre-trained estimation model; such models belong to the prior art and are not described in detail here.
  • S602. Use the consumption time estimation model to obtain the calculation consumption time of each computing node on the longest path.
  • Specifically, the consumption time estimation model can estimate the calculation consumption duration of each computing node on the longest path by analyzing the calculation steps of each computing node on that path.
  • S603 Determine the proportion of the computing time consumed by each computing node on the longest path according to the computing time consumed by each computing node on the longest path.
  • Step S603 in this embodiment is the same in content as step S503.
  • step S402 determines the insertion position of at least one check node on the longest path according to the proportion of the calculated consumption time.
  • the preset number represents the number of pre-inserted check nodes on the longest path
  • the preset number can be determined by the computer device in advance according to the length of the longest path or actual application requirements.
  • When the computer device determines the preset number of check nodes to be inserted, it can further divide the longest path into a preset number of sub-paths by analyzing the proportion of computing time consumed by the computing nodes on the longest path, so as to balance the computing time consumed by the computing nodes on each sub-path.
  • At least one sub-path can then be selected from the multiple sub-paths as the insertion position of the check node, where the total computing time of all computing nodes before the selected sub-path and the total computing time of all computing nodes after it are as equal as possible, so as to balance the computing time of the computing nodes on the longest path.
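The placement rule described above — equalize the cumulative computing time before and after each inserted check node — can be sketched as a greedy scan over the per-node durations. The index convention and the even-split target are illustrative assumptions.

```python
def checkpoint_positions(durations: list[float], num_checks: int) -> list[int]:
    """Indices of the nodes after which to insert check nodes, so that the
    longest path is divided into num_checks + 1 sub-paths of roughly
    equal total computing time."""
    total = sum(durations)
    positions, cumulative, k = [], 0.0, 1
    for i, d in enumerate(durations):
        cumulative += d
        # Insert the k-th check node once k/(num_checks+1) of the work is done.
        while k <= num_checks and cumulative >= total * k / (num_checks + 1):
            positions.append(i)
            k += 1
    return positions


print(checkpoint_positions([1, 5, 6], 1))   # one check node after the second node
```

With durations in the ratio 1:5:6 and one check node, the check node lands after the second computing node, matching the placement of check node a in the worked example later in this description.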
  • Fig. 9 is a flowchart of another implementation manner of S102 in the embodiment of Fig. 2. As shown in Fig. 9, "insert at least one check node in the calculation graph" in S102 includes:
  • the beginning and end computing nodes on the path can be obtained, so that check nodes can be inserted between the beginning and end computing nodes later.
  • the number of cross-computing nodes may be one or multiple, which is not limited in this embodiment.
  • At least one check node can be inserted between the start computing node and the end computing node. The method of inserting at least one check node in the calculation graph described in the embodiment of FIG. 9 is illustrated by way of example with the calculation graph shown in FIG.
  • FIG. 10 is a flowchart of an implementation manner of S101 in the embodiment of FIG. 2. As shown in FIG. 10, the "acquisition of the calculation diagram of the calculation network model" in the above S101 includes:
  • the compiler of the computer equipment can obtain the calculation graph of the calculation network model by loading the topology structure and parameters of the calculation network model to compile.
  • When the compiler of the computer device loads the topological structure and parameters of the computing network model, it can compile them to obtain the calculation graph of the computing network model, so that the computing resources consumed by the calculation graph can be optimized later while the calculation graph is running.
  • this application also provides a calculation graph optimization method, as shown in FIG. 11, the method includes:
  • step S1009 Determine whether the current delay performance margin is sufficient. If the current delay performance margin is sufficient, perform step S1010; if the current delay performance margin is insufficient, perform step S1011.
  • the storage optimization strategy includes: storing data generated during the calculation of the computing nodes after the current check node in a storage space with high access latency.
  • Step S1011: Obtain the size of the data generated during calculation by the computing node after the current check node, and compare it with the size of the preset storage space. If the size of the data generated by the computing node during calculation exceeds the size of the preset storage space, step S1012 is executed; if it does not exceed the size of the preset storage space, step S1013 is executed.
  • the steps of the optimization method are illustrated by way of example.
  • The calculation graph shown in FIG. 8B is taken as an example for description. Assume that computing node 1, computing node 2, and computing node 3 are on the longest path in FIG. 8B, that their total preset calculation consumption duration is T, and that the memory usage is M.
  • Assume the ratio of the calculation durations of computing node 1, computing node 2, and computing node 3, estimated by the computer device from the calculation amounts or from the consumption time estimation model, is 1:5:6, and set the preset time threshold to th. Then, when the computer device runs to check node a, if the actual calculation duration of computing node 1 and computing node 2 before check node a is Tr, the current delay performance margin is T*6/12-Tr.
  • If T*6/12-Tr≥th, the storage optimization strategy is used to optimize the computing nodes after the check node; specifically, the intermediate results or temporary variables required for calculation in computing node 3 are stored in the storage space with high access latency (off-chip GPU storage space or TPU storage space). If T*6/12-Tr<th, the corresponding delay optimization strategy is used to optimize the computing nodes after the check node; specifically, the intermediate results or temporary variables required for calculation in computing node 3 are stored in the storage space with low access latency (memory or cache).
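The worked example can be checked with a short script. T, Tr, and th are the symbols from the example, and the 6/12 factor is the share of the 1:5:6 duration ratio that lies before check node a; the function name is an illustrative assumption.

```python
def decide_at_check_node_a(T: float, Tr: float, th: float) -> str:
    """Strategy choice at check node a in the FIG. 8B example."""
    margin = T * 6 / 12 - Tr   # target time before check node a, minus actual
    if margin >= th:
        return "storage_optimization"   # time to spare: save memory
    return "delay_optimization"         # behind schedule: save time


print(decide_at_check_node_a(T=12.0, Tr=4.0, th=1.0))   # margin 2.0
print(decide_at_check_node_a(T=12.0, Tr=5.5, th=1.0))   # margin 0.5
```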
  • An optimization apparatus for a calculation graph is provided, including: a first acquisition module 11, an insertion module 12, a second acquisition module 13, a determination module 14, and an optimization module 15, wherein:
  • the first obtaining module 11 is used to obtain the calculation graph of the calculation network model;
  • the calculation graph includes a plurality of calculation nodes;
  • the insertion module 12 is used to insert at least one check node in the calculation graph;
  • the second acquisition module 13 is used to obtain, when running to each check node, the current performance margin through the current check node;
  • the determination module 14 is used to determine the optimization strategy according to the current performance margin; and the optimization module 15 is used to optimize the computing nodes after the current check node according to the optimization strategy.
  • Each module in the optimization device of the above calculation graph can be implemented in whole or in part by software, hardware and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device including a memory and a processor, and a computer program is stored in the memory.
  • When the processor executes the computer program, the following steps are implemented: obtaining a calculation graph of a computing network model, the calculation graph including a plurality of computing nodes; inserting at least one check node in the calculation graph; when running to each check node, obtaining the current delay performance margin through the current check node; determining the optimization strategy according to the current delay performance margin; and optimizing the computing nodes after the current check node according to the optimization strategy.
  • the implementation principle and technical effect of a computer device provided by the foregoing embodiment are similar to those of the foregoing method embodiment, and will not be repeated here.
  • a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed, the following steps are also implemented: obtaining a calculation graph of a computing network model, the calculation graph including a plurality of computing nodes; inserting at least one check node in the calculation graph; when running to each check node, obtaining the current performance margin through the current check node; determining the optimization strategy according to the current performance margin; and optimizing the computing nodes after the current check node according to the optimization strategy.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.
  • the various component embodiments of the present invention may be implemented by hardware, or by software modules running on one or more processors, or by a combination of them.
  • a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the computing processing device according to the embodiments of the present invention.
  • the present invention can also be implemented as a device or device program (for example, a computer program and a computer program product) for executing part or all of the methods described herein.
  • Such a program for realizing the present invention may be stored on a computer-readable medium, or may have the form of one or more signals.
  • FIG. 13 shows a computing processing device that can implement the method of the present invention.
  • the computing processing device may be a computer device, which traditionally includes a processor 1010 and a computer program product in the form of a memory 1020 or a computer readable medium.
  • the memory 1020 has a storage space 1030 for executing program codes 1031 of any method steps in the above methods.
  • the storage space 1030 for program codes may include various program codes 1031 respectively used to implement various steps in the above method. These program codes can be read from or written into one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards, or floppy disks.
  • Such a computer program product is usually a portable or fixed storage unit as described with reference to FIG. 14.
  • the storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 1020 in the computing processing device of FIG. 13.
  • The program code can, for example, be compressed in an appropriate form.
  • The storage unit includes computer-readable code 1031', that is, code that can be read by a processor such as the processor 1010. When run by a computing processing device, this code causes the computing processing device to execute the various steps of the method described above.

Abstract

An optimization method and apparatus for a computation graph, a computer device, and a storage medium. The method comprises: obtaining a computation graph of a computation network model; then inserting at least one check node in the computation graph; when running to each check node, obtaining the current performance margin by means of the current check node; then determining an optimization policy according to the current performance margin; and according to the optimization policy, optimizing a resource that needs to be consumed by a computation node after the current check node. According to the optimization method, by inserting the check nodes, obtaining the current performance margin of a computer device when the computer device runs to each check node, and according to the current performance margin, selecting the optimization policy meeting the actual running condition of the computer device to optimize the resource that needs to be consumed by the computation node, the resource use condition of each computation node during computation can be dynamically adjusted in the process that the computer device runs to the computation node, thereby improving the utilization rate of the resource on the computer device.

Description

Optimization method and apparatus for computation graph, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 9, 2019, with application number 201911249112.X and invention title "Optimization Method and Apparatus for Computation Graph, Computer Device, and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to an optimization method and apparatus for a computation graph, a computer device, and a storage medium.
Background
With the development of computer network technology and the advent of the big data era, the computing network models applied in various technical fields have become increasingly complex. Highly complex computing network models, such as neural network models, challenge the hardware capabilities of computer devices. Therefore, how to optimize computing network models has become a problem of growing concern to researchers.
The optimization process of an existing computing network model uses a unified optimization method: according to the hardware index requirements proposed by the user, an optimization model is designed for the specific computing network model in combination with the application environment, so that the computer device resources consumed when the optimization model is later compiled and run can meet the performance index requirements proposed by the user.
However, the above optimization method can only be applied to one computing network model and application environment. When the computing network model or the application environment changes, the corresponding optimization method needs to be redesigned. Therefore, the adaptability of the above optimization method is extremely low, which in turn makes the operation of the computing network model extremely inefficient.
Summary of the Invention
Based on this, in view of the above technical problems, it is necessary to provide an optimization method and apparatus for a computation graph, a computer device, and a storage medium that can effectively improve adaptability and execution efficiency.
In a first aspect, a method for optimizing a computation graph is provided, the method including:
obtaining a calculation graph of a computing network model, the calculation graph including a plurality of computing nodes;
inserting at least one check node in the calculation graph;
when running to each check node, obtaining the current performance margin through the current check node;
determining an optimization strategy according to the current delay performance margin; and
optimizing the computing nodes after the current check node according to the optimization strategy.
In one of the embodiments, the current performance margin includes the current delay performance margin, and determining the optimization strategy according to the current performance margin includes: if the current delay performance margin is sufficient, determining the storage optimization strategy as the optimization strategy, the storage optimization strategy being used to reduce the memory occupied by the computing nodes during calculation; if the current delay performance margin is insufficient, determining the delay optimization strategy as the optimization strategy, the delay optimization strategy being used to reduce the calculation time consumed by the computing nodes during calculation.
In one of the embodiments, the current performance margin includes the current storage performance margin, and determining the optimization strategy according to the current performance margin includes: if the current storage performance margin is sufficient, determining the delay optimization strategy as the optimization strategy, the delay optimization strategy being used to reduce the time consumed by the computing nodes during calculation; if the current storage performance margin is insufficient, determining the storage optimization strategy as the optimization strategy, the storage optimization strategy being used to reduce the memory occupied by the computing nodes during calculation.
In one of the embodiments, the storage optimization strategy includes: storing the data generated during calculation by the computing nodes after the check node in a storage space with high access latency, the storage space with high access latency including at least global memory and off-chip memory; and/or, the delay optimization strategy includes: storing the data generated during calculation by the computing nodes after the check node in a storage space with low access latency, the storage space with low access latency including at least cache space and on-chip storage.
In one of the embodiments, the delay optimization strategy further includes: obtaining the size of the data generated during calculation by the computing nodes after the check node; comparing the size of the data generated by the computing node during calculation with the size of the preset storage space; if the size of the data generated by the computing node during calculation exceeds the size of the preset storage space, splitting the computing node after the check node and storing the data generated during calculation by the split computing nodes in the storage space with low access latency; and if the size of the data generated by the computing node during calculation does not exceed the size of the preset storage space, storing the data generated during calculation by the computing nodes after the check node in the storage space with low access latency.
在其中一个实施例中,通过当前的检查节点获取当前性能裕量,包括:获取当前的检查节点之前的所有计算节点的第一总目标计算消耗时长和总实际计算消耗时长;根据第一总目标计算消耗时长和总实际计算消耗时长,确定当前延时性能裕量。In one of the embodiments, obtaining the current performance margin through the current check node includes: obtaining the first total target calculation consumption time and the total actual calculation consumption time of all computing nodes before the current check node; and determining the current delay performance margin according to the first total target calculation consumption time and the total actual calculation consumption time.
在其中一个实施例中,获取当前的检查节点之前的所有计算节点的第一总目标计算消耗时长,包括:获取当前的检查节点所在路径上的所有计算节点的第二总目标计算消耗时长;根据第二总目标计算消耗时长和预设比例,确定第一总目标计算消耗时长;预设比例为当前的检查节点之前的所有计算节点的总计算消耗时长在检查节点所在路径上的所有计算节点的总计算消耗时长中所占比例。In one of the embodiments, obtaining the first total target calculation consumption time of all computing nodes before the current check node includes: obtaining the second total target calculation consumption time of all computing nodes on the path where the current check node is located; and determining the first total target calculation consumption time according to the second total target calculation consumption time and a preset ratio; the preset ratio is the proportion of the total calculation consumption time of all computing nodes before the current check node in the total calculation consumption time of all computing nodes on the path where the check node is located.
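Combining the two embodiments above, the delay performance margin at a check node can be sketched as follows. The names and the sign convention (positive margin means sufficient) are illustrative assumptions, not fixed by the specification:

```python
def current_delay_margin(second_total_target: float,
                         preset_ratio: float,
                         total_actual: float) -> float:
    """Compute the current delay performance margin at a check node.

    The first total target time for all nodes before the check node is
    the path's total target time scaled by the preset ratio (the share
    of the path's total time attributed to those nodes); the margin is
    that target minus the time actually consumed so far.
    """
    first_total_target = second_total_target * preset_ratio
    return first_total_target - total_actual
```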
在其中一个实施例中,在计算图中插入至少一个检查节点,包括:获取计算图中最长路径上各计算节点的计算消耗时长比例;根据计算消耗时长比例,在最长路径上确定至少一个检查节点的插入位置;在至少一个检查节点的插入位置处,插入至少一个所述检查节点。In one of the embodiments, inserting at least one check node in the calculation graph includes: obtaining the calculation consumption time proportion of each computing node on the longest path in the calculation graph; determining the insertion position of at least one check node on the longest path according to the calculation consumption time proportion; and inserting at least one check node at the insertion position of the at least one check node.
在其中一个实施例中,获取计算图中最长路径上各计算节点的计算消耗时长比例,包括:获取最长路径上各计算节点的计算量;根据各计算节点的计算量获取最长路径上各计算节点的计算消耗时长;根据最长路径上各计算节点的计算消耗时长,确定最长路径上各计算节点的计算消耗时长比例。In one of the embodiments, obtaining the calculation consumption time proportion of each computing node on the longest path in the calculation graph includes: obtaining the calculation amount of each computing node on the longest path; obtaining the calculation consumption time of each computing node on the longest path according to the calculation amount of each computing node; and determining the calculation consumption time proportion of each computing node on the longest path according to the calculation consumption time of each computing node on the longest path.
在其中一个实施例中,获取计算图中最长路径上各计算节点的计算消耗时长比例,包括:构建消耗时长预估模型;采用消耗时长预估模型,获取最长路径上各计算节点的计算消耗时长;根据最长路径上各计算节点的计算消耗时长,确定最长路径上各计算节点的计算消耗时长比例。In one of the embodiments, obtaining the calculation consumption time proportion of each computing node on the longest path in the calculation graph includes: constructing a consumption time estimation model; obtaining the calculation consumption time of each computing node on the longest path by using the consumption time estimation model; and determining the calculation consumption time proportion of each computing node on the longest path according to the calculation consumption time of each computing node on the longest path.
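As a stand-in for such a consumption-time estimation model, one can estimate each node's time from its operation count and a device throughput figure, then normalize. Both the linear cost model and the names below are assumptions made for illustration:

```python
def time_proportions(node_ops, ops_per_second):
    """Estimate per-node calculation time from operation counts, then
    return each node's proportion of the longest path's total time."""
    times = [ops / ops_per_second for ops in node_ops]  # crude cost model
    total = sum(times)
    return [t / total for t in times]
```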
在其中一个实施例中,根据计算消耗时长比例,在最长路径上确定至少一个检查节点的插入位置,包括:根据计算消耗时长比例,将最长路径均分成预设数量的多个子路径;在多个子路径中选择至少一个子路径作为插入检查点的插入位置。In one of the embodiments, determining the insertion position of at least one check node on the longest path according to the calculation consumption time proportion includes: dividing the longest path into a preset number of sub-paths according to the calculation consumption time proportion; and selecting at least one of the sub-paths as the insertion position for inserting the check node.
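The sub-path division can be sketched as cutting the path into a preset number of pieces of roughly equal accumulated time, with the boundaries as candidate insertion positions. This greedy sketch is an illustration, not the claimed algorithm:

```python
def insertion_positions(node_times, num_subpaths):
    """Divide the longest path into num_subpaths pieces of roughly equal
    accumulated calculation time; return the indices of the nodes after
    which a check node could be inserted (the sub-path boundaries)."""
    total = sum(node_times)
    target = total / num_subpaths
    positions, acc, boundary = [], 0.0, target
    for i, t in enumerate(node_times):
        acc += t
        # place a boundary once the accumulated time reaches the next target
        if acc >= boundary and len(positions) < num_subpaths - 1:
            positions.append(i)
            boundary += target
    return positions
```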
在其中一个实施例中,在计算图中插入至少一个检查节点,包括:获取计算图中间隔至少一个计算节点的始端计算节点和末端计算节点;在始端计算节点和末端计算节点的中间位置插入至少一个检查节点。In one of the embodiments, inserting at least one check node in the calculation graph includes: obtaining a start computing node and an end computing node that are separated by at least one computing node in the calculation graph; and inserting at least one check node at the middle position between the start computing node and the end computing node.
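For this midpoint variant, a minimal sketch follows; representing nodes by integer indices and the "separated by at least one node" check are illustrative assumptions:

```python
def midpoint_position(start_index: int, end_index: int) -> int:
    """Return the position between a start computing node and an end
    computing node (separated by at least one computing node) at which
    a check node can be inserted."""
    if end_index - start_index < 2:
        raise ValueError("nodes must be separated by at least one computing node")
    return (start_index + end_index) // 2
```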
在其中一个实施例中,获取计算网络模型的计算图,包括:加载计算网络模型的拓扑结构和参数;对计算网络模型的拓扑结构和参数进行编译,得到计算网络模型的计算图。In one of the embodiments, obtaining the calculation graph of the calculation network model includes: loading the topology structure and parameters of the calculation network model; compiling the topology structure and parameters of the calculation network model to obtain the calculation graph of the calculation network model.
第二方面,一种计算图的优化装置,所述装置包括:In a second aspect, an optimization device for a calculation graph, the device includes:
第一获取模块,用于获取计算网络模型的计算图;计算图中包括多个计算节点;The first obtaining module is used to obtain a calculation graph of a calculation network model; the calculation graph includes a plurality of calculation nodes;
插入模块,用于在计算图中插入至少一个检查节点;Insert module, used to insert at least one check node in the calculation graph;
第二获取模块,用于当运行至每一个检查节点时,通过当前的检查节点获取当前性能裕量;The second obtaining module is used to obtain the current performance margin through the current check node when running to each check node;
确定模块,用于根据当前性能裕量,确定优化策略;The determining module is used to determine the optimization strategy according to the current performance margin;
优化模块,用于根据优化策略对当前的检查节点之后的计算节点进行优化。The optimization module is used to optimize the computing nodes after the current check node according to the optimization strategy.
第三方面,一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现第一方面任一实施例所述的计算图的优化方法。In a third aspect, a computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the calculation graph optimization method described in any embodiment of the first aspect when the processor executes the computer program.
第四方面,一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现第一方面任一实施例所述的计算图的优化方法。In a fourth aspect, a computer-readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the method for optimizing the calculation graph described in any one of the embodiments of the first aspect.
本申请提供的一种计算图的优化方法、装置、计算机设备和存储介质，通过获取包括多个计算节点的计算网络模型的计算图，再在计算图中插入至少一个检查节点，并当运行至每一个检查节点时，通过当前的检查节点获取当前性能裕量，然后根据当前性能裕量，确定优化策略，以及根据该优化策略对当前的检查节点之后的计算节点所需消耗资源进行优化。上述优化方法通过插入检查节点获取计算机设备在运行到各检查节点时的当前性能裕量，然后根据当前性能裕量，选择符合计算机设备实际运行情况的优化策略对检查节点之后的计算节点所需消耗资源进行优化，使计算机设备运行至上述计算节点的过程中，可以动态调整计算图中各计算节点计算时的资源使用情况，以满足用户针对该计算图提出的性能指标需求，以及提高计算机设备上的资源利用率。The method, apparatus, computer device, and storage medium for optimizing a calculation graph provided in this application obtain a calculation graph of a calculation network model including multiple computing nodes, insert at least one check node in the calculation graph, obtain the current performance margin through the current check node when running to each check node, then determine an optimization strategy according to the current performance margin, and optimize the resources consumed by the computing nodes after the current check node according to the optimization strategy. The above optimization method obtains, by inserting check nodes, the current performance margin of the computer device when it runs to each check node, and then, according to the current performance margin, selects an optimization strategy matching the actual operating condition of the computer device to optimize the resources consumed by the computing nodes after the check node, so that while the computer device runs to the above computing nodes, the resource usage of each computing node in the calculation graph during calculation can be dynamically adjusted, to meet the performance index requirements raised by the user for the calculation graph and to improve the resource utilization of the computer device.
上述说明仅是本发明技术方案的概述,为了能够更清楚了解本发明的技术手段,而可依照说明书的内容予以实施,并且为了让本发明的上述和其它目的、特征和优点能够更明显易懂,以下特举本发明的具体实施方式。The above description is only an overview of the technical solution of the present invention. In order to understand the technical means of the present invention more clearly, it can be implemented in accordance with the content of the specification, and in order to make the above and other objectives, features and advantages of the present invention more obvious and understandable. In the following, specific embodiments of the present invention are specifically cited.
附图说明Description of the drawings
为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍，显而易见地，下面描述中的附图是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly describe the technical solutions in the embodiments of the present invention or the prior art, the following briefly introduces the drawings needed in the description of the embodiments or the prior art. Obviously, the drawings in the following description show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
图1为一个实施例提供的一种计算机设备的内部结构示意图;FIG. 1 is a schematic diagram of the internal structure of a computer device provided by an embodiment;
图2为一个实施例提供的一种计算图的优化方法的流程图;FIG. 2 is a flowchart of a method for optimizing a calculation graph according to an embodiment;
图2A为一个实施例提供的一种计算图的优化方法的流程图;2A is a flowchart of a method for optimizing a calculation graph provided by an embodiment;
图3为图2实施例中S103的一种实现方式的流程图;FIG. 3 is a flowchart of an implementation manner of S103 in the embodiment of FIG. 2;
图4为图3实施例中S201的一种实现方式的流程图;FIG. 4 is a flowchart of an implementation manner of S201 in the embodiment of FIG. 3;
图5为图2实施例中S102的一种实现方式的流程图;FIG. 5 is a flowchart of an implementation manner of S102 in the embodiment of FIG. 2;
图6为图5实施例中S401的一种实现方式的流程图;FIG. 6 is a flowchart of an implementation manner of S401 in the embodiment of FIG. 5;
图7为图5实施例中S401的另一种实现方式的流程图;FIG. 7 is a flowchart of another implementation manner of S401 in the embodiment of FIG. 5;
图8为图5实施例中S402的另一种实现方式的流程图;FIG. 8 is a flowchart of another implementation manner of S402 in the embodiment of FIG. 5;
图8A为一个实施例提供的一种计算图的结构示意图;FIG. 8A is a schematic structural diagram of a calculation graph provided by an embodiment;
图8B为一个实施例提供的一种计算图的结构示意图;FIG. 8B is a schematic structural diagram of a calculation graph provided by an embodiment;
图9为图2实施例中S102的另一种实现方式的流程图;FIG. 9 is a flowchart of another implementation manner of S102 in the embodiment of FIG. 2;
图9A为一个实施例提供的一种计算图的结构示意图;FIG. 9A is a schematic structural diagram of a calculation graph provided by an embodiment;
图10为图2实施例中S101的一种实现方式的流程图;Fig. 10 is a flowchart of an implementation manner of S101 in the embodiment of Fig. 2;
图11为一个实施例提供的一种计算图的优化方法的流程图;FIG. 11 is a flowchart of a method for optimizing a calculation graph provided by an embodiment;
图12为一个实施例提供的一种计算图的优化装置的结构示意图;FIG. 12 is a schematic structural diagram of an optimization device for a calculation graph provided by an embodiment; FIG.
图13示意性地示出了用于执行根据本发明的方法的计算处理设备的框图;Fig. 13 schematically shows a block diagram of a computing processing device for executing the method according to the present invention;
图14示意性地示出了用于保持或者携带实现根据本发明的方法的程序代码的存储单元。Fig. 14 schematically shows a storage unit for holding or carrying program codes for implementing the method according to the present invention.
具体实施例Specific embodiment
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处描述的具体实施例仅用以解释本申请,并不用于限定本申请。显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。In order to make the purpose, technical solutions, and advantages of this application clearer and clearer, the following further describes the application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application. Obviously, the described embodiments are part of the embodiments of the present invention, rather than all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
本申请提供的计算图的优化方法，可以应用于如图1所示的计算机设备中，该计算机设备可以是服务器，也可以是终端，其内部结构图可以如图1所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口、显示屏和输入装置。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种计算图的优化方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏，该计算机设备的输入装置可以是显示屏上覆盖的触摸层，也可以是计算机设备外壳上设置的按键、轨迹球或触控板，还可以是外接的键盘、触控板或鼠标等。The method for optimizing a calculation graph provided in this application can be applied to the computer device shown in FIG. 1. The computer device may be a server or a terminal, and its internal structure diagram may be as shown in FIG. 1. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for optimizing a calculation graph. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device can be a touch layer covering the display screen, a button, a trackball, or a touchpad set on the housing of the computer device, or an external keyboard, touchpad, or mouse.
本领域技术人员可以理解，图1中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。Those skilled in the art can understand that the structure shown in FIG. 1 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. The specific computer device may include more or fewer components than shown in the figure, or combine some components, or have a different arrangement of components.
下面将通过实施例并结合附图具体地对本申请的技术方案以及本申请的技术方案如何解决上述技术问题进行详细说明。下面这几个具体的实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例中不再赘述。Hereinafter, the technical solution of the present application and how the technical solution of the present application solves the above-mentioned technical problems will be described in detail through the embodiments and the accompanying drawings. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
图2为一个实施例提供的一种计算图的优化方法的流程图，该方法的执行主体为图1中的计算机设备，该方法涉及的是计算机设备在运行计算网络模型的计算图时对该计算图进行优化的具体过程。如图2所示，该方法具体包括以下步骤：Fig. 2 is a flowchart of a method for optimizing a calculation graph provided by an embodiment. The execution body of the method is the computer device in Fig. 1, and the method relates to the specific process in which the computer device optimizes the calculation graph while running the calculation graph of the calculation network model. As shown in Fig. 2, the method specifically includes the following steps:
S101、获取计算网络模型的计算图;计算图中包括多个计算节点。S101. Obtain a calculation graph of a calculation network model; the calculation graph includes a plurality of calculation nodes.
其中,计算网络模型可以由计算机设备预先根据实际应用需求构建,其具体可以是具备各种功能应用的计算网络模型,例如,神经网络模型、机器学习网络模型、智能算法网络模型等。计算图是一种描述计算方法的“语言”,具体由多个计算节点组成,多个存在依赖关系的计算节点之间相互连接。计算节点可以包括执行某种计算功能的代码,用于使计算机设备运行至计算节点时,可以执行计算网络模型中对应的计算任务。Among them, the computing network model may be constructed by computer equipment in advance according to actual application requirements, and it may specifically be a computing network model with various functional applications, such as a neural network model, a machine learning network model, and an intelligent algorithm network model. Computational graphs are a kind of "language" for describing calculation methods, which are specifically composed of multiple computing nodes, and multiple computing nodes with dependencies are connected to each other. The computing node may include code for executing a certain computing function, so that when the computer device runs to the computing node, it can execute the corresponding computing task in the computing network model.
本实施例中，计算机设备可以通过编译器编译预设的计算网络模型，生成编译后的计算图。可选地，计算机设备也可以通过其它方法直接获取到经过编译后的计算网络模型的计算图，对此本实施例不做限定。可选地，当编译器编译计算网络模型之前，计算机设备还可以预先根据实际应用需求构建该计算网络模型，然后再基于构建好的计算网络模型进行编译，以便之后运行时使用。可选地，计算机设备也可以直接获取预编译的计算网络模型，再基于获取到的计算网络模型进行编译，以便之后运行时使用，对此本实施例不做限定。In this embodiment, the computer device can compile a preset calculation network model through a compiler to generate a compiled calculation graph. Optionally, the computer device may also directly obtain the calculation graph of the compiled calculation network model through other methods, which is not limited in this embodiment. Optionally, before the compiler compiles the calculation network model, the computer device may also construct the calculation network model in advance according to actual application requirements, and then compile the constructed calculation network model for later use at runtime. Optionally, the computer device may also directly obtain a pre-compiled calculation network model, and then compile based on the obtained calculation network model for later use at runtime, which is not limited in this embodiment.
S102、在计算图中插入至少一个检查节点。S102. Insert at least one check node in the calculation graph.
其中，检查节点可以包括执行某种计算或测试功能的代码，用于使计算机设备运行至检查节点时，可以执行相应的计算或测试任务，该检查节点可以由计算设备预先配置。本实施例中，当计算机设备获取到计算网络模型的计算图，以及需要在之后运行该计算图的过程中优化该计算图时，可以进一步的在该计算图中插入至少一个检查节点，以使计算图被运行至插入的检查节点时，计算机设备可以检测计算机设备在当前时刻的资源消耗情况，从而根据当前资源消耗情况，动态调整之后计算节点的资源利用方式，使计算图在被执行的过程中，消耗的资源情况始终能够满足用户提出的性能指标或达到最优，充分利用了计算机设备上的资源。A check node may include code for executing a certain calculation or test function, so that when the computer device runs to the check node, the corresponding calculation or test task can be executed; the check node may be pre-configured by the computing device. In this embodiment, when the computer device obtains the calculation graph of the calculation network model and needs to optimize the calculation graph while running it later, at least one check node may be further inserted into the calculation graph, so that when the calculation graph is run to an inserted check node, the computer device can detect its resource consumption at the current moment, and thus dynamically adjust the resource utilization mode of the subsequent computing nodes according to the current resource consumption, so that throughout the execution of the calculation graph the consumed resources can always meet the performance indicators proposed by the user or reach the optimum, making full use of the resources on the computer device.
S103、当运行至每一个检查节点时,通过当前的检查节点获取当前性能裕量。S103: When running to each check node, obtain the current performance margin through the current check node.
其中，当前性能裕量表示计算设备运行至当前的检查节点时实际消耗的计算设备资源与用户期望性能指标所示的计算设备资源之间的裕量，该当前性能裕量可以是表示延时性能指标的性能裕量，也可以是表示存储性能指标的性能裕量，或是表示计算机设备计算时所消耗的其它类型性能指标的性能裕量。具体的，上述表示延时性能指标的性能裕量指计算设备运行至当前的检查节点时实际消耗的计算消耗时长和用户期望的计算消耗时长之间的裕量；上述表示存储指标的性能裕量指计算设备运行至当前的检查节点时实际消耗的内存大小和用户期望消耗的内存大小之间的裕量。在实际应用中，若实际消耗的计算设备资源大于或等于用户期望性能指标所示的计算设备资源时，表示当前性能裕量不充足，若实际消耗的计算设备资源小于用户期望性能指标所示的计算设备资源时，表示当前性能裕量充足。本实施例中，当计算机设备运行至每一个当前的检查节点时，计算机设备可以通过执行该检查节点上的代码获取到计算设备的当前性能裕量，以便之后计算机设备根据该当前性能裕量，对检查点之后的计算节点进行不同方法的优化，从而使优化后的计算节点在被执行时可以充分利用计算设备的资源。The current performance margin represents the margin between the computing device resources actually consumed when the computing device runs to the current check node and the computing device resources indicated by the performance index expected by the user. The current performance margin may be a performance margin representing a delay performance index, a performance margin representing a storage performance index, or a performance margin representing another type of performance index consumed by the computer device during calculation. Specifically, the performance margin representing the delay performance index refers to the margin between the calculation time actually consumed when the computing device runs to the current check node and the calculation time expected by the user; the performance margin representing the storage index refers to the margin between the memory actually consumed when the computing device runs to the current check node and the memory the user expects to consume. In practical applications, if the actually consumed computing device resources are greater than or equal to the computing device resources indicated by the user's expected performance index, the current performance margin is insufficient; if the actually consumed computing device resources are less than the computing device resources indicated by the user's expected performance index, the current performance margin is sufficient. In this embodiment, when the computer device runs to each current check node, the computer device can obtain the current performance margin of the computing device by executing the code on the check node, so that the computer device can then, according to the current performance margin, optimize the computing nodes after the checkpoint in different ways, so that the optimized computing nodes can make full use of the resources of the computing device when they are executed.
S104、根据当前性能裕量,确定优化策略。S104: Determine an optimization strategy according to the current performance margin.
其中，优化策略用于对检查节点之后的计算节点所需消耗的资源进行优化，使之后的计算节点在被执行时所消耗的资源能够满足用户需求或与计算设备的性能指标匹配。本实施例中，当计算机设备通过当前的检查节点获取当前性能裕量时，可以通过判断当前性能裕量是否充足，从而根据判断结果选择不同的优化策略，以动态优化计算图中检查节点之后的计算节点。例如，若表示延时性能指标的当前性能裕量充足，则可以采用减少内存的优化策略进行编译和运行，以降低计算机设备存储时的性能消耗，使计算机设备各方面的性能指标均能够满足用户需求；若表示延时性能指标的当前延时性能裕量不充足，则可以采用减少访问延时高的访存操作的优化策略进行编译和运行，以降低计算机设备计算时的延时性能消耗，使计算机设备各方面的性能指标均能够满足用户需求。相应的，若表示存储性能指标的当前性能裕量充足，则可以采用减少访问延时高的访存操作的优化策略进行编译和运行，以降低计算机设备计算时的延时性能消耗；若表示存储性能指标的当前性能裕量不充足，则可以采用减少内存的优化策略进行编译和运行，以降低计算机设备存储时的性能消耗。The optimization strategy is used to optimize the resources consumed by the computing nodes after the check node, so that the resources consumed by the subsequent computing nodes when executed can meet user needs or match the performance indicators of the computing device. In this embodiment, when the computer device obtains the current performance margin through the current check node, it can judge whether the current performance margin is sufficient and select different optimization strategies according to the judgment result, so as to dynamically optimize the computing nodes after the check node in the calculation graph. For example, if the current performance margin representing the delay performance index is sufficient, an optimization strategy that reduces memory can be used for compiling and running, so as to reduce the performance consumption of the computer device during storage, so that the performance indicators of all aspects of the computer device can meet user needs; if the current delay performance margin representing the delay performance index is insufficient, an optimization strategy that reduces memory access operations with high access latency can be used for compiling and running, so as to reduce the delay performance consumption of the computer device during calculation, so that the performance indicators of all aspects of the computer device can meet user needs. Correspondingly, if the current performance margin representing the storage performance index is sufficient, an optimization strategy that reduces memory access operations with high access latency can be used for compiling and running, so as to reduce the delay performance consumption of the computer device during calculation; if the current performance margin representing the storage performance index is insufficient, an optimization strategy that reduces memory can be used for compiling and running, so as to reduce the performance consumption of the computer device during storage.
S105、根据优化策略对当前的检查节点之后的计算节点进行优化。S105: Optimize computing nodes after the current check node according to the optimization strategy.
本实施例中，当计算机设备根据当前性能裕量确定优化策略后，即可根据该优化策略对当前的检查节点之后的计算节点上的参数或变量进行优化，例如，计算机设备可以改变计算节点上参数或变量的存储方式，从而改变该计算节点读取或写入数据时的时间长度，进而改变计算机设备运行至该计算节点时的计算时间，以改善计算设备的延时性能，完成对该计算节点的优化。又例如，计算机设备还可以拆分计算节点，使一个计算节点所消耗的资源分成多个计算节点所消耗的资源，以减轻计算设备运行至各计算节点时的资源消耗负担，完成对该计算节点的优化。In this embodiment, after the computer device determines the optimization strategy according to the current performance margin, the parameters or variables on the computing nodes after the current check node can be optimized according to the optimization strategy. For example, the computer device can change the storage mode of the parameters or variables on a computing node, thereby changing the length of time the computing node takes to read or write data, and in turn changing the computing time when the computing device runs to the computing node, so as to improve the delay performance of the computing device and complete the optimization of the computing node. For another example, the computer device can also split a computing node, so that the resources consumed by one computing node are divided among multiple computing nodes, so as to reduce the resource consumption burden when the computing device runs to each computing node and complete the optimization of the computing node.
本实施例提供的计算图的优化方法，通过获取包括多个计算节点的计算网络模型的计算图，再在计算图中插入至少一个检查节点，并当运行至每一个检查节点时，通过当前的检查节点获取当前性能裕量，然后根据当前性能裕量，确定优化策略，以及根据该优化策略对当前的检查节点之后的计算节点所需消耗资源进行优化。上述优化方法通过插入检查节点获取计算机设备在运行到各检查节点时的当前性能裕量，然后根据当前性能裕量，选择符合计算机设备实际运行情况的优化策略对检查节点之后的计算节点所需消耗资源进行优化，使计算机设备运行至上述计算节点的过程中，可以动态调整计算图中各计算节点计算时的资源使用情况，以满足用户针对该计算图提出的性能指标需求，以及提高计算机设备上的资源利用率。The method for optimizing a calculation graph provided in this embodiment obtains a calculation graph of a calculation network model including multiple computing nodes, inserts at least one check node in the calculation graph, obtains the current performance margin through the current check node when running to each check node, then determines an optimization strategy according to the current performance margin, and optimizes the resources consumed by the computing nodes after the current check node according to the optimization strategy. The above optimization method obtains, by inserting check nodes, the current performance margin of the computer device when it runs to each check node, and then, according to the current performance margin, selects an optimization strategy matching the actual operating condition of the computer device to optimize the resources consumed by the computing nodes after the check node, so that while the computer device runs to the above computing nodes, the resource usage of each computing node in the calculation graph during calculation can be dynamically adjusted, to meet the performance index requirements raised by the user for the calculation graph and to improve the resource utilization of the computer device.
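Putting steps S101–S105 together, the runtime loop might look like the following sketch. The callbacks, the fixed check-node spacing, and the binary sufficient/insufficient decision are all illustrative assumptions, not the claimed implementation:

```python
def run_with_check_nodes(nodes, check_every, get_margin, optimize):
    """Execute the graph's computing nodes in order; at each inserted
    check node, read the current performance margin (S103), pick a
    strategy (S104), and re-optimize the remaining nodes (S105)."""
    applied = []
    for i, node in enumerate(nodes):
        node()  # S101/S102 assumed done: graph obtained, check nodes inserted
        is_check_node = (i + 1) % check_every == 0 and i + 1 < len(nodes)
        if is_check_node:
            margin = get_margin()  # S103: query the current performance margin
            strategy = "delay_optimization" if margin > 0 else "storage_optimization"
            optimize(nodes[i + 1:], strategy)  # S105: optimize remaining nodes
            applied.append(strategy)
    return applied
```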
在一个实施例中，上述当前性能裕量包括表示延时性能指标的性能裕量，即当前延时性能裕量，在该应用场景下，本申请提供了上述S104的一种实现方式，该方法包括：若当前延时性能裕量充足，则将存储优化策略确定为优化策略；存储优化策略用于减少计算节点在计算时占用的内存。In an embodiment, the foregoing current performance margin includes a performance margin representing a delay performance index, that is, the current delay performance margin. In this application scenario, this application provides an implementation manner of the foregoing S104, and the method includes: if the current delay performance margin is sufficient, determining the storage optimization strategy as the optimization strategy; the storage optimization strategy is used to reduce the memory occupied by the computing node during calculation.
本实施例涉及计算机设备获取到当前延时性能裕量充足的应用场景，说明计算机设备此时所需的计算消耗时长比较充裕，还能够满足后期计算节点的计算需求，在该种应用下，计算机设备可以不关注计算节点在计算时的计算消耗时长问题，可以重点关注计算节点在计算时占用的内存情况，以优化计算设备上的内存资源，避免计算机设备内存被占用过大，从而影响计算机设备的计算性能，进而影响计算图被执行的计算速度。This embodiment relates to an application scenario in which the computer device obtains a sufficient current delay performance margin, indicating that the calculation time available to the computer device at this time is relatively ample and can also meet the calculation requirements of later computing nodes. In this application, the computer device does not need to pay attention to the calculation time consumed by the computing nodes during calculation, and can instead focus on the memory occupied by the computing nodes during calculation, so as to optimize the memory resources on the computing device and prevent the memory of the computer device from being excessively occupied, which would affect the computing performance of the computer device and in turn the calculation speed at which the calculation graph is executed.
可选地,上述存储优化策略具体可以包括:将检查节点之后的计算节点在计算时产生的数据存储至高访问延迟的存储空间;高访问延迟的存储空间至少包括全局内存和片外存储器。Optionally, the aforementioned storage optimization strategy may specifically include: storing data generated by the computing node after the check node during calculation in a storage space with high access latency; the storage space with high access latency includes at least global memory and off-chip memory.
上述计算节点在计算时产生的数据可以包括计算中所需的中间结果和临时变量。当计算机设备根据存储优化策略对检查节点之后的计算节点进行优化时，具体可以将检查节点之后的计算节点在计算时产生的数据存储至高访问延迟的存储空间，例如，GPU的全局内存或TPU的片外存储等，以减少计算机设备内存的占用率，从而提高计算机设备的计算速度。The data generated by the above computing nodes during calculation may include intermediate results and temporary variables required in the calculation. When the computer device optimizes the computing nodes after the check node according to the storage optimization strategy, it can specifically store the data generated during calculation by the computing nodes after the check node in a storage space with high access latency, for example, the global memory of a GPU or the off-chip storage of a TPU, so as to reduce the memory occupancy of the computer device and thereby increase the computing speed of the computer device.
可选地,基于上述实施例,若当前延时性能裕量不充足,则将延时优化策略确定为优化策略;延时优化策略用于减少计算节点在计算时的计算消耗时长。Optionally, based on the foregoing embodiment, if the current delay performance margin is insufficient, the delay optimization strategy is determined as the optimization strategy; the delay optimization strategy is used to reduce the computational time spent by the computing node during calculation.
本实施例涉及计算机设备获取到当前延时性能裕量不充足的应用场景，说明计算机设备此时所需的计算消耗时长比较紧张，可能不能够满足后期计算节点的计算需求，在该种应用下，计算机设备需要重点关注计算节点在计算时的计算消耗时长问题，可以不关注计算节点在计算时占用的内存问题，以优化计算设备的延时性能，避免计算节点在计算时消耗的时长过长，从而影响计算图被执行的计算速度。This embodiment relates to an application scenario in which the computer device obtains an insufficient current delay performance margin, indicating that the calculation time available to the computer device at this time is relatively tight and may not be able to meet the calculation requirements of later computing nodes. In this application, the computer device needs to focus on the calculation time consumed by the computing nodes during calculation, and may not pay attention to the memory occupied by the computing nodes during calculation, so as to optimize the delay performance of the computing device and prevent the computing nodes from consuming too much time during calculation, which would affect the calculation speed at which the calculation graph is executed.
可选地,上述延时优化策略具体可以包括:将检查节点之后的计算节点在计算时产生的数据存储至低访问延迟的存储空间;低访问延迟的存储空间至少包括缓存空间和片内存储。Optionally, the above-mentioned delay optimization strategy may specifically include: storing data generated by the computing node after the check node during calculation in a low-access-latency storage space; the low-access-latency storage space includes at least cache space and on-chip storage.
本实施例中,当计算机设备根据延时优化策略对检查节点之后的计算节点进行优化时,具体可以将检查节点之后的计算节点在计算时产生的数据存储至低访问延迟的存储空间,例如,计算机设备的内存或缓存等,以减少计算节点在计算时访问存储空间的时间,从而提高计算节点的计算速度,进而提高计算机设备的计算速度。In this embodiment, when the computer device optimizes the computing node after the check node according to the delay optimization strategy, it can specifically store the data generated by the computing node after the check node in the storage space with low access latency, for example, The memory or cache of the computer equipment, etc., to reduce the time for the computing node to access the storage space during calculation, thereby increasing the computing speed of the computing node, and then increasing the computing speed of the computer equipment.
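The mapping between the two strategies and the two classes of storage space can be sketched as follows. This is a minimal illustration only: the tier names and the `tier_for_strategy` helper are hypothetical, not part of the patent.

```python
# Hypothetical storage tiers, ordered from low to high access latency;
# the names are illustrative, not from the patent.
LOW_LATENCY_TIERS = ("cache", "on_chip")
HIGH_LATENCY_TIERS = ("gpu_global_memory", "tpu_off_chip")

def tier_for_strategy(strategy: str) -> str:
    """Map an optimization strategy to the storage tier that receives the
    intermediate results and temporary variables of the computing nodes
    after the check node."""
    if strategy == "delay":
        # Delay optimization: keep data close to cut access time.
        return LOW_LATENCY_TIERS[0]
    if strategy == "storage":
        # Storage optimization: push data off-chip to free device memory.
        return HIGH_LATENCY_TIERS[0]
    raise ValueError(f"unknown strategy: {strategy!r}")
```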
In one embodiment, the above current performance margin includes a performance margin representing a storage performance indicator, i.e., a current storage performance margin. For this application scenario, the present application provides an implementation of the above S104, and the method includes: if the current storage performance margin is sufficient, determining the delay optimization strategy as the optimization strategy; the delay optimization strategy is used to reduce the time consumed by the computing nodes during computation.
This embodiment relates to an application scenario in which the computer device determines that the current storage performance margin is sufficient, indicating that the memory resources of the computer device are relatively abundant and can satisfy the computation requirements of subsequent computing nodes. In this scenario, the computer device needs to focus on the computation time consumed by the computing nodes and may disregard the memory they occupy, so as to optimize the delay performance of the device and prevent the computing nodes from consuming too much time during computation, which would reduce the speed at which the computation graph is executed.
Optionally, based on the foregoing embodiment, if the current storage performance margin is insufficient, the storage optimization strategy is determined as the optimization strategy; the storage optimization strategy is used to reduce the memory occupied by the computing nodes during computation.
This embodiment relates to an application scenario in which the computer device determines that the current storage performance margin is insufficient, indicating that the memory resources of the computer device are tight and may not satisfy the computation requirements of subsequent computing nodes. In this scenario, the computer device needs to focus on the memory occupied by the computing nodes during computation and may disregard the time they consume, so as to optimize the storage performance of the device and prevent the computing nodes from occupying too much memory during computation, which would reduce the speed at which the computation graph is executed.
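The four cases described across these embodiments form a symmetric decision rule, which can be sketched as follows (the function name and string labels are illustrative assumptions):

```python
def choose_strategy(margin_kind: str, sufficient: bool) -> str:
    """Strategy selection as described across the embodiments:

    - delay margin sufficient     -> storage optimization (time to spare,
                                     so trade speed for memory)
    - delay margin insufficient   -> delay optimization
    - storage margin sufficient   -> delay optimization (memory to spare,
                                     so trade memory for speed)
    - storage margin insufficient -> storage optimization
    """
    if margin_kind == "delay":
        return "storage" if sufficient else "delay"
    if margin_kind == "storage":
        return "delay" if sufficient else "storage"
    raise ValueError(f"unknown margin kind: {margin_kind!r}")
```

In both cases the device economizes the resource whose margin is scarce and spends the one whose margin is ample.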
Optionally, in practical applications, based on the above delay optimization strategy, as shown in FIG. 2A, the delay optimization strategy may further include:
S1041. Obtain the size of the data generated during computation by the computing nodes after the current check node.
This embodiment applies to the scenario in which the memory or cache on the computer device cannot satisfy the memory or cache required by the computing nodes for computation. In this scenario, when the optimization strategy determined by the computer device is the delay optimization strategy, the device may first obtain the size of the data generated during computation by the computing nodes after the current check node, so as to subsequently estimate, based on that size, whether the memory or cache on the computer device satisfies the computation requirements.
S1042. Compare the size of the data generated by the computing nodes during computation with the size of a preset storage space; if the size of the data exceeds the size of the preset storage space, perform step S1043; if it does not, perform step S1044.
In this embodiment, after the computer device obtains the size of the data generated by the computing nodes during computation, it may further compare that size with the size of the preset storage space to obtain a comparison result, and then select different delay optimization strategies according to the comparison result to optimize the computing nodes after the current check node. The preset storage space may be a low-access-latency storage space, for example, the memory and/or cache space of the computer device. The comparison result is one of two cases: the size of the data generated during computation exceeds the size of the preset storage space, indicating that the existing storage space on the computer device cannot satisfy the computation requirements of the computing nodes; or the size of the data does not exceed the size of the preset storage space, indicating that the existing storage space on the computer device is relatively ample and can satisfy the computation requirements of the computing nodes.
S1043. Split the computing nodes after the current check node, and store the data generated during computation by the split computing nodes in a low-access-latency storage space.
This embodiment relates to the scenario in which the comparison result is that the size of the data generated by the computing nodes during computation exceeds the size of the preset storage space. In this scenario, the computer device may split the computing nodes after the check node and store the data generated during computation by the split computing nodes in a low-access-latency storage space, i.e., in the memory and/or cache of the computer device. Because the computing nodes have been split, the existing storage space on the computer device can satisfy the preset storage space required by each split computing node during computation.
S1044. Store the data generated during computation by the computing nodes after the current check node in a low-access-latency storage space.
This embodiment relates to the scenario in which the comparison result is that the size of the data generated by the computing nodes during computation does not exceed the size of the preset storage space. In this scenario, the computer device may directly store the data generated during computation by the computing nodes after the check node in a low-access-latency storage space. This step is the same as the step described above for the delay optimization strategy; for details, refer to the foregoing description, which is not repeated here.
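Steps S1041 to S1044 can be sketched as a small placement planner. The even-split rule used when the data does not fit is an illustrative assumption; the patent only requires that each split node fit the preset storage space.

```python
def plan_low_latency_placement(data_size: int, preset_space: int) -> list:
    """Return the chunk sizes to place in low-access-latency storage.

    If the node's intermediate data fits the preset space, place it as one
    chunk (S1044); otherwise split the node into pieces no larger than the
    preset space (S1043). Splitting as evenly as possible is an assumption.
    """
    if data_size <= preset_space:
        return [data_size]
    # Number of pieces needed so that each piece fits the preset space.
    n = -(-data_size // preset_space)  # ceiling division
    base, rem = divmod(data_size, n)
    return [base + (1 if i < rem else 0) for i in range(n)]
```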
FIG. 3 is a flowchart of an implementation of S103 in the embodiment of FIG. 2. As shown in FIG. 3, "obtaining the current performance margin through the current check node" in S103 includes:
S201. Obtain a first total target computation time and a total actual computation time of all computing nodes before the current check node.
Here, the first total target computation time represents the cumulative computation time the user expects all computing nodes before the current check node to consume during computation. The total actual computation time represents the actual cumulative computation time consumed by all computing nodes before the current check node when the computer device has run to that check node. When the computer device needs to obtain the current performance margin, it may first obtain the first total target computation time and the total actual computation time of all computing nodes before the current check node, so as to subsequently determine the current performance margin from them.
S202. Determine the current performance margin according to the first total target computation time and the total actual computation time.
After the computer device obtains the first total target computation time and the total actual computation time, it may directly compute the difference between them to obtain the current performance margin; optionally, the first total target computation time and the total actual computation time may also each be weighted before the difference is computed to obtain the current performance margin.
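S202 can be sketched in a few lines; the weight parameters model the optional weighted variant, with weights of 1.0 reducing to the plain difference described first.

```python
def current_margin(target_total: float, actual_total: float,
                   w_target: float = 1.0, w_actual: float = 1.0) -> float:
    """S202 sketch: margin = (weighted) first total target computation time
    minus (weighted) total actual computation time. A positive margin means
    execution is ahead of the user's target."""
    return w_target * target_total - w_actual * actual_total
```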
Optionally, as shown in FIG. 4, a method of "obtaining the first total target computation time of all computing nodes before the current check node" in S201 may specifically include:
S301. Obtain a second total target computation time of all computing nodes on the path where the current check node is located.
Here, the second total target computation time represents the cumulative computation time the user expects all computing nodes on the path where the current check node is located to consume during computation. When the computer device needs to obtain the first total target computation time of all computing nodes before the current check node, it may first obtain the second total target computation time of all computing nodes on the path where the current check node is located, and then determine the first total target computation time from it.
S302. Determine the first total target computation time according to the second total target computation time and a preset ratio; the preset ratio is the proportion of the total computation time of all computing nodes before the current check node to the total computation time of all computing nodes on the path where the check node is located.
Here, the preset ratio may be obtained in advance by the computer device, and may be obtained in various ways. For example, the computer device may pre-compute the computation amount of each computing node in the computation graph, estimate the computation time of each computing node based on its computation amount, and finally determine the preset ratio from the total computation time of the computing nodes before the current check node and the total computation time of all computing nodes. As another example, the computer device may use an existing computation-time estimation model to estimate the computation time of each computing node, and correspondingly determine the preset ratio from the total computation time of the computing nodes before the current check node and the total computation time of all computing nodes. Optionally, the preset ratio may also be determined in other ways, which is not limited in this embodiment. In this embodiment, after the computer device obtains the preset computation time of all computing nodes on the path where the current check node is located, i.e., the second total target computation time, and the preset ratio, it may multiply the second total target computation time by the preset ratio to obtain the first total target computation time.
For example, if the second total target computation time is 10 hours and the preset ratio is 1/2, the corresponding first total target computation time is 5 hours. Optionally, the computer device may also weight the second total target computation time and the preset ratio before multiplying them to obtain the first total target computation time.
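The S302 scaling, including the 10-hour example above, can be sketched as:

```python
from fractions import Fraction

def first_total_target(second_total: float, preset_ratio: Fraction) -> float:
    """S302: scale the whole-path target time (second total target) by the
    preset ratio to obtain the target time for the computing nodes before
    the check node (first total target)."""
    return second_total * float(preset_ratio)
```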
FIG. 5 is a flowchart of an implementation of S102 in the embodiment of FIG. 2. As shown in FIG. 5, "inserting at least one check node into the computation graph" in S102 includes:
S401. Obtain the computation time ratio of the computing nodes on the longest path in the computation graph.
In this embodiment, the computer device may first determine the longest path according to the layout of the computing nodes in the computation graph, then obtain the computation time of each computing node on that longest path, and then compute the ratio of those computation times to obtain the computation time ratio of the computing nodes on the longest path.
S402. Determine the insertion position of at least one check node on the longest path according to the computation time ratio.
In this embodiment, in order to balance the computation time of the computing nodes on the longest path, when inserting a check node the computer device may determine the insertion position of at least one check node on the longest path according to the computation time ratio, such that the total computation time of all computing nodes before the check node inserted at that position equals the total computation time of all computing nodes after it. Of course, the two totals need not be exactly equal; it suffices that the difference between the total computation time of all computing nodes before the check node and that of all computing nodes after it falls within a preset time range.
S403. Insert at least one check node at the insertion position of the at least one check node.
Once the computer device has determined the insertion position of at least one check node on the longest path, it can insert at least one check node at that position, so that the computing nodes after the check node can subsequently be optimized through it.
Optionally, based on the foregoing embodiment, the present application provides a specific way for the computer device to obtain the computation time ratio. As shown in FIG. 6, a method for the above S401, "obtaining the computation time ratio of the computing nodes on the longest path in the computation graph", may specifically include:
S501. Obtain the computation amount of each computing node on the longest path.
When the computer device obtains the corresponding computation graph by compiling the computational network model, it can obtain the computation amount of each computing node in the computation graph according to information such as the computation steps contained in each computing node. Therefore, in this embodiment, the computer device may first determine the longest path in the computation graph, then determine the computing nodes contained on that path, and then obtain the computation amount of each computing node on the longest path according to information such as the computation steps each of them contains.
S502. Obtain the computation time of each computing node on the longest path according to the computation amount of each computing node.
After the computer device obtains the computation amount of each computing node on the longest path, it may further estimate the computation time of each computing node according to the magnitude of its computation amount, thereby obtaining the computation time of each computing node on the longest path. The larger the computation amount of a computing node, the longer its estimated computation time; the smaller the computation amount, the shorter its estimated computation time.
S503. Determine the computation time ratio of the computing nodes on the longest path according to the computation time of each computing node on the longest path.
After the computer device obtains the computation time of each computing node on the longest path, it can compute the ratio of those computation times to obtain the computation time ratio of the computing nodes on the longest path.
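Reducing per-node times to a ratio (S503) can be sketched as follows, assuming the times are expressed in whole time units:

```python
from functools import reduce
from math import gcd

def computation_time_ratio(durations: list) -> list:
    """S503: reduce per-node computation times (in whole time units) to
    their smallest integer ratio, e.g. [2, 10, 12] hours -> 1:5:6."""
    g = reduce(gcd, durations)
    return [d // g for d in durations]
```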
Optionally, the present application provides another specific way for the computer device to obtain the computation time ratio. As shown in FIG. 7, another method for the above S401, "obtaining the computation time ratio of the computing nodes on the longest path in the computation graph", may specifically include:
S601. Construct a computation-time estimation model.
When the computer device needs to obtain the computation time ratio of the computing nodes on the longest path, it may first construct a computation-time estimation model for analyzing information such as the computation steps contained in each computing node to estimate its computation amount. It should be noted that the above computation-time estimation model may be a pre-trained estimation model, which belongs to the prior art and is not described in detail here.
S602. Use the computation-time estimation model to obtain the computation time of each computing node on the longest path.
Once the computer device has constructed the computation-time estimation model, it can use the model to estimate the computation time of each computing node on the longest path by analyzing information such as the computation steps of each computing node on that path.
S603. Determine the computation time ratio of the computing nodes on the longest path according to the computation time of each computing node on the longest path.
Step S603 of this embodiment has the same content as step S503 above; for details, refer to the foregoing description, which is not repeated here.
In one embodiment, a specific implementation of the above step S402 is also provided. As shown in FIG. 8, the above S402, "determining the insertion position of at least one check node on the longest path according to the computation time ratio", includes:
S701. Divide the longest path evenly into a preset number of sub-paths according to the computation time ratio.
Here, the preset number represents the number of check nodes to be pre-inserted on the longest path, and may be determined in advance by the computer device according to the length of the longest path or actual application requirements. In this embodiment, once the computer device has determined the preset number of check nodes to insert, it may further divide the longest path evenly into the preset number of sub-paths by analyzing the computation time ratio of the computing nodes on the path, so that the computation time of the computing nodes on each sub-path is balanced.
S702. Select at least one of the sub-paths as the insertion position for inserting a check node.
Once the computer device has divided the longest path evenly into the preset number of sub-paths, it can select at least one of the sub-paths as the insertion position for a check node, such that the total computation time of all computing nodes before the selected sub-path is as equal as possible to that of all computing nodes after it, so as to balance the computation time of the computing nodes on the longest path. The method of determining the insertion position of a check node described in the embodiment of FIG. 8 is illustrated by an example. Consider the computation graph shown in FIG. 8A (the computation graph before any check node is inserted), whose longest path contains computing node 1, computing node 2, and computing node 3, with a known computation time ratio of 1:5:6. By analyzing this ratio, the longest path is evenly divided into two parts, so that the combined computation time of computing node 1 and computing node 2 (6 hours) equals the computation time of computing node 3 (6 hours); check node a can then be inserted between computing node 2 and computing node 3 (as shown in FIG. 8B).
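The division in S701 can be sketched as a greedy partition of the path by cumulative computation time. The greedy boundary rule is an illustrative assumption; the patent only requires the sub-path totals to be balanced.

```python
def subpath_cut_points(durations: list, k: int) -> list:
    """S701 sketch: cut the longest path into k sub-paths of roughly equal
    total computation time. A cut after index i means a check node is
    inserted between node i and node i+1 (0-based indices)."""
    total = sum(durations)
    cuts, acc, boundary = [], 0.0, total / k
    for i, d in enumerate(durations[:-1]):
        acc += d
        if acc >= boundary and len(cuts) < k - 1:
            cuts.append(i)
            boundary += total / k
    return cuts
```

With the text's 1:5:6 example and k = 2, this yields a single cut after the second node, i.e. check node a between computing node 2 and computing node 3.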
FIG. 9 is a flowchart of another implementation of S102 in the embodiment of FIG. 2. As shown in FIG. 9, "inserting at least one check node into the computation graph" in S102 includes:
S801. Obtain a start computing node and an end computing node in the computation graph that are separated by at least one computing node.
When a path spanning computing nodes exists in the computation graph, the start computing node and the end computing node on that path can be obtained, so that a check node can subsequently be inserted between them. It should be noted that the number of spanned computing nodes may be one or more, which is not limited in this embodiment.
S802. Insert at least one check node at an intermediate position between the start computing node and the end computing node.
Once the computer device has determined the start computing node and the end computing node, it can insert at least one check node at an intermediate position between them. The method of inserting at least one check node into the computation graph described in the embodiment of FIG. 9 is illustrated by an example. Consider the computation graph shown in FIG. 8A (the computation graph before any check node is inserted), whose longest path contains computing node 1, computing node 2, and computing node 3, where the connection between computing node 1 and computing node 3 spans computing node 2. Computing node 1 is then the start computing node and computing node 3 the end computing node, and correspondingly a check node b is inserted at an intermediate position between computing node 1 and computing node 3 (as shown in FIG. 9A).
FIG. 10 is a flowchart of an implementation of S101 in the embodiment of FIG. 2. As shown in FIG. 10, "obtaining the computation graph of the computational network model" in S101 includes:
S901. Load the topology and parameters of the computational network model.
In practical applications, the compiler of the computer device can obtain the computation graph of the computational network model by loading and compiling the topology and parameters of the computational network model.
S902. Compile the topology and parameters of the computational network model to obtain the computation graph of the computational network model.
When the compiler of the computer device loads the topology and parameters of the computational network model, it can compile them to obtain the computation graph of the computational network model, so that the computation resources consumed by the computation graph can be optimized while it is subsequently run.
In summary, the present application further provides a computation graph optimization method. As shown in FIG. 11, the method includes:
S1001. Load the topology and parameters of the computational network model.
S1002. Compile the topology and parameters of the computational network model to obtain the computation graph of the model.
S1003. Obtain the compute-consumption-time ratio of each computing node on the longest path in the computation graph.
S1004. Determine the insertion position of at least one check node on the longest path according to the compute-consumption-time ratio.
S1005. Insert at least one check node at the insertion position of the at least one check node.
S1006. Obtain a start computing node and an end computing node in the computation graph that are separated by at least one computing node.
S1007. Insert at least one check node at an intermediate position between the start computing node and the end computing node.
S1008. When execution reaches each check node, obtain the current delay performance margin through the current check node.
S1009. Determine whether the current delay performance margin is sufficient. If it is sufficient, perform step S1010; if it is insufficient, perform step S1011.
S1010. Select the storage optimization strategy to optimize the computing nodes after the current check node. The storage optimization strategy includes storing the data generated during computation by the computing nodes after the current check node in a high-access-latency storage space.
S1011. Obtain the size of the data generated during computation by the computing nodes after the current check node, and compare it with the size of the preset storage space. If the data size exceeds the size of the preset storage space, perform step S1012; if it does not, perform step S1013.
S1012. Split the computing nodes after the current check node, and store the data generated during computation by the split computing nodes in a low-access-latency storage space.
S1013. Store the data generated during computation by the computing nodes after the current check node in a low-access-latency storage space.
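The position-selection logic of steps S1003–S1005 can be sketched as follows. This is a minimal illustration only: the function name and the even-time-split heuristic are assumptions for exposition, not an API prescribed by the application.

```python
# Illustrative sketch of steps S1003-S1005: choosing check-node insertion
# positions on the longest path from per-node compute-time ratios.
# All names here are hypothetical.

def insertion_positions(time_ratios, num_checks):
    """Split the longest path into num_checks + 1 segments of roughly
    equal estimated compute time; return the node indices after which
    a check node would be inserted."""
    total = sum(time_ratios)
    target = total / (num_checks + 1)
    positions, acc, boundary = [], 0.0, target
    for i, r in enumerate(time_ratios[:-1]):  # never insert after the last node
        acc += r
        if acc >= boundary and len(positions) < num_checks:
            positions.append(i)  # insert a check node after node i
            boundary += target
    return positions

# With ratios 1:5:6 (as in the worked example below, FIG. 8B) and one
# check node, half of the estimated time (6/12) has elapsed after the
# second node (index 1).
print(insertion_positions([1, 5, 6], 1))  # -> [1]
```

A denser path simply yields more candidate boundaries; the same loop applies unchanged.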
To illustrate the steps of the optimization method across all of the above embodiments, take the computation graph shown in FIG. 8B as an example. Suppose the total preset compute-consumption time of computing node 1, computing node 2, and computing node 3 on the longest path in FIG. 8B is T and the memory footprint is M. The compute-consumption-time ratio of computing node 1, computing node 2, and computing node 3, estimated by the computer device from the computation amount or from a time-estimation model, is 1:5:6, and the preset time threshold is th. When the computer device reaches check node a, let the actual compute-consumption time of computing node 1 and computing node 2 before check node a be Tr. The current delay performance margin is then T*6/12 - Tr. If T*6/12 - Tr > th, the storage optimization strategy is applied to the computing nodes after the check node: specifically, the intermediate results or temporary variables required by the computation in computing node 3 are stored in high-access-latency storage space (off-chip GPU storage or TPU storage). If T*6/12 - Tr ≤ th, the delay optimization strategy is applied instead: the intermediate results or temporary variables required by the computation in computing node 3 are stored in low-access-latency storage space (memory or cache). Before storing, the device may first determine whether the memory or cache space required by the computation of computing node 3 exceeds the existing memory or cache space M; if it does, another delay optimization strategy is adopted, for example splitting computing node 3 and storing the intermediate results or temporary variables required by the split computing nodes in low-access-latency storage space (memory or cache).
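The arithmetic of this worked example can be sketched as follows. The function name, argument names, and return convention are hypothetical, introduced only to make the margin computation and the strategy branch concrete.

```python
# Sketch of the strategy selection in the worked example: total preset
# time T, ratio 1:5:6, a check node after node 2, and threshold th.
# Names are illustrative only.

HIGH_LATENCY, LOW_LATENCY = "off-chip", "on-chip"

def choose_strategy(T, elapsed_ratio, Tr, th, data_size, capacity):
    """Return (strategy, storage tier, split?) for the nodes after the
    current check node, given the actual elapsed time Tr."""
    margin = T * elapsed_ratio - Tr  # current delay performance margin
    if margin > th:
        # Margin is ample: trade latency for memory (storage optimization).
        return ("store", HIGH_LATENCY, False)
    # Margin is tight: trade memory for latency (delay optimization);
    # split the node first if its data would not fit in fast storage.
    split = data_size > capacity
    return ("delay", LOW_LATENCY, split)

# Ratio 1:5:6 means nodes 1 and 2 account for 6/12 of the preset time T.
print(choose_strategy(T=12.0, elapsed_ratio=6/12, Tr=4.0, th=1.0,
                      data_size=8, capacity=16))
# -> ('store', 'off-chip', False)
```

Running the same call with Tr = 6.0 and data_size = 32 exercises the other branch: the margin is no longer above th, so the node is split and its data kept in low-latency storage.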
It should be understood that, although the steps in the flowcharts of FIGS. 2-11 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-11 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose execution order is likewise not necessarily sequential.
In one embodiment, as shown in FIG. 12, a computation graph optimization apparatus is provided, including: a first acquisition module 11, an insertion module 12, a second acquisition module 13, a determination module 14, and an optimization module 15. The first acquisition module 11 is configured to obtain the computation graph of a computational network model, the computation graph including a plurality of computing nodes; the insertion module 12 is configured to insert at least one check node into the computation graph; the second acquisition module 13 is configured to obtain, when execution reaches each check node, the current performance margin through the current check node; the determination module 14 is configured to determine an optimization strategy according to the current performance margin; and the optimization module 15 is configured to optimize the computing nodes after the current check node according to the optimization strategy. For the specific limitations of the computation graph optimization apparatus, reference may be made to the limitations of the computation graph optimization method above, which are not repeated here. Each module of the apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules. In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program. When executing the computer program, the processor implements the following steps: obtaining a computation graph of a computational network model, the computation graph including a plurality of computing nodes; inserting at least one check node into the computation graph; when execution reaches each check node, obtaining the current delay performance margin through the current check node; determining an optimization strategy according to the current delay performance margin; and optimizing the computing nodes after the current check node according to the optimization strategy. The implementation principle and technical effect of this computer device are similar to those of the method embodiments above and are not repeated here.
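As a non-limiting sketch, the five modules of FIG. 12 could be composed as below. The class and method names are hypothetical and do not form part of the claimed apparatus; the sketch only shows how the modules' responsibilities chain together.

```python
# Minimal sketch of the five-module apparatus of FIG. 12 wired together.
# Each module is passed in as a callable; names are hypothetical.

class GraphOptimizer:
    def __init__(self, acquire, insert, measure, decide, optimize):
        self.acquire, self.insert = acquire, insert  # modules 11, 12
        self.measure, self.decide = measure, decide  # modules 13, 14
        self.optimize = optimize                     # module 15

    def run(self, model):
        graph = self.acquire(model)        # first acquisition module
        checks = self.insert(graph)        # insertion module
        for node in checks:                # at each check node...
            margin = self.measure(node)    # second acquisition module
            strategy = self.decide(margin) # determination module
            self.optimize(node, strategy)  # optimization module
        return graph
```

In use, `acquire` would wrap the load-and-compile steps, `measure` the margin computation at a check node, and `optimize` the chosen storage or delay optimization.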
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the following steps: obtaining a computation graph of a computational network model, the computation graph including a plurality of computing nodes; inserting at least one check node into the computation graph; when execution reaches each check node, obtaining the current performance margin through the current check node; determining an optimization strategy according to the current performance margin; and optimizing the computing nodes after the current check node according to the optimization strategy. The implementation principle and technical effect of this computer-readable storage medium are similar to those of the method embodiments above and are not repeated here.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the computing/processing device according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals; such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

For example, FIG. 13 shows a computing/processing device that may implement the method of the present invention. The device may be a computer device and conventionally includes a processor 1010 and a computer program product or computer-readable medium in the form of a memory 1020. The memory 1020 has a storage space 1030 for program code 1031 for performing any of the method steps above; for example, the storage space 1030 may include individual program codes 1031 respectively implementing the various steps of the above method. These program codes may be read from, or written to, one or more computer program products, which include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 14. The storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 1020 in the computing/processing device of FIG. 13, and the program code may, for example, be compressed in an appropriate form. Typically, the storage unit includes computer-readable code 1031', i.e., code readable by a processor such as the processor 1010, which, when run by the computing/processing device, causes the device to perform the steps of the method described above.

References herein to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. In addition, note that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment. Numerous specific details are set forth in the description provided here; it will be understood, however, that embodiments of the present invention may be practiced without these specific details, and in some instances well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim, and the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer; in a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names. The technical features of the above embodiments may be combined arbitrarily; for conciseness of description, not all possible combinations of the technical features of the above embodiments have been described, but as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification. The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be understood as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this invention patent shall be subject to the appended claims.

Claims (16)

  1. A method for optimizing a computation graph, wherein the method comprises:
    obtaining a computation graph of a computational network model, the computation graph comprising a plurality of computing nodes;
    inserting at least one check node into the computation graph;
    when execution reaches each check node, obtaining a current performance margin through the current check node;
    determining an optimization strategy according to the current performance margin; and
    optimizing computing nodes after the current check node according to the optimization strategy.
  2. The method according to claim 1, wherein the current performance margin comprises a current delay performance margin, and the determining an optimization strategy according to the current performance margin comprises:
    if the current delay performance margin is sufficient, determining a storage optimization strategy as the optimization strategy, the storage optimization strategy being used to reduce the memory occupied by the computing nodes during computation;
    if the current delay performance margin is insufficient, determining a delay optimization strategy as the optimization strategy, the delay optimization strategy being used to reduce the time consumed by the computing nodes during computation.
  3. The method according to claim 1, wherein the current performance margin comprises a current storage performance margin, and the determining an optimization strategy according to the current performance margin comprises:
    if the current storage performance margin is sufficient, determining a delay optimization strategy as the optimization strategy, the delay optimization strategy being used to reduce the time consumed by the computing nodes during computation;
    if the current storage performance margin is insufficient, determining a storage optimization strategy as the optimization strategy, the storage optimization strategy being used to reduce the memory occupied by the computing nodes during computation.
  4. The method according to claim 2 or 3, wherein the storage optimization strategy comprises:
    storing data generated during computation by the computing nodes after the check node in a high-access-latency storage space, the high-access-latency storage space comprising at least a global memory and an off-chip memory;
    and/or, the delay optimization strategy comprises: storing data generated during computation by the computing nodes after the check node in a low-access-latency storage space, the low-access-latency storage space comprising at least a cache space and an on-chip memory.
  5. The method according to claim 4, wherein the delay optimization strategy further comprises:
    obtaining the size of the data generated during computation by the computing nodes after the current check node;
    comparing the size of the data generated during computation by the computing nodes with the size of a preset storage space;
    if the size of the data generated during computation by the computing nodes exceeds the size of the preset storage space, splitting the computing nodes after the current check node, and storing the data generated during computation by the split computing nodes in the low-access-latency storage space;
    if the size of the data generated during computation by the computing nodes does not exceed the size of the preset storage space, storing the data generated during computation by the computing nodes after the current check node in the low-access-latency storage space.
  6. The method according to claim 2, wherein the obtaining a current performance margin through the current check node comprises:
    obtaining a first total target compute-consumption time and a total actual compute-consumption time of all computing nodes before the current check node;
    determining the current performance margin according to the first total target compute-consumption time and the total actual compute-consumption time.
  7. The method according to claim 6, wherein the obtaining a first total target compute-consumption time of all computing nodes before the current check node comprises:
    obtaining a second total target compute-consumption time of all computing nodes on the path on which the current check node is located;
    determining the first total target compute-consumption time according to the second total target compute-consumption time and a preset ratio, the preset ratio being the proportion of the total compute-consumption time of all computing nodes before the current check node in the total compute-consumption time of all computing nodes on the path on which the check node is located.
  8. The method according to claim 1, wherein the inserting at least one check node into the computation graph comprises:
    obtaining a compute-consumption-time ratio of computing nodes on the longest path in the computation graph;
    determining an insertion position of at least one check node on the longest path according to the compute-consumption-time ratio;
    inserting the at least one check node at the insertion position of the at least one check node.
  9. The method according to claim 8, wherein the obtaining a compute-consumption-time ratio of computing nodes on the longest path in the computation graph comprises:
    obtaining the computation amount of each computing node on the longest path;
    obtaining the compute-consumption time of each computing node on the longest path according to the computation amount of each computing node;
    determining the compute-consumption-time ratio of the computing nodes on the longest path according to the compute-consumption time of each computing node on the longest path.
  10. The method according to claim 8, wherein the obtaining a compute-consumption-time ratio of each computing node on the longest path in the computation graph comprises:
    constructing a consumption-time estimation model;
    obtaining the compute-consumption time of each computing node on the longest path by using the consumption-time estimation model;
    determining the compute-consumption-time ratio of the computing nodes on the longest path according to the compute-consumption time of each computing node on the longest path.
  11. The method according to claim 9 or 10, wherein the determining an insertion position of at least one check node on the longest path according to the compute-consumption-time ratio comprises:
    dividing the longest path evenly into a preset number of sub-paths according to the compute-consumption-time ratio;
    selecting at least one of the sub-paths as an insertion position for inserting the check node.
  12. The method according to claim 1 or 8, wherein the inserting at least one check node into the computation graph comprises:
    obtaining a start computing node and an end computing node in the computation graph that are separated by at least one computing node;
    inserting at least one check node at an intermediate position between the start computing node and the end computing node.
  13. The method according to claim 1, wherein the obtaining a computation graph of a computational network model comprises:
    loading the topology and parameters of the computational network model;
    compiling the topology and parameters of the computational network model to obtain the computation graph of the computational network model.
  14. An apparatus for optimizing a computation graph, wherein the apparatus comprises:
    a first acquisition module, configured to obtain a computation graph of a computational network model, the computation graph comprising a plurality of computing nodes;
    an insertion module, configured to insert at least one check node into the computation graph;
    a second acquisition module, configured to obtain, when execution reaches each check node, a current performance margin through the current check node;
    a determination module, configured to determine an optimization strategy according to the current performance margin;
    an optimization module, configured to optimize computing nodes after the current check node according to the optimization strategy.
  15. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 13.
  16. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.
PCT/CN2020/113290 2019-12-09 2020-09-03 Optimization method and apparatus for computation graph, computer device, and storage medium WO2021114757A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911249112.XA CN111158901B (en) 2019-12-09 2019-12-09 Optimization method, optimization device, computer equipment and storage medium for calculation graph
CN201911249112.X 2019-12-09

Publications (1)

Publication Number Publication Date
WO2021114757A1 (en)

Family

ID=70555798

Country Status (2)

Country Link
CN (1) CN111158901B (en)
WO (1) WO2021114757A1 (en)



Also Published As

Publication number Publication date
CN111158901B (en) 2023-09-08
CN111158901A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
US9031826B2 (en) Method and apparatus for simulating operation in a data processing system
CN103999056B (en) Method, apparatus and system for managing workload memory allocation
US11500959B2 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
CN109669772B (en) Parallel execution method and equipment of computational graph
JP6265033B2 (en) Process migration method, computer system operating to perform process migration, intermediate computing resource in such a system, and computing-resource selection method prior to partitioning for the process migration method
Li et al. Sculptor: Flexible approximation with selective dynamic loop perforation
WO2021114757A1 (en) Optimization method and apparatus for computation graph, computer device, and storage medium
Pinto et al. Refactoring for energy efficiency: A reflection on the state of the art
Huang et al. Novel heuristic speculative execution strategies in heterogeneous distributed environments
US20130318540A1 (en) Data flow graph processing device, data flow graph processing method, and data flow graph processing program
Yu et al. System-wide trade-off modeling of performance, power, and resilience on petascale systems
US20190220257A1 (en) Method and apparatus for detecting inter-instruction data dependency
Deng et al. Cost-driven autonomous mobility
US8661424B2 (en) Auto-generation of concurrent code for multi-core applications
CN117291260A (en) Deep learning framework adaptation method, apparatus, device, storage medium, and product
KR20150101870A (en) Method and apparatus for avoiding bank conflict in memory
Kersten et al. A Hoare logic for energy consumption analysis
JP2009075965A (en) Software development method and software development device
CN114021733B (en) Model training optimization method, device, computer equipment and storage medium
JP5687603B2 (en) Program conversion apparatus, program conversion method, and conversion program
CN102063308B (en) Method for controlling processing flow of seismic prospecting data
US11068250B2 (en) Crowdsourced API resource consumption information for integrated development environments
CN105718223B (en) Method and apparatus for managing workload memory allocation
McKean et al. Use of model‐based architecture attributes to construct a component‐level trade space
Toldin et al. Soundness proof for a Hoare logic for energy consumption analysis

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20898036

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20898036

Country of ref document: EP

Kind code of ref document: A1