WO2023071149A1 - Video memory optimization method and apparatus, device, storage medium and program product - Google Patents

Info

Publication number
WO2023071149A1
Authority
WO
WIPO (PCT)
Prior art keywords
graph
video memory
calculation
computation
target
Prior art date
Application number
PCT/CN2022/093101
Other languages
French (fr)
Chinese (zh)
Inventor
赵成钢
颜子杰
张宇帆
张行程
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 filed Critical 上海商汤智能科技有限公司
Publication of WO2023071149A1 publication Critical patent/WO2023071149A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44568 Immediately runnable code
    • G06F9/44578 Preparing or optimising for loading
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/548 Queue

Definitions

  • Embodiments of the present disclosure relate to the technical field of data processing, and relate to, but are not limited to, a video memory optimization method and apparatus, device, storage medium, and program product.
  • Embodiments of the present disclosure provide a technical solution for video memory optimization.
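  • An embodiment of the present disclosure provides a video memory optimization method, the method comprising:
  • a first computation graph is generated based on a preset network model; the association relationship between the video memory peak of the first computation graph and the running data is determined; the first computation graph is adjusted based on the association relationship to generate at least one second computation graph; a target computation graph is determined in the at least one second computation graph based on the video memory peak and running time of the at least one second computation graph; and, based on the target computation graph, the video memory space required by the preset network model is determined.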
  • Generating the first computation graph based on the preset network model includes: generating computation graph information in a data exchange format based on the preset network model; and generating, based on the operator queue in the computation graph information, a first computation graph matching the computation graph information.
  • In this way, the first computation graph can be obtained and then adjusted through various optimization schemes to generate multiple second computation graphs, which makes it easier to screen for the computation graph with the optimal video memory overhead.
  • Determining the association relationship between the video memory peak in the first computation graph and the running data includes: determining the occurrence time of the video memory peak in the first computation graph; determining the operation data of the operators in the first computation graph; determining the generation time of the operation data and the application time of the operation data in the first computation graph; and determining the timing relationship between the generation time, the application time, and the occurrence time of the video memory peak as the association relationship.
  • Adjusting the first computation graph based on the association relationship to generate at least one second computation graph includes: determining, in the first computation graph, the target operation data whose association relationship satisfies a preset condition; and adjusting the first computation graph based on the target operation data to generate the at least one second computation graph.
  • In this way, the second computation graph can be obtained by moving the operator corresponding to the target operation data in the first computation graph, so that the video memory peak of the second computation graph can be reduced.
  • Determining the target operation data whose association relationship satisfies a preset condition includes: determining, among the operation data of the operators in the first computation graph, the operation data whose generation time is before the occurrence time of the video memory peak and whose application time is after the occurrence time of the video memory peak as the target operation data satisfying the preset condition.
  • Adjusting the first computation graph based on the target operation data to generate the at least one second computation graph includes: determining, in the first computation graph, the target operator corresponding to the target operation data; and adjusting the target operator in the first computation graph based on the occurrence time of the video memory peak in the first computation graph to generate the at least one second computation graph. In this way, multiple second computation graphs can be generated from the first computation graph by moving the target operator according to the occurrence time of the video memory peak.
  • Adjusting the target operator in the first computation graph based on the occurrence time of the video memory peak in the first computation graph to generate the at least one second computation graph includes: in the first computation graph, adjusting the execution time of the target operator to after the occurrence time of the video memory peak, and generating the second computation graph. In this way, by moving the target operator to after the video memory peak, the video memory peak of the newly generated second computation graph can be reduced, thereby optimizing the video memory space required by the second computation graph.
  • Determining the target computation graph in the at least one second computation graph based on the video memory peak and running time of the at least one second computation graph includes: obtaining a preset video memory overhead and a preset trade-off ratio, where the preset trade-off ratio is used to weigh the running time of a computation graph against the video memory it requires; scoring, based on the preset video memory overhead and the preset trade-off ratio, the video memory peak and running time of each second computation graph together with the running time of the corresponding first computation graph, to obtain a scoring result for each second computation graph; sorting the second computation graphs in the at least one second computation graph based on their scoring results to obtain a sorting queue; and determining the target computation graph in the at least one second computation graph based on the sorting queue. In this way, the target computation graph with the optimal video memory overhead can be found through a queue search.
  • Scoring the video memory peak and running time of each second computation graph together with the running time of the corresponding first computation graph to obtain the scoring result of each second computation graph includes: determining the video memory score of each second computation graph based on its video memory peak, the preset video memory overhead, and the preset trade-off ratio; determining the running time score of each second computation graph based on the preset trade-off ratio, the running time of that second computation graph, and the running time of the corresponding first computation graph; and determining the scoring result of each second computation graph based on its video memory score and running time score. In this way, both the running time and the video memory peak of the second computation graph are considered when scoring it, so that the final computation graph optimizes the video memory space without sacrificing much time.
  • Determining the target computation graph in the at least one second computation graph based on the sorting queue includes: searching the sorting queue for the first candidate computation graph with the best scoring result; and, in response to the video memory space required by the first candidate computation graph satisfying the preset video memory overhead, determining the first candidate computation graph as the target computation graph. In this way, the number of searches can be kept as small as possible, and the video memory overhead of the target computation graph found can be reduced.
  • Determining the target computation graph in the at least one second computation graph based on the sorting queue also includes: in response to the video memory space required by the first candidate computation graph not satisfying the preset video memory overhead, adjusting the first candidate computation graph based on its target operation data to obtain at least one third computation graph; updating the sorting queue based on the scoring result of the at least one third computation graph to obtain an updated sorting queue; checking, in the updated sorting queue, whether the video memory space required by the second candidate computation graph with the best scoring result satisfies the preset video memory overhead; and, in response to the video memory space required by the second candidate computation graph not satisfying the preset video memory overhead and the number of searches reaching a preset threshold, determining the computation graph with the best scoring result in the sorting queue of the last search as the target computation graph. In this way, if no computation graph satisfying the preset video memory overhead can be found, the computation graph with the best score in the latest sorting queue is used as the target computation graph.
  • Determining the video memory space required by the preset network model based on the target computation graph includes: determining the video memory space required by the target computation graph as the video memory space required for training the preset network model. In this way, the video memory space required for training the network model can be optimized without sacrificing the running time required for training the network model.
  • An embodiment of the present disclosure provides a video memory optimization device, the device comprising:
  • the first generation module is configured to generate a first calculation graph based on a preset network model;
  • the first determination module is configured to determine the association relationship between the peak value of the video memory of the first calculation graph and the running data;
  • the second generation module is configured to adjust the first calculation graph based on the association relationship, and generate at least one second calculation graph;
  • the second determination module is configured to determine a target calculation graph in the at least one second calculation graph based on the peak value of the video memory and the running time of the at least one second calculation graph;
  • the third determination module is configured to determine the video memory space required by the preset network model based on the target computation graph.
  • An embodiment of the present disclosure provides a computer storage medium, on which computer-executable instructions are stored. After the computer-executable instructions are executed, the above video memory optimization method can be realized.
  • An embodiment of the present disclosure provides a computer device.
  • the computer device includes a memory and a processor.
  • Computer-executable instructions are stored in the memory.
  • When the processor runs the computer-executable instructions in the memory, the above video memory optimization method can be implemented.
  • An embodiment of the present disclosure also provides a computer program product. The computer program product includes computer-readable code, and when the computer-readable code runs in an electronic device, a processor of the electronic device executes the video memory optimization method described in any one of the above embodiments.
  • Embodiments of the present disclosure provide a video memory optimization method and apparatus, device, storage medium, and program product.
  • For the acquired preset network model, a first computation graph is first generated, and the association relationship between the video memory peak required for running the first computation graph and the running data of each operator is analyzed, so that at least one second computation graph can be generated by optimizing the first computation graph. Then, the video memory peak of each second computation graph and the running time required to run it are considered together to search for the target computation graph among the multiple second computation graphs. In this way, the target computation graph found optimizes the video memory space while also taking the running time into account. Finally, the target computation graph is used to optimize the video memory space required by the preset network model.
  • The time overhead of the computation graph is also taken into account when evaluating the computation graph, so that the final target computation graph satisfies both the space budget and the time budget, thereby optimizing the video memory space without sacrificing much time.
  • FIG. 1 is a schematic diagram of an implementation flow of a video memory optimization method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic flow diagram of another implementation of the video memory optimization method provided by the embodiment of the present disclosure.
  • FIG. 3A is a schematic flow diagram of another implementation of the video memory optimization method provided by the embodiment of the present disclosure.
  • FIG. 3B is a schematic diagram of an application scenario of a video memory optimization method provided by an embodiment of the present disclosure
  • FIG. 4A is a schematic flow diagram of yet another implementation of the video memory optimization method provided by the embodiment of the present disclosure.
  • FIG. 4B is a schematic diagram of another application scenario of the video memory optimization method provided by the embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of the curve of the video memory occupied by a computation graph provided by the embodiment of the present disclosure;
  • FIG. 6 is a schematic diagram of an implementation flow of a video memory optimization method provided by an embodiment of the present disclosure
  • FIG. 7 is a schematic diagram of the structure and composition of a video memory optimization device according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of the composition and structure of a computer device according to an embodiment of the present disclosure.
  • The terms "first/second/third" are only used to distinguish similar objects and do not represent a specific ordering of the objects. It is understood that, where permitted, the specific order or sequence of "first/second/third" may be interchanged, so that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein.
  • Computation graph: used to graphically represent a calculation process.
  • A computation graph is a "language" for describing equations. Since it is a graph, it has nodes (for example, variables) and edges (operations, for example, simple functions).
  • a neural network model can essentially be represented by a computational graph, and its training process can be divided into three parts: forward propagation, back propagation, and parameter update.
  • Video memory: also called the frame buffer, used to store the data processed by the graphics card chip or the rendering data to be extracted.
  • video memory is a component used to store graphics information to be processed.
  • An exemplary application of the video memory optimization device provided by the embodiment of the present disclosure is described below.
  • The device provided by the embodiment of the present disclosure can be implemented as various types of user terminals, such as a notebook computer, a tablet computer, a desktop computer, a camera, or a mobile device (for example, a personal digital assistant, a dedicated messaging device, or a portable game device), and can also be implemented as a server.
  • An exemplary application when the device is implemented as a terminal or a server is described below.
  • The method can be applied to a computer device, and the functions realized by the method can be realized by a processor in the computer device calling program code.
  • The program code can be stored in a computer storage medium.
  • The computer device therefore includes at least a processor and a storage medium.
  • An embodiment of the present disclosure provides a video memory optimization method, as shown in FIG. 1 , which is described in conjunction with the steps shown in FIG. 1 :
  • Step S101: generating a first computation graph based on a preset network model.
  • the preset network model may be any type of network model to be trained, such as a deep neural network model to be trained, a residual network model to be trained, or any large-scale neural network model to be trained.
  • the first calculation graph is a graphical representation of the calculation process of the preset network model, including connected nodes and edges. Wherein, the nodes represent each operator executing tasks in the computation graph, and the edges are used to connect each operator according to the execution sequence of each task in the network model.
  • The preset network model is input into a deep learning framework (such as TensorFlow, PyTorch, or Parrots), which exports the network model as computation graph information in JavaScript Object Notation (JSON) format. The operator queue is arranged by reading in the JSON-format computation graph information, and the input data and output data of each operator are matched with the operation data in turn, so as to generate a first computation graph. In this way, a computation graph can be generated from the JSON-format computation graph information produced by each deep learning framework.
  • Taking as an example a network model that is an image recognition network to be trained, which includes an input layer, a convolutional layer, a pooling layer, and a fully connected layer: first, the image recognition network is input into different deep learning frameworks to extract computation graph information in JSON format, and the JSON-format computation graph information corresponding to each framework is obtained.
  • In this information, the input layer, convolutional layer, pooling layer, and fully connected layer are each represented as operators executing different tasks; then, according to the execution order of these operators, the operators are connected through operation edges to obtain the first computation graph representing the image recognition network.
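  • The construction described above can be sketched as follows. This is a minimal illustration only: the JSON layout (an "operators" list whose entries have "name", "inputs", and "outputs") and the function name build_graph are assumptions made for the example, not the export format of any particular deep learning framework.

```python
import json
from collections import defaultdict

# Hypothetical JSON export of the image recognition example; real framework
# exports use their own schemas.
GRAPH_JSON = """
{
  "operators": [
    {"name": "input", "inputs": [],    "outputs": ["x"]},
    {"name": "conv",  "inputs": ["x"], "outputs": ["c"]},
    {"name": "pool",  "inputs": ["c"], "outputs": ["p"]},
    {"name": "fc",    "inputs": ["p"], "outputs": ["y"]}
  ]
}
"""

def build_graph(graph_json):
    """Return (operator queue, edges) built from JSON computation graph information."""
    info = json.loads(graph_json)
    queue = info["operators"]      # operators listed in execution order
    producer = {}                  # operand name -> operator that produced it
    edges = defaultdict(list)      # operator -> operators that consume its output
    for op in queue:
        for operand in op["inputs"]:
            # Match each input operand with the operator that produced it;
            # operands with no producer are treated as external graph inputs.
            if operand in producer:
                edges[producer[operand]].append(op["name"])
        for operand in op["outputs"]:
            producer[operand] = op["name"]
    return [op["name"] for op in queue], dict(edges)

nodes, edges = build_graph(GRAPH_JSON)
```

  • Here nodes holds the operator queue and edges holds the operation edges that connect each operator to its consumers.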
  • Step S102: determining the association relationship between the video memory peak of the first computation graph and the running data.
  • the peak value of the video memory of the first computation graph is the peak value of the video memory space required by the first computation graph during operation. By traversing the first computation graph, it is possible to determine the video memory space required to run all the operators in the first computation graph sequentially, as well as the video memory peak value that occurs during the entire running process.
  • the running data refers to the running data of the first computation graph.
  • the running data of the first computation graph includes: data generated by each operator during the process of traversing the first computation graph.
  • By traversing the first computation graph, the video memory peak required by the first computation graph and the running data of each operator can be obtained, so as to analyze the timing relationship between the time when the video memory peak occurs and the generation time and application time of each operator's running data; this timing relationship is used as the association relationship between the video memory peak of the first computation graph and the running data.
  • For example, the generation time and the application time of the running data of some operators are both after the occurrence time of the video memory peak, or the generation time and the application time of the running data of some operators are both before the occurrence time of the video memory peak.
  • Alternatively, the generation time of the running data of some operators is before the occurrence time of the video memory peak, while the application time of that running data is after it.
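  • A minimal sketch of such a traversal is given below. It makes several simplifying assumptions that are not stated in the text: time is measured as the step index in the operator queue, each operator produces exactly one piece of running data of a fixed size, and running data occupies video memory from the step that generates it until the last step that applies it. The names Op and profile are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    out_size: int                               # memory held by this operator's running data
    inputs: list = field(default_factory=list)  # names of operators whose data it applies

def profile(schedule):
    """Traverse operators in execution order; return (peak, peak_time, lifetimes)."""
    gen = {op.name: t for t, op in enumerate(schedule)}  # generation time of each datum
    use = dict(gen)                                      # last application time of each datum
    for t, op in enumerate(schedule):
        for src in op.inputs:
            use[src] = max(use[src], t)
    size = {op.name: op.out_size for op in schedule}
    # Memory in use at step t is the total size of all running data alive at t.
    usage = [sum(size[n] for n in gen if gen[n] <= t <= use[n])
             for t in range(len(schedule))]
    peak = max(usage)
    return peak, usage.index(peak), {n: (gen[n], use[n]) for n in gen}

SCHEDULE = [Op("a", 4), Op("b", 3), Op("c", 6, ["b"]),
            Op("d", 1, ["c"]), Op("e", 1, ["a", "d"])]
peak, peak_time, lifetimes = profile(SCHEDULE)
```

  • In this toy schedule the data of operator "a" is generated at step 0 and applied at step 4, so it spans the peak at step 2; this is the third timing relationship listed above.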
  • Step S103: adjusting the first computation graph based on the association relationship to generate at least one second computation graph.
  • At least one optimization scheme for the first computation graph is determined according to the timing relationship between the occurrence time of the video memory peak in the first computation graph, the generation time of the operators' operation data, and the application time of the operation data; the first computation graph is then adjusted based on each optimization scheme to generate a second computation graph corresponding to that optimization scheme.
  • The first computation graph is adjusted using any one of its optimization schemes, and the adjusted first computation graph is used as the second computation graph; that is, the second computation graph is obtained by adjusting the first computation graph with an optimization scheme. In this way, when multiple optimization schemes are determined for the first computation graph, multiple second computation graphs can be obtained by adjusting the first computation graph with each of the optimization schemes in turn.
  • The first computation graph may be adjusted by analyzing the association relationship between the video memory peak and the running data. If the association relationship satisfies the preset condition, that is, if the generation time of the running data of any operator in the first computation graph is before the occurrence time of the video memory peak but the application time of that running data is after the occurrence time of the video memory peak, the execution time of that operator in the first computation graph is moved to after the video memory peak; this adjusts the first computation graph and yields the second computation graph.
  • In the first computation graph, the operator corresponding to the target operation data is screened out, where the target operation data is the operation data whose generation time is before the occurrence time of the video memory peak and whose application time is after it;
  • at least one second computation graph is then obtained by moving the operator corresponding to the target operation data in the first computation graph.
  • The preset condition can be set according to the order of the time when an operator's running data is generated, the time when the running data is applied, and the time when the video memory peak occurs. For example, the preset condition may be that the generation time of the operator's running data is before the occurrence time of the video memory peak and the application time of the running data is after the occurrence time of the video memory peak. In this way, among the running data of the operators in the first computation graph, the target operation data whose generation time is before the occurrence time of the video memory peak and whose application time is after it is filtered out. Each piece of target operation data is used as one optimization scheme of the first computation graph, and at least one second computation graph is generated by moving the operators corresponding to the target operation data.
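  • The following sketch illustrates one way to generate second computation graphs by moving target operators to just after the peak. It is an illustration under assumptions, not the disclosed implementation: a graph is represented as a list of hypothetical (name, size, inputs) tuples in execution order, time is the step index, and an operator is only moved when its first consumer runs after the peak, so that the reordered queue still respects data dependencies.

```python
def profile(schedule):
    """schedule: list of (name, size, inputs) in execution order; returns (peak, peak_time)."""
    gen = {name: t for t, (name, _, _) in enumerate(schedule)}
    use = dict(gen)  # last step at which each operator's output is applied
    for t, (_, _, inputs) in enumerate(schedule):
        for src in inputs:
            use[src] = max(use[src], t)
    usage = [sum(size for name, size, _ in schedule if gen[name] <= t <= use[name])
             for t in range(len(schedule))]
    peak = max(usage)
    return peak, usage.index(peak)

def second_graphs(schedule):
    """Build one candidate graph per operator whose output spans the memory peak."""
    _, peak_time = profile(schedule)
    candidates = []
    for gen_time, (name, _, _) in enumerate(schedule):
        users = [t for t, (_, _, inputs) in enumerate(schedule) if name in inputs]
        # Preset condition: generated before the peak, first applied after it.
        if gen_time < peak_time and users and min(users) > peak_time:
            # Move the operator so that it executes right after the peak step.
            moved = (schedule[:gen_time] + schedule[gen_time + 1:peak_time + 1]
                     + [schedule[gen_time]] + schedule[peak_time + 1:])
            candidates.append(moved)
    return candidates

FIRST = [("a", 4, []), ("b", 3, []), ("c", 6, ["b"]),
         ("d", 1, ["c"]), ("e", 1, ["a", "d"])]
SECOND = second_graphs(FIRST)
```

  • In this toy example only operator "a" satisfies the condition, and moving it lowers the peak from 13 to 11 units. Moving an operator later can also lengthen the lifetime of its own inputs, which is one reason each candidate is re-profiled rather than assumed to be better.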
  • Step S104: determining a target computation graph in the at least one second computation graph based on the video memory peak and the running time of the at least one second computation graph.
  • By running each second computation graph, the peak of the video memory space required by the second computation graph is obtained, that is, the video memory peak of the second computation graph, together with the running time required to run the second computation graph, that is, the running time of the second computation graph.
  • By combining the video memory peak and the running time of each second computation graph, a computation graph whose video memory overhead satisfies the preset video memory space can be searched for among the plurality of second computation graphs.
  • Each second computation graph is re-run to obtain its video memory peak and running time; a preset video memory overhead budget and a preset trade-off ratio between running time and video memory space are then obtained, and the second computation graphs are scored and sorted accordingly to search for the target computation graph.
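  • The scoring and queue search can be sketched as below. The text does not give the scoring formula, so the formula used here (a weighted sum of relative budget overshoot and relative running time, lower being better) is purely an assumption for illustration, as are the names score, search, and expand. Each graph is reduced to a (video memory peak, running time) pair, and expand stands in for generating third computation graphs from a candidate. As a simplification, an expanded candidate is removed from the queue.

```python
import heapq
from itertools import count

def score(peak, runtime, base_runtime, budget, ratio):
    """Assumed scoring rule (lower is better); not taken from the source."""
    mem_score = ratio * max(0.0, peak - budget) / budget   # overshoot of the memory budget
    time_score = (1.0 - ratio) * runtime / base_runtime    # slowdown vs. the first graph
    return mem_score + time_score

def search(candidates, expand, base_runtime, budget, ratio, max_searches):
    tie = count()  # tie-breaker so the heap never compares graph payloads
    heap = [(score(p, r, base_runtime, budget, ratio), next(tie), (p, r))
            for p, r in candidates]
    heapq.heapify(heap)
    best = None
    for _ in range(max_searches):
        if not heap:
            break
        _, _, best = heapq.heappop(heap)   # candidate with the best score
        if best[0] <= budget:              # meets the preset video memory overhead
            return best
        for p, r in expand(best):          # adjusted ("third") computation graphs
            heapq.heappush(heap, (score(p, r, base_runtime, budget, ratio),
                                  next(tie), (p, r)))
    # Search limit reached: fall back to the best-scoring graph left in the queue.
    return heap[0][2] if heap else best

# Hypothetical expansion: each adjustment saves 3 units of memory for 10% more time.
result = search([(12.0, 1.0), (11.0, 1.2)],
                expand=lambda g: [(g[0] - 3, g[1] * 1.1)],
                base_runtime=1.0, budget=10.0, ratio=0.5, max_searches=5)
```

  • With these toy numbers the first candidate examined exceeds the budget, its adjusted version is pushed into the queue, and that version is returned on the second search because it fits within the budget.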
  • Step S105: determining, based on the target computation graph, the video memory space required by the preset network model.
  • By running the target computation graph, it is possible to estimate the video memory space required for training the preset network model and the time the training takes.
  • The running time of the target computation graph is used as the duration of training the preset network model,
  • and the video memory space required for running the target computation graph is used as the video memory space required for training the preset network model. As the model grows,
  • the amount of video memory saved by the optimized target computation graph also increases, so that the video memory space for training the network model can be further optimized.
  • For the acquired preset network model, a first computation graph representing the operation process of the network model is first generated, and the association relationship between the video memory peak required for running the first computation graph and the running data of each operator is analyzed, so that at least one second computation graph can be generated by optimizing the first computation graph. Then, the video memory peak of each second computation graph and the running time required to run it are considered together to search for the target computation graph among the multiple second computation graphs. In this way, the target computation graph found not only optimizes the video memory space but also takes the running time into account. Finally, the target computation graph is used to optimize the video memory space required by the preset network model.
  • The time overhead of the computation graph is also taken into account when evaluating the computation graph, so that the final target computation graph satisfies both the space budget and the time budget, thereby optimizing the video memory space without sacrificing much time.
  • Multiple first computation graphs are generated by reading in the JSON-format computation graph information extracted by different deep learning frameworks; that is, the above step S101 can be implemented through the following steps S111 and S112 (not shown in the figure):
  • Step S111: generating computation graph information in a data exchange format based on the preset network model.
  • The preset network model is input into different deep learning frameworks to extract the computation graph information of the preset network model in JSON file format; the computation graph information includes the operators performing different tasks, the operands, and the input and output of each operator.
  • Step S112: generating, based on the operator queue in the computation graph information, a first computation graph matching the computation graph information.
  • By analyzing the order in which each operator performs its task in the preset neural network, the operators are arranged into a queue according to that order; the inputs and outputs of each operator are then matched with the operands in turn, so that the operators are connected through operation edges to form a first computation graph. Based on the JSON-format computation graph information extracted using different deep learning frameworks, multiple first computation graphs can thus be obtained.
  • In this way, the first computation graph can be obtained and then adjusted through various optimization schemes to generate multiple second computation graphs, which makes it easier to screen for the computation graph with the optimal video memory overhead.
  • step S102 by analyzing the timing relationship between the generation time and application time of the operator's operating data in the first calculation graph, and the time when the peak value appears, the correlation between the video memory peak value and the operating data in the first calculation graph is obtained, That is, the above step S102 can be realized through the following steps S121 to S124 (not shown in the figure):
  • Step S121 determining the occurrence time of the video memory peak in the first calculation graph.
  • the first computing graph by traversing and running the first computing graph, it is possible to determine the moment when the video memory space required by the first computing graph reaches its peak; that is, during the running of the first computing graph, the peak value of the video memory occurs within the entire running time time.
  • Step S122 determining operation data of operators in the first computation graph.
  • By traversing and running each first calculation graph, the data generated by each operator during operation, the time when the data is generated, and the time when the data is applied are obtained, that is, the operation data, the generation time of the operation data, and the application time of the operation data.
  • Step S123 determining the generation time of the operation data and the application time of the operation data in the first calculation graph.
  • By traversing and running each first computation graph, the occurrence time of the video memory peak in the first computation graph, the generation time of the operation data generated by each operator during operation, and the application time of the operation data can be obtained.
  • Step S124 determining the timing relationship between the generation time, the application time, and the occurrence time of the video memory peak as the association relationship.
  • The association relationship between the video memory peak and the operation data is obtained by analyzing the timing relationship between the generation time of the operation data, the application time of the operation data, and the occurrence time of the video memory peak. In this way, by analyzing the timing relationship between the time at which an operator generates operation data, the time at which that data is applied, and the time at which the video memory peak is reached, it can be further determined whether the operator needs to be moved to reduce the peak value.
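A minimal sketch of steps S121 to S124, under a simplified memory model that is an assumption of this example: each operand is allocated when its operator runs and released after its last consumer runs. The operator list, sizes, and the name `analyze` are hypothetical.

```python
def analyze(ops):
    """ops: list of (name, inputs, {output: size}) in execution order."""
    gen, last_use, size = {}, {}, {}
    for t, (_name, inputs, outputs) in enumerate(ops):
        for operand in inputs:
            if operand in gen:  # graph inputs are ignored in this sketch
                last_use[operand] = t
        for operand, nbytes in outputs.items():
            gen[operand] = last_use[operand] = t
            size[operand] = nbytes
    # An operand is live from its generation step to its last application step.
    usage = [sum(size[d] for d in gen if gen[d] <= t <= last_use[d])
             for t in range(len(ops))]
    peak_time = usage.index(max(usage))
    return gen, last_use, usage, peak_time

OPS = [
    ("load_w", [], {"w": 4}),
    ("conv", ["x"], {"a": 6}),
    ("relu", ["a"], {"b": 6}),
    ("pool", ["b"], {"c": 2}),
    ("fc", ["c", "w"], {"y": 1}),
]
gen, last_use, usage, peak_time = analyze(OPS)
print(usage, peak_time)  # [4, 10, 16, 12, 7] 2
```

In this toy schedule the operand `w` is generated at step 0 and applied at step 4, while the peak occurs at step 2, which is exactly the kind of timing relationship the analysis looks for.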
  • In step S103, whether the operation data meets the preset condition is determined by analyzing the timing relationship between the generation time of the operators' operation data, the application time of the operation data, and the peak time, and the first calculation graph is then adjusted to generate at least one second calculation graph. That is, the above step S103 can be realized through the steps shown in FIG. 2.
  • The steps shown in FIG. 2 are described below:
  • Step S201 in the first calculation diagram, determine the target operating data whose association relationship satisfies a preset condition.
  • Among the operation data of the operators in the first calculation graph, the operation data whose generation time is before the occurrence time of the video memory peak and whose application time is after the occurrence time of the video memory peak is determined to be the target operation data that meets the preset condition.
  • First, among the operation data of the operators in each first computation graph, the operation data whose generation time is before the occurrence time of the video memory peak is determined; then, in response to the application time of that operation data being after the occurrence time of the video memory peak, the operation data is determined to be the target operation data.
  • The operation data generated by each operator in the first calculation graph during operation is obtained, and from it the operation data whose generation time is before the occurrence time of the video memory peak of the first calculation graph is found. Such operation data is generated by the operator before the video memory peak, which means that the video memory it occupies is included in the peak value.
  • After such operation data is found, its application time is determined, and it is judged whether the application time is after the occurrence time of the video memory peak. If so, the operator generated the operation data before the video memory peak occurred while running the first calculation graph, but the operation data is applied only after the peak: it contributes to the size of the peak, yet it is not used before the peak is reached. Such operation data is used as the target operation data.
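The preset condition can be written as a simple filter. In this sketch, the dictionaries, the operand names, and the function name `find_target_data` are illustrative assumptions; `use_time` is assumed to hold the first step at which each operand is applied.

```python
def find_target_data(gen_time, use_time, peak_time):
    """Return operands generated before the peak and applied only after it."""
    return [d for d in gen_time if gen_time[d] < peak_time < use_time[d]]

# Hypothetical timing data: step index at which each operand is
# generated and first applied; the video memory peak occurs at step 2.
gen_time = {"w": 0, "a": 1, "b": 2, "c": 3}
use_time = {"w": 4, "a": 2, "b": 3, "c": 4}

targets = find_target_data(gen_time, use_time, peak_time=2)
print(targets)  # ['w']
```

Only `w` qualifies: it occupies video memory at the peak even though nothing consumes it until afterwards.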
  • Step S202 based on the target operation data, adjust the first calculation graph to generate the at least one second calculation graph.
  • The target operation data, whose generation time is before the occurrence time of the video memory peak and whose application time is after it, is screened out in the first calculation graph, and the operator corresponding to the target operation data is moved to obtain the second calculation graph, so that the video memory peak of the second calculation graph can be reduced.
  • The second computation graph is generated by moving the position of the operator corresponding to the target operation data in the first computation graph, and the target computation graph is searched for based on information such as the video memory peak of the second computation graph. That is, step S202 in FIG. 2 above can be realized through the steps shown in FIG. 3A.
  • FIG. 3A is a schematic flow diagram of another implementation of the video memory optimization method provided by the embodiment of the present disclosure. The following description is made in conjunction with the steps shown in FIG. 3A:
  • Step S301 in the first computation graph, determine the target operator corresponding to the target operation data.
  • In the first computation graph, the operator that generates the target operation data, that is, the target operator, is determined.
  • Step S302 based on the occurrence time of the video memory peak in the first computation graph, adjust the target operator in the first computation graph to generate the at least one second computation graph.
  • The moment when the video memory reaches its peak value in the first computation graph is obtained. The execution time of the target operator in the first computation graph is adjusted according to this moment, that is, the position of the target operator in the first computation graph is moved to obtain the second computation graph. In this way, by moving the target operators according to the occurrence time of the video memory peak in the first computation graph, multiple second computation graphs are obtained. As shown in FIG. 3B, if analysis of the optimization schemes of the first calculation graph 31 determines that it includes two target operators, the first calculation graph 31 has two optimization schemes. Moving either target operator in the first computation graph 31 generates an optimized computation graph, giving two in total: the second computation graph 32 and the second computation graph 33.
  • the first calculation graph is optimized by moving the target operator to after the peak of the video memory to generate the second calculation graph, that is, the above step S302 can be implemented through the following process:
  • The second computation graph is generated after the execution time of the target operator is adjusted to be later than the occurrence time of the video memory peak.
  • a second computation graph corresponding to the first computation graph is generated.
  • a third computation graph with a reduced video memory peak value can be generated, so that it is convenient to search for an existing operator in at least one third computation graph.
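Moving a target operator to after the peak can be illustrated with a toy schedule. This is a sketch under stated assumptions: operands are released after their last consumer, the target operator has at least one consumer, and the operator is re-scheduled to run just before its first consumer. The names `peak_memory` and `move_after_peak` and all sizes are hypothetical.

```python
def peak_memory(ops):
    """ops: list of (name, inputs, {output: size}); returns (peak value, peak step)."""
    gen, last, size = {}, {}, {}
    for t, (_name, inputs, outputs) in enumerate(ops):
        for d in inputs:
            if d in gen:
                last[d] = t
        for d, n in outputs.items():
            gen[d] = last[d] = t
            size[d] = n
    usage = [sum(size[d] for d in gen if gen[d] <= t <= last[d])
             for t in range(len(ops))]
    peak = max(usage)
    return peak, usage.index(peak)

def move_after_peak(ops, target):
    """Re-schedule operator `target` to run just before its first consumer."""
    idx = next(i for i, op in enumerate(ops) if op[0] == target)
    op = ops[idx]
    rest = ops[:idx] + ops[idx + 1:]
    # First remaining operator that consumes one of the target's outputs.
    first_use = next(i for i, (_n, inputs, _o) in enumerate(rest)
                     if set(inputs) & set(op[2]))
    return rest[:first_use] + [op] + rest[first_use:]

first_graph = [
    ("load_w", [], {"w": 4}),
    ("conv", ["x"], {"a": 6}),
    ("relu", ["a"], {"b": 6}),
    ("pool", ["b"], {"c": 2}),
    ("fc", ["c", "w"], {"y": 1}),
]
second_graph = move_after_peak(first_graph, "load_w")
print(peak_memory(first_graph)[0], peak_memory(second_graph)[0])  # 16 12
```

Because the peak can shift to a different step once an operator is moved, the peak is recomputed on the new schedule rather than assumed.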
  • Figure 4A is a schematic flow diagram of another implementation of the video memory optimization method provided by the embodiment of the present disclosure, combined with the steps shown in Figures 3A and 4A, the following description is made:
  • Step S401 acquiring a preset video memory overhead and a preset trade-off ratio.
  • the preset video memory overhead is a set video memory application amount, that is, a preset video memory space size.
  • the preset trade-off ratio is a parameter set in advance to weigh the proportion of time and space in the score, and the ratio is less than 1.
  • the preset trade-off ratio is used to weigh the proportion between the running time of the calculation graph and the required video memory.
  • Step S402, based on the preset video memory overhead and the preset trade-off ratio, score the video memory peak value and running time of each second computing graph together with the running time of the corresponding first computing graph, and obtain the scoring result of each second computation graph.
  • The preset video memory overhead and the preset trade-off ratio are combined with the video memory peak and running time of the second computation graph, while also taking into account the running time of its original computation graph (that is, the first computation graph from which the second computation graph was generated), to evaluate the video memory overhead and running time of the second computation graph and obtain its scoring result.
  • Each second calculation graph is scored in this way, and the scoring result of each second calculation graph is obtained.
  • the scoring result may be a score, and a larger score indicates better overall performance of the second computation graph's memory overhead and runtime.
  • The scoring result of the second computing graph is obtained by comprehensively evaluating the running time and the video memory peak of the second computing graph. That is, the above step S402 can be realized through the following steps S421 to S423 (not shown in the figure):
  • Step S421 based on the peak value of the video memory, the preset video memory cost and the preset trade-off ratio of each second computing graph, determine the video memory score of each second computing graph.
  • The video memory peak of the second computation graph is subtracted from the preset video memory overhead to obtain a difference. The difference can be positive or negative: if the video memory peak of the second computation graph is greater than the preset video memory overhead, the difference is negative; if it is smaller than the preset video memory overhead, the difference is positive.
  • the video memory score of the second computation graph can be obtained by multiplying the difference with the set preset trade-off ratio.
  • Step S422 based on the preset trade-off ratio, the running time of each second computing graph, and the corresponding running time of the first computing graph, determine the running time score of each second computing graph.
  • The difference between a preset standard parameter (for example, set to 1) and the preset trade-off ratio is used as the evaluation running time ratio.
  • The difference between the running time of the second calculation graph and the running time of the corresponding first calculation graph is taken as a time difference (generally, the time difference is a positive number).
  • The running time score of the second calculation graph can be obtained by multiplying the evaluation running time ratio by the time difference.
  • Step S423, based on the video memory score and the running time score of each second calculation graph, determine the scoring result of each second calculation graph.
  • The scoring result of a second calculation graph can be obtained by adding its video memory score and its running time score, so that a comprehensive score of video memory and running time is obtained for each second calculation graph in the at least one second calculation graph. In this way, in the stage of scoring the second computing graphs, the running time and the video memory peak of each second computing graph are considered together, so that the final computing graph can optimize the video memory space without sacrificing a lot of time cost.
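Steps S421 to S423 can be sketched as three small functions. The sign of the time difference is an assumption made here (first-graph time minus second-graph time) so that a larger total remains better, as the text states for the scoring result; the numeric values are invented for illustration.

```python
def memory_score(peak, budget, ratio):
    # Step S421: (preset video memory overhead - peak) weighted by the trade-off ratio.
    return (budget - peak) * ratio

def time_score(run_time, origin_time, ratio, standard=1.0):
    # Step S422: (standard parameter - trade-off ratio) weights the time difference.
    # Assumption: a second graph that runs slower than the first lowers the score.
    return (standard - ratio) * (origin_time - run_time)

def scoring_result(peak, run_time, origin_time, budget, ratio):
    # Step S423: the scoring result is the sum of both partial scores.
    return (memory_score(peak, budget, ratio)
            + time_score(run_time, origin_time, ratio))

# Graph A fits the budget but is slightly slower; graph B is over budget.
a = scoring_result(peak=1.5, run_time=10.5, origin_time=10.0, budget=2.0, ratio=0.5)
b = scoring_result(peak=2.5, run_time=10.0, origin_time=10.0, budget=2.0, ratio=0.5)
print(a, b)  # 0.0 -0.25
```

With a trade-off ratio of 0.5, half a unit of saved video memory outweighs half a unit of extra running time here, so graph A ranks ahead of graph B.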
  • Step S403 sort the second computation graphs in the at least one second computation graph based on the scoring result of each second computation graph, to obtain a sorted queue.
  • The second calculation graphs in the at least one second calculation graph are sorted according to their scoring results: they can be sorted from large to small to obtain the sort queue, or sorted from small to large to obtain the sort queue.
  • As shown in FIG. 3B, if the score of the second calculation graph 32 is greater than the score of the second calculation graph 33, the second calculation graph 32 is placed in front and the second calculation graph 33 after it.
  • Step S404 based on the sorting queue, determine the target computation graph in the at least one second computation graph.
  • The sorting queue is arranged according to the scores of the second computing graphs. Following the order of the sorting queue, it is first checked whether the video memory overhead of the second computing graph with the highest score satisfies the preset video memory overhead. If it does, that second computing graph is used as the target computing graph. If not, the second computing graph is analyzed for target operation data to generate an optimization scheme for it; the second computing graph is adjusted based on the optimization scheme to generate a third calculation graph, the third calculation graph is placed in the sorting queue according to its scoring result, and the search continues by checking whether the video memory overhead of the highest-scoring calculation graph in the updated queue meets the preset video memory overhead. Finally, when the number of searches reaches the upper limit and no calculation graph whose video memory overhead meets the preset video memory overhead has been found, the calculation graph with the highest score in the current queue is used as the target computation graph. In this way
  • The video memory space required by the second computing graphs is searched in the order in which they are arranged in the queue, so as to find a target computing graph whose video memory overhead meets the preset video memory overhead; that is, step S404 can be implemented in the following ways:
  • Method 1: Check the video memory space of the second calculation graph with the best scoring result in the queue to determine whether its video memory overhead meets the preset video memory overhead, including the following steps S441 and S442 (not shown in the figure):
  • Step S441 in the arrangement queue, search for the first candidate computation graph with the best scoring result.
  • the element arranged at the head of the queue is the first candidate computation graph with the best scoring result.
  • Step S442 in response to the found video memory space required by the first candidate computation graph meeting the preset video memory overhead, determine the first candidate computation graph as the target computation graph.
  • the video memory space required by the first candidate computation graph is determined, it is judged whether the video memory space satisfies the preset video memory overhead, that is, it is judged whether the first candidate computation graph can be run normally within the preset video memory overhead. If the video memory space required by the first candidate computation graph satisfies the preset video memory overhead, it means that the video memory space required by the first candidate computation graph is within the preset video memory overhead range, that is, the video memory space corresponding to the first candidate computation graph can Complete the training of the preset network model. Furthermore, the first candidate computation graph is used as a target computation graph.
  • Method 2: When the video memory overhead of the second computing graph with the best scoring result does not meet the preset video memory overhead, update the queue by analyzing the optimization scheme of that second computing graph, continue to search the updated queue for the calculation graph with the best scoring result, and judge whether its video memory overhead satisfies the preset video memory overhead, including the following steps S443 to S446 (not shown in the figure):
  • Step S443, in response to the video memory space required by the first candidate computation graph not meeting the preset video memory overhead, adjust the first candidate computation graph based on the target operation data of the first candidate computation graph to obtain at least one third computation graph.
  • the target operator corresponding to the target operation data is analyzed in the first candidate computation graph, so that an optimized third computation graph is generated by moving the target operator in the first candidate computation graph.
  • If the first candidate computation graph is the second computation graph 32 in FIG. 3B, analysis of the second computation graph 32 determines that it includes three target operators; each target operator in the second computation graph 32 is moved in turn to generate three third calculation graphs, shown in FIG. 4B as the third calculation graphs 41, 42 and 43.
  • Step S444 updating the permutation queue based on the scoring result of the at least one third computation graph, to obtain an updated permutation queue.
  • the first candidate computation graph is popped up in the arrangement queue, and then, according to the manner of step S401 and step S402, the scoring result of each third computation graph is determined. Finally, according to the scoring result of each third calculation graph combined with the scoring result of each second calculation graph in the arrangement queue, at least one third calculation graph is inserted into the arrangement queue to obtain an updated arrangement queue. As shown in Figure 4B, since the second calculation graph 32 has been popped up, only the second calculation graph 33 is left in the current queue.
  • Among the scoring results of the third calculation graphs 41, 42 and 43, the score of the third calculation graph 41 is greater than that of the second calculation graph 33, the scores of the third calculation graphs 42 and 43 are both smaller than that of the second calculation graph 33, and the score of the third calculation graph 42 is greater than that of the third calculation graph 43. The updated queue is therefore as shown in FIG. 4B, in descending order of scores: the third calculation graph 41, the second calculation graph 33, the third calculation graph 42 and the third calculation graph 43.
  • Step S445 in the updated queue, search whether the video memory space required by the second candidate computation graph with the best scoring result satisfies the preset video memory overhead.
  • In the updated queue, after the video memory space required by the second candidate computation graph is determined, it is judged whether that space meets the preset video memory overhead. If it does not (for example, the video memory space required by the second candidate computation graph is greater than the preset video memory overhead), a new computation graph continues to be generated based on the optimization scheme of the second candidate computation graph, and the updated queue is updated again according to the scoring result of the new computation graph.
  • Step S446, in response to the video memory space required by the second candidate computation graph meeting the preset video memory overhead, determine the second candidate computation graph as the target computation graph. In this way, following the queue order, when the video memory space required by the calculation graph with the best scoring result does not meet the preset video memory overhead, the calculation graph continues to be optimized, and it is checked whether the video memory space of the calculation graph with the best score in the latest queue meets the preset video memory overhead, so that after multiple searches the video memory overhead of the target calculation graph found can be made better.
  • Method 3: When the video memory overhead of the calculation graph with the best scoring result does not meet the preset video memory overhead and the number of searches reaches the set threshold, the calculation graph with the best scoring result in the latest queue is used as the target calculation graph, including the following step S447 (not shown in the figure):
  • Step S447, in response to the video memory space required by the second candidate computation graph not meeting the preset video memory overhead and the number of searches reaching the preset times threshold, determine the computation graph with the best scoring result in the queue corresponding to the last search as the target computation graph.
  • The preset times threshold may be set based on the number of computation graphs in the queue; for example, the preset times threshold is set to be less than half of the number of computation graphs in the queue. If the video memory space required by the second candidate computation graph does not meet the preset video memory overhead, and after the queue is updated based on this second candidate computation graph the video memory space required by the best-scoring computation graph found in the next search still does not meet the preset video memory overhead, then when the number of searches reaches the preset times threshold, the computation graph with the best score in the latest queue is determined as the target computation graph. In this way, when no computation graph that meets the preset video memory overhead can be found, the computation graph with the best score in the latest queue is used as the target computation graph, so that the target computation graph found is the computation graph with the optimal space overhead.
  • The multiple second calculation graphs are sorted, and the target calculation graph is searched for according to the sorted queue, so that the target calculation graph found is the calculation graph with the optimal space overhead.
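Methods 1 to 3 can be combined into one search loop. The following is a sketch using Python's standard `heapq` module; the function names, the graph labels, and the scores and peaks (chosen to reproduce the ordering described for FIG. 4B) are assumptions for illustration, and a larger score is treated as better.

```python
import heapq
from itertools import count

def search_target(candidates, budget, expand, max_searches):
    """candidates: iterable of (score, peak, graph); a larger score is better."""
    tie = count()  # tie-breaker so that graph objects are never compared
    heap = [(-score, next(tie), peak, graph) for score, peak, graph in candidates]
    heapq.heapify(heap)
    for _ in range(max_searches):
        if not heap:
            break
        _neg, _k, peak, graph = heap[0]
        if peak <= budget:            # Method 1: best-scoring graph fits the budget
            return graph
        heapq.heappop(heap)           # Method 2: pop it, optimize it, re-insert results
        for score, new_peak, new_graph in expand(graph):
            heapq.heappush(heap, (-score, next(tie), new_peak, new_graph))
    # Method 3: search limit reached, fall back to the best graph in the latest queue
    return heap[0][3] if heap else None

def expand(graph):
    # Hypothetical third computation graphs derived from second graph "g32".
    thirds = {"g32": [(0.7, 3.5, "g41"), (0.4, 4.2, "g42"), (0.3, 3.0, "g43")]}
    return thirds.get(graph, [])

seconds = [(0.9, 5.0, "g32"), (0.5, 4.5, "g33")]
found = search_target(seconds, budget=4.0, expand=expand, max_searches=5)
fallback = search_target(seconds, budget=1.0, expand=expand, max_searches=2)
print(found, fallback)  # g41 g33
```

With a budget of 4.0 the search stops at `g41` after one expansion; with an unreachable budget and a limit of two searches it falls back to the best graph left in the queue.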
  • the memory space and running time required by the preset network model can be determined by determining the target calculation graph of the preset network model, that is, the above step S105 can be achieved through the following process :
  • the video memory space required by the target computation graph is determined as the video memory space required for training the preset network model.
  • The video memory space required by the target computing graph and its running time are obtained, and are used as the estimated video memory space and running time required for training the preset network model, so that the video memory space required for training the network model can be optimized without sacrificing the running time required for training the network model.
  • When performing ImageNet training of ResNeSt269 (including 100 million network parameters), the video memory usage already approaches the 32 gigabyte (GB) upper limit of a V100, with training occupying 28 GB.
  • If the model is further enlarged or the batch size is increased, the video memory usage of model training increases accordingly, until the video memory occupied exceeds the video memory capacity of the graphics card and hits the video memory wall, making the model impossible to train.
  • FIG. 5 is a schematic diagram of the video memory usage curves of a calculation graph provided by the embodiment of the present disclosure, where the abscissa indicates the execution sequence of the operators in the calculation graph, and the ordinate indicates the video memory occupied during operator execution.
  • Curve 501 represents the memory application amount, and curve 502 represents the memory cache amount at different moments during the execution of a task in the calculation graph. It can be seen from curve 502 that the calculation graph reaches the peak point 503 at the end of the feed-forward phase. The peak point 503 exceeds the peak value of curve 501, that is, the video memory occupied by the calculation graph is higher than the video memory capacity of the graphics card, so that model training cannot continue.
  • memory optimization is particularly important.
  • the video memory optimization method based on computational graph analysis is one of them.
  • operators are often simply moved to initially reduce video memory usage.
  • this kind of method is only used as a preliminary video memory optimization, and the focus is placed on subsequent further video memory optimization methods.
  • This type of method can only optimize a small amount of video memory usage, and lacks a complete optimization system.
  • an embodiment of the present disclosure provides a video memory optimization method.
  • A network model is converted into a calculation graph through a conventional deep learning framework (TensorFlow, PyTorch, etc.) and the Parrots framework.
  • the large training task can be disassembled into individual operators (Task), each operator will use the original data (operator input) and generate new data (operator output).
  • the calculation graph also shows the space occupied by each operand related to the operator and the time required for the operator to perform calculations, so that the memory usage can be optimized by analyzing the calculation graph.
  • Figure 6 is a schematic flow diagram of the implementation process of the video memory optimization method provided by the embodiment of the present disclosure, and the following description is made in conjunction with the steps shown in Figure 6:
  • Step S601 read the calculation graph information from the JSON file.
  • First, a machine learning framework such as TensorFlow, PyTorch, or Parrots is used to export the calculation graph information of the network model as a JSON file, from which the information is then read.
  • Step S602 based on the read calculation graph information, generate a corresponding calculation graph object.
  • a new computation graph object is generated, the information read from JSON is obtained, the operator queue is arranged in order, and the input and output of the operator are matched with the operands in sequence.
  • Step S603 analyzing whether the current calculation graph object meets the memory space cost.
  • Analyzing the maximum space overhead of the current calculation graph object and the calculation time spent includes: obtaining the topology and video memory overhead of the calculation graph by traversing all operators and their inputs and outputs. If the computation graph object already meets the space overhead, go to step S604; no optimization is needed. If the computation graph object does not meet the space overhead, go to step S605.
  • Step S605 based on the topological structure of the computation graph object, search for a position that can be optimized, and add the new computation graph object generated based on the position into the priority queue.
  • the implementation process of finding the location that can be optimized is as follows: first, find out the time point when the memory reaches the peak value; then, using the peak value as the dividing point, find the data that was generated before the peak value but used after the peak value, Where this data resides is where it can be optimized.
  • the operator corresponding to the generated data may be moved behind the peak value to achieve the purpose of reducing the peak value. All operators satisfying this condition are regarded as the optimized scheme of the current computation graph object. Combine the optimization scheme of the current calculation graph object to generate a series of new calculation graph objects, and add the generated new calculation graph objects as elements to the priority queue.
  • the queue is formed based on the scores of new computation graph objects, and the following formula is used to determine the score Score of each computation graph:
  • Score = MEMORY_FACTOR * (peak_memory - limit) / limit + (1 - MEMORY_FACTOR) * (total_time - origin_time) / origin_time;
  • peak_memory indicates the peak memory usage of the calculation graph
  • limit indicates the memory overhead budget we set
  • total_time indicates the execution time corresponding to the calculation graph
  • origin_time indicates the execution time corresponding to the initial calculation graph.
  • MEMORY_FACTOR is a parameter that weighs the proportion of time and space in the score.
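The Score formula translates directly into code. In this sketch the value of `MEMORY_FACTOR` and the sample inputs are assumptions; the text only states that the parameter balances time and space.

```python
MEMORY_FACTOR = 0.5  # assumed value; the text does not fix it

def score(peak_memory, limit, total_time, origin_time,
          memory_factor=MEMORY_FACTOR):
    """Score of a computation graph, term by term as in the formula above."""
    memory_term = (peak_memory - limit) / limit            # relative excess over budget
    time_term = (total_time - origin_time) / origin_time   # relative slowdown
    return memory_factor * memory_term + (1 - memory_factor) * time_term

# Hypothetical graphs: budget 2.0, initial running time 100.0.
initial = score(peak_memory=4.0, limit=2.0, total_time=100.0, origin_time=100.0)
optimized = score(peak_memory=2.0, limit=2.0, total_time=110.0, origin_time=100.0)
print(initial, optimized)
```

Both terms are normalized, so memory and time are compared as relative changes rather than in raw units. With the formula exactly as written, a lower peak or a shorter running time gives a smaller value, so an implementation has to decide which direction the priority queue treats as best.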
  • Step S606 integrating the first element of the priority queue into a new computation graph object, and judging whether the new computation graph object meets the space overhead.
  • the first element of the priority queue (that is, the computation graph object with the best score) is popped out, and it is judged whether it satisfies the space cost, and if so, proceeds to step S607. If not, return to step S605 to continue the next search. If the searched computation graph object still does not meet the space cost and the number of searches reaches the preset upper limit, go to step S608.
  • Step S607 take the computation graph object satisfying the condition as an output and save it as a JSON file.
  • Step S608 the search is terminated, and the first element of the current priority queue is used as the optimal calculation graph.
  • The above steps S601 to S608 provide a priority-queue-based backtracking search strategy for finding the optimal calculation graph. First, taking the time when the video memory reaches its peak as the boundary, data that was generated before the peak but used after the peak is searched for, and the position of the operator corresponding to such data is taken as a position that can be optimized. Second, a score is calculated for the calculation graph optimized at that position, and the graph is added to the priority queue for searching according to its score. Peak transfer is also taken into account: every time a new calculation graph is generated from an optimization scheme, its peak video memory usage is analyzed again. Finally, if no calculation graph that meets the requirement is found, the calculation graph with the best score found so far is returned.
  • When a user trains a model in practice, the calculation graph of the model may be analyzed and optimized using the video memory optimization method provided by the embodiment of the present disclosure. Users can gain a general understanding of the video memory space and time required by the model from the calculation graph. In the embodiment of the present disclosure, calculation graph optimization is completed before training starts; combined with further subsequent optimizations, the overall video memory saving can be considerable.
  • The initial calculation graph has a peak video memory usage of 3.38 gibibytes (GiB) and takes 129.03 milliseconds (ms), while the optimal calculation graph has a peak video memory usage of 1.72 GiB and takes 136.09 ms.
  • the optimization rate of video memory reached 49%.
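The quoted rate can be checked directly from the reported peak values and running times. The short Python calculation below only reuses the four numbers given in the text; the variable names are illustrative.

```python
# Figures reported in the text (peak video memory in GiB, running time in ms).
initial_peak_gib, initial_time_ms = 3.38, 129.03
optimized_peak_gib, optimized_time_ms = 1.72, 136.09

memory_saving = (initial_peak_gib - optimized_peak_gib) / initial_peak_gib
time_overhead = (optimized_time_ms - initial_time_ms) / initial_time_ms

print(f"memory saving: {memory_saving:.1%}")  # memory saving: 49.1%
print(f"time overhead: {time_overhead:.1%}")  # time overhead: 5.5%
```

In other words, the reported halving of peak video memory comes with a running-time increase of roughly 5.5%.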
  • the video memory optimization method provided by the embodiments of the present disclosure will further increase the video memory optimization effect. In this way, the memory size occupied by large-scale deep learning can be greatly reduced, and the cost of large-scale training can be greatly reduced in terms of space overhead.
  • the order in which the steps are written does not imply a strict execution order and does not constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible inner logic.
  • the embodiment of the present disclosure also provides a video memory optimization device corresponding to the video memory optimization method. Since the problem-solving principle of the device in the embodiment of the present disclosure is similar to that of the above-mentioned video memory optimization method, the implementation of the device may refer to the implementation of the method.
  • FIG. 7 is a schematic diagram of the structural composition of the video memory optimization device according to an embodiment of the present disclosure. As shown in FIG. 7 , the video memory optimization device 700 includes:
  • the first generation part 701 is configured to generate a first calculation graph based on a preset network model
  • the first determination part 702 is configured to determine the correlation between the peak value of the video memory of the first calculation graph and the running data
  • the second generation part 703 is configured to adjust the first calculation graph based on the association relationship, and generate at least one second calculation graph;
  • the second determining part 704 is configured to determine a target calculation graph in the at least one second calculation graph based on the peak value of the video memory and the running time of the at least one second calculation graph;
  • the third determining part 705 is configured to determine the video memory space required by the preset network model based on the target computation graph.
  • the first generating part 701 includes:
  • the first generation subpart is configured to generate calculation graph information in a data exchange format based on the preset network model
  • the second generation subpart is configured to generate a first computation graph matched by the computation graph information based on the operator queue in the computation graph information.
  • the first determining part 702 includes:
  • the first determination subpart is configured to determine the occurrence moment of the video memory peak in the first calculation graph
  • the second determination subpart is configured to determine the operation data of the operator in the first calculation graph
  • the third determination subpart is configured to determine the generation time of the operation data and the application time of the operation data in the first calculation graph, and to determine the timing relationship between the generation time and the application time, on the one hand, and the occurrence moment of the video memory peak, on the other, as the association relationship.
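One way to picture the three quantities named above (peak moment, generation time, application time) is to replay the operator queue and track which data is alive at each step. The Python sketch below is only an illustration: the `Op` record, its fields, and the rule that data stays alive from the step that produces it until its last consumer are assumptions, not details given in the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    out_bytes: int                               # size of the data this operator produces
    inputs: list = field(default_factory=list)   # names of operators whose output it reads


def lifetimes(ops):
    """Map each operator's output to (generation step, last application step)."""
    born = {op.name: i for i, op in enumerate(ops)}
    last_use = dict(born)  # data that is never read dies where it is born
    for i, op in enumerate(ops):
        for src in op.inputs:
            last_use[src] = max(last_use[src], i)
    return {name: (born[name], last_use[name]) for name in born}


def peak_step(ops):
    """Return (step, bytes) of the first step at which live memory is largest."""
    spans = lifetimes(ops)
    size = {op.name: op.out_bytes for op in ops}
    usage = [sum(size[n] for n, (b, e) in spans.items() if b <= t <= e)
             for t in range(len(ops))]
    step = max(range(len(ops)), key=usage.__getitem__)
    return step, usage[step]


ops = [Op("a", 4), Op("b", 1, ["a"]), Op("c", 8, ["b"]),
       Op("d", 1, ["c"]), Op("e", 1, ["a", "d"])]
print(lifetimes(ops)["a"])  # (0, 4)
print(peak_step(ops))       # (2, 13)
```

Here the output of `a` is generated at step 0 and last applied at step 4, so it straddles the peak at step 2.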
  • the second generating part 703 includes:
  • the fourth determination subpart is configured to determine, in the first calculation graph, the target operating data whose association relationship satisfies a preset condition
  • the first adjustment subpart is configured to adjust the first calculation graph based on the target operation data to generate the at least one second calculation graph.
  • the fourth determining subsection includes:
  • the first determining unit is configured to determine, among the operation data of the operators in the first calculation graph, the operation data whose generation time is before the occurrence time of the video memory peak value and whose application time is after the occurrence time of the video memory peak value as the target operation data satisfying the preset condition.
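The preset condition reduces to a comparison of three time points per piece of data. A minimal Python sketch follows, assuming time points are integer step indices and that "before" and "after" are strict; the function name `target_data` and the `spans` mapping are hypothetical.

```python
def target_data(spans, peak):
    """Names of data generated strictly before the peak and applied strictly after it.

    spans maps a data name to (generation step, application step).
    """
    return [name for name, (born, last_use) in spans.items() if born < peak < last_use]


spans = {"a": (0, 4), "b": (1, 2), "c": (2, 3), "d": (3, 4)}
print(target_data(spans, peak=2))  # ['a']
```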
  • the first adjustment subsection includes:
  • the second determination unit is configured to determine a target operator corresponding to the target operation data in the first calculation graph
  • the first adjustment unit is configured to adjust the target operator in the first computation graph based on the occurrence time of the video memory peak in the first computation graph to generate the at least one second computation graph.
  • the first adjustment unit is further configured to:
  • in the first computation graph, adjust the execution time of the target operator to after the occurrence time of the video memory peak, to generate the second computation graph.
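Moving a target operator past the peak can be sketched as a reordering of the operator queue. The Python below is an illustration under assumptions: the graph is reduced to an execution order plus a dependency map, the target is placed immediately after the operator at which the peak occurs, and the move is rejected if any operator would then run before one of its inputs. The names `move_after_peak`, `deps` and `peak_op` are hypothetical, and the target is assumed to differ from `peak_op`.

```python
def move_after_peak(order, deps, target, peak_op):
    """Return a new operator order with `target` placed right after `peak_op`.

    deps maps an operator to the operators whose output it reads.
    Returns None when the move would make an operator run before its input.
    """
    new_order = [op for op in order if op != target]
    new_order.insert(new_order.index(peak_op) + 1, target)
    position = {op: i for i, op in enumerate(new_order)}
    for op, sources in deps.items():
        if any(position[src] > position[op] for src in sources):
            return None  # a dependency would be violated
    return new_order


order = ["load", "w", "conv", "relu", "add"]
deps = {"conv": ["load"], "relu": ["conv"], "add": ["relu", "w"]}
print(move_after_peak(order, deps, "w", "conv"))
# ['load', 'conv', 'w', 'relu', 'add']
```

In this toy order, `w` is produced early but only consumed by `add`, so delaying it until after `conv` shortens the time its output occupies memory without breaking any dependency.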
  • the second determining part 704 includes:
  • the first acquisition subpart is configured to acquire a preset video memory overhead and a preset trade-off ratio; wherein the preset trade-off ratio is used to weigh the ratio between the running time of the calculation graph and the required video memory;
  • the first scoring subpart is configured to score the video memory peak value and running time of each second computing graph, together with the running time of the corresponding first computing graph, based on the preset video memory overhead and the preset trade-off ratio, to obtain the scoring result of each second computing graph;
  • the first sorting subpart is configured to sort the second calculation graphs in the at least one second calculation graph based on the scoring results of each second calculation graph to obtain a sorting queue;
  • the fifth determining subpart is configured to determine the target computation graph in the at least one second computation graph based on the sort queue.
  • the first scoring subsection includes:
  • the third determining unit is configured to determine the video memory score of each second computing graph based on the video memory peak value of each second computing graph, the preset video memory overhead, and the preset trade-off ratio;
  • the first scoring unit is configured to determine the running time score of each second computing graph based on the preset trade-off ratio, the running time of each second computing graph, and the running time of the corresponding first computing graph;
  • the fourth determination unit is configured to determine the scoring result of each second computing graph based on the video memory score and the running time score of each second computing graph.
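The text names the inputs of the two scores but does not state a formula. The Python sketch below therefore uses an assumed formula purely for illustration: the memory score is the relative excess over the preset overhead, the running time score is the relative slowdown against the first graph, the trade-off ratio in [0, 1] weights the two, and a lower total is better. None of these choices is taken from the disclosure.

```python
def score_graph(peak_bytes, runtime_ms, base_runtime_ms, budget_bytes, ratio):
    """Combined score of a candidate graph (assumed formula; lower is better).

    ratio in [0, 1] weights video memory against running time.
    """
    # Memory score: how far the peak exceeds the preset overhead, relative to it.
    memory_score = ratio * max(0.0, peak_bytes - budget_bytes) / budget_bytes
    # Running time score: relative slowdown against the original graph.
    runtime_score = (1.0 - ratio) * (runtime_ms - base_runtime_ms) / base_runtime_ms
    return memory_score + runtime_score


# A graph at twice the budget and 10% slower, with equal weighting.
print(score_graph(2.0, 110.0, 100.0, budget_bytes=1.0, ratio=0.5))
```

Under this assumed formula, a candidate that fits the budget and runs no slower than the first graph scores 0, and any memory excess or slowdown raises the score.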
  • the fifth determining subsection includes:
  • the first search unit is configured to search the sorting queue for the first candidate computation graph with the best scoring result
  • the fifth determining unit is configured to determine the first candidate computation graph as the target computation graph in response to the searched video memory space required by the first candidate computation graph meeting the preset video memory overhead.
  • the fifth determining subsection includes:
  • the second adjustment unit is configured to, in response to the video memory space required by the first candidate computation graph not satisfying the preset video memory overhead, adjust the first candidate computation graph based on the target running data of the first candidate computation graph to obtain at least one third computation graph;
  • the first updating unit is configured to update the sorting queue based on the scoring result of the at least one third computation graph, to obtain an updated sorting queue;
  • the second search unit is configured to search, in the updated sorting queue, whether the video memory space required by the second candidate computation graph with the best scoring result satisfies the preset video memory overhead;
  • the sixth determining unit is configured to, in response to the video memory space required by the second candidate computation graph not satisfying the preset video memory overhead and the number of searches reaching a preset number threshold, determine the computation graph with the best scoring result in the sorting queue corresponding to the last search as the target computation graph.
  • the third determining part 705 is further configured to:
  • the video memory space required by the target computation graph is determined as the video memory space required for training the preset network model.
  • a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may also be a unit, and it may be modular or non-modular.
  • if the above video memory optimization method is implemented in the form of software function modules and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a terminal, a server, etc.) to execute all or part of the methods described in various embodiments of the present disclosure.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), a magnetic disk or an optical disk. As such, the disclosed embodiments are not limited to any specific combination of hardware and software.
  • the embodiment of the present disclosure further provides a computer program product, the computer program product includes computer-executable instructions, and after the computer-executable instructions are executed, the video memory optimization method provided in the embodiments of the present disclosure can be implemented.
  • the embodiments of the present disclosure further provide a computer storage medium, where computer executable instructions are stored on the computer storage medium, and when the computer executable instructions are executed by a processor, the video memory optimization method provided in the foregoing embodiments is implemented.
  • FIG. 8 is a schematic diagram of the composition and structure of a computer device in an embodiment of the present disclosure.
  • the computer device 800 includes: a processor 801, at least one communication bus, a communication interface 802, at least one external communication interface, and a memory 803.
  • the communication interface 802 is configured to realize connection and communication between these components.
  • the communication interface 802 may include a display screen, and the external communication interface may include a standard wired interface and a wireless interface.
  • the processor 801 is configured to execute a video memory optimization program in the memory, so as to implement the video memory optimization method provided in the foregoing embodiments.
  • the device involved in the embodiments of the present disclosure may be at least one of a system, a method, and a computer program product.
  • a computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Examples of computer-readable storage media include: portable computer disks, hard disks, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM) or flash memory, Static Random-Access Memory (SRAM), Portable Compact Disc Read-Only Memory (CD-ROM), Digital Video Discs (DVDs), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • the computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device over at least one of a network, such as the Internet, a local area network, a wide area network, and a wireless network.
  • the network may include at least one of copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and edge servers.
  • a network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
  • Computer program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it may be connected to an external computer (for example, via the Internet using an Internet service provider).
  • electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be customized by using state information of the computer-readable program instructions; such electronic circuits can execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division.
  • the mutual coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed to multiple network units; Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may serve as a single unit on its own, or two or more units may be integrated into one unit; the above-mentioned integrated unit can be realized in the form of hardware or in the form of hardware plus software functional units.
  • Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments can be completed by hardware related to program instructions; the aforementioned program can be stored in a computer-readable storage medium, and when executed, the program performs the steps of the foregoing method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as removable storage devices, read-only memories, magnetic disks or optical disks.
  • if the above-mentioned integrated units of the present disclosure are implemented in the form of software function modules and sold or used as independent products, they can also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in various embodiments of the present disclosure.
  • the aforementioned storage medium includes various media capable of storing program codes such as removable storage devices, ROMs, magnetic disks or optical disks.
  • Embodiments of the present disclosure provide a video memory optimization method, apparatus, device, storage medium, and program product, wherein the method includes: generating a first calculation graph based on a preset network model; determining the association relationship between the video memory peak value of the first calculation graph and the running data; adjusting the first calculation graph based on the association relationship to generate at least one second calculation graph; determining a target computation graph in the at least one second computation graph based on the video memory peak value and running time of the at least one second calculation graph; and determining, based on the target computation graph, the video memory space required by the preset network model.

Abstract

Embodiments of the present disclosure provide a video memory optimization method and apparatus, a device, a storage medium and a program product. The method comprises: generating a first calculation graph on the basis of a preset network model; determining an association relationship between a video memory peak value of the first calculation graph and operation data; adjusting the first calculation graph on the basis of the association relationship to generate at least one second calculation graph; determining a target calculation graph in the at least one second calculation graph on the basis of the video memory peak value and the operation duration of the at least one second calculation graph; and determining a video memory space required by the preset network model on the basis of the target calculation graph.

Description

A video memory optimization method, apparatus, device, storage medium and program product
Cross-Reference to Related Applications
This patent application claims priority to Chinese patent application No. 202111254294.7, filed on October 27, 2021 by Shanghai SenseTime Technology Development Co., Ltd. (上海商汤科技开发有限公司) and entitled "Video memory optimization method, apparatus, device and storage medium", the entirety of which is incorporated into the present disclosure by reference.
Technical Field
Embodiments of the present disclosure relate to the technical field of data processing, and relate to, but are not limited to, a video memory optimization method, apparatus, device, storage medium and program product.
Background
With the rapid development of deep learning, the training of large and even very large models with huge numbers of parameters has gradually come into view. As the models being trained grow larger and deeper, video memory overhead inevitably grows sharply. When the model is enlarged further, or the batch size is increased, the video memory occupied by model training grows accordingly and eventually exceeds the video memory capacity of the graphics card, so that the model cannot be trained.
Summary
Embodiments of the present disclosure provide a technical solution for video memory optimization.
The technical solution of the embodiments of the present disclosure is realized as follows:
An embodiment of the present disclosure provides a video memory optimization method, the method comprising:
generating a first computation graph based on a preset network model;
determining the association relationship between the video memory peak of the first computation graph and the running data;
adjusting the first computation graph based on the association relationship to generate at least one second computation graph;
determining a target computation graph in the at least one second computation graph based on the video memory peak and running time of the at least one second computation graph;
determining, based on the target computation graph, the video memory space required by the preset network model.
In some embodiments, generating the first computation graph based on the preset network model includes: generating computation graph information in a data exchange format based on the preset network model; and generating, based on the operator queue in the computation graph information, a first computation graph matching the computation graph information. In this way, by using a deep learning framework to turn the model to be trained into a computation graph in JSON format, the first computation graph is obtained; the first computation graph can then be adjusted through multiple optimization schemes to generate multiple second computation graphs, from which the computation graph with the best video memory cost can be selected.
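As an illustration of building a graph from JSON computation graph information, the Python sketch below parses a small document and derives the operator queue and the data-flow edges. The JSON layout (an `operators` array with `name`, `inputs` and `output_bytes` fields) is a hypothetical schema chosen for this example; the disclosure does not specify the actual format.

```python
import json

# Hypothetical computation graph information; the field names are assumptions.
GRAPH_JSON = """
{"operators": [
  {"name": "conv1", "inputs": [],        "output_bytes": 4096},
  {"name": "relu1", "inputs": ["conv1"], "output_bytes": 4096},
  {"name": "fc1",   "inputs": ["relu1"], "output_bytes": 512}
]}
"""


def build_graph(text):
    """Parse computation graph information and return (operator queue, edges)."""
    info = json.loads(text)
    # The operator queue is the execution order as listed in the document.
    queue = [op["name"] for op in info["operators"]]
    # Each edge (producer, consumer) records which operator reads which output.
    edges = [(src, op["name"]) for op in info["operators"] for src in op["inputs"]]
    return queue, edges


queue, edges = build_graph(GRAPH_JSON)
print(queue)  # ['conv1', 'relu1', 'fc1']
print(edges)  # [('conv1', 'relu1'), ('relu1', 'fc1')]
```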
In some embodiments, determining the association relationship between the video memory peak of the first computation graph and the running data includes: determining the occurrence time of the video memory peak in the first computation graph; determining the running data of the operators in the first computation graph; determining the generation time of the running data and the application time of the running data in the first computation graph; and determining the timing relationship between the generation time and the application time, on the one hand, and the occurrence time of the video memory peak, on the other, as the association relationship. In this way, by analyzing the timing relationship between the time at which an operator generates running data, the time at which that running data is applied, and the time at which the video memory peak is reached, it can be further determined whether the operator needs to be moved to lower the peak.
In some embodiments, adjusting the first computation graph based on the association relationship to generate at least one second computation graph includes: determining, in the first computation graph, the target running data whose association relationship satisfies a preset condition; and adjusting the first computation graph based on the target running data to generate the at least one second computation graph. In this way, the second computation graph is obtained by moving, in the first computation graph, the operator corresponding to the target running data, so that the peak of the second computation graph can be lowered.
In some embodiments, determining, in the first computation graph, the target running data whose association relationship satisfies the preset condition includes: determining, among the running data of the operators in the first computation graph, the running data whose generation time is before the occurrence time of the video memory peak and whose application time is after the occurrence time of the video memory peak as the target running data satisfying the preset condition.
In some embodiments, adjusting the first computation graph based on the target running data to generate the at least one second computation graph includes: determining, in the first computation graph, the target operator corresponding to the target running data; and adjusting the target operator in the first computation graph based on the occurrence time of the video memory peak in the first computation graph, to generate the at least one second computation graph. In this way, for the first computation graph, multiple second computation graphs can be generated by moving target operators according to the occurrence time of the video memory peak in that graph.
In some embodiments, adjusting the target operator in the first computation graph based on the occurrence time of the video memory peak in the first computation graph to generate the at least one second computation graph includes: adjusting, in the first computation graph, the execution time of the target operator to after the occurrence time of the video memory peak, to generate the second computation graph. In this way, moving the target operator to after the video memory peak lowers the peak of the newly generated computation graph, thereby optimizing the video memory space required by the second computation graph.
在一些实施例中,所述基于所述至少一个第二计算图的显存峰值和运行时长,在所述至少一个第二计算图中确定目标计算图,包括:获取预设显存开销和预设权衡比值;其中,所述预设权衡比值用于权衡计算图的运行时长和所需显存之间的比重;基于所述预设显存开销和预设权衡比值,对所述每一第二计算图的显存峰值、运行时长和所述对应的第一计算图的运行时长进行评分,得到所述每一第二计算图的评分结果;基于所述每一第二计算图的评分结果,对所述至少一个第二计算图中的第二计算图进行排序,得到排序队列;基于所述排序队列,在所述至少一个第二计算图中确定所述目标计算图。如此,能够通过队列搜索查找到显存开销最优的目标计算图。In some embodiments, the determining the target computing graph in the at least one second computing graph based on the peak video memory and running time of the at least one second computing graph includes: obtaining a preset video memory overhead and a preset trade-off Ratio; wherein, the preset trade-off ratio is used to weigh the proportion between the running time of the calculation graph and the required video memory; based on the preset video memory overhead and the preset trade-off ratio, the Scoring the peak value of video memory, running time and the running time of the corresponding first computing graph to obtain the scoring result of each second computing graph; based on the scoring result of each second computing graph, at least The second computation graphs in a second computation graph are sorted to obtain a sort queue; based on the sort queue, the target computation graph is determined in the at least one second computation graph. In this way, the target calculation graph with the optimal memory overhead can be found through queue search.
在一些实施例中,所述基于所述预设显存开销和预设权衡比值,对所述每一第二计算图的显存峰值、运行时长和所述对应的第一计算图的运行时长进行评分,得到所述每一第二计算图的评分结果,包括:基于所述每一第二计算图的显存峰值、所述预设显存开销和预设权衡比值,确定所述每一第二计算图的显存评分;基于所述预设权衡比值、所述每一第二计算图的运行时长和所述对应的第一计算图的运行时长,确定所述每一第二计算图的运行时长评分;基于所述每一第二计算图的显存评分和所述运行时长评分,确定所述每一第二计算图像的评分结果。如此,在对第二计算图进行评分的阶段,综合考虑第二计算图的运行时长和显存峰值,使最后的计算图在优化显存空间的同时不会牺牲大量时间成本。In some embodiments, based on the preset video memory overhead and the preset trade-off ratio, scoring the video memory peak value, running time of each second computing graph and the running time of the corresponding first computing graph , obtaining the scoring result of each second computation graph, including: determining each second computation graph based on the peak video memory value of each second computation graph, the preset video memory overhead, and the preset trade-off ratio The video memory score; based on the preset trade-off ratio, the runtime of each second computation graph and the runtime of the corresponding first computation graph, determine the runtime score of each second computation graph; A scoring result of each second computing image is determined based on the video memory score and the running time score of each second computing graph. In this way, in the stage of scoring the second computing graph, the running time of the second computing graph and the peak value of video memory are comprehensively considered, so that the final computing graph can optimize the video memory space without sacrificing a lot of time cost.
在一些实施例中,所述基于所述排序队列,在所述至少一个第二计算图中确定所述目标计算图,包括:在所述排列队列中,搜索评分结果最优的第一候选计算图;响应于搜索到的所述第一候选计算图所需的显存空间满足所述预设显存开销,确定所述第一候选计算图为所述目标计算图。如此,既能够尽可能地减少搜索次数,还能够使得搜索到的目标计算图的显存开销较小。In some embodiments, the determining the target computation graph in the at least one second computation graph based on the sorting queue includes: searching for the first candidate computation with the best scoring result in the ranking queue Graph; determining that the first candidate computation graph is the target computation graph in response to the found video memory space required by the first candidate computation graph meeting the preset display memory overhead. In this way, the number of searches can be reduced as much as possible, and the memory overhead of the searched target calculation graph can be reduced.
在一些实施例中,所述基于所述排序队列,在所述至少一个第二计算图中确定所述目标计算图,包括:响应于所述第一候选计算图所需的显存空间不满足所述预设显存开销,基于所述第一候选计算图的目标运行数据,对所述第一候选计算图进行调整,得到至少一个第三计算图;基于所述至少一个第三计算图的评分结果更新所述排列队列,得到已更新的排列队列;在所述已更新的排列队列中,搜索评分结果最优的第二候选计算图所需的显存空间是否满足所述预设显存开销;响应于所述第二候选计算图所需的显存空间不满足所述预设显存开销,且搜索次数达到预设次数阈值,确定末次搜索对应的排列队列中评分结果最优的计算图为所述目标计算图。如此,在搜索不到满足预设显存开销的计算图的情况下,将最新的排列队列中评分最优的计算图作为目标计算图,从而使得搜索到的目标计算图是空间开销最优的计算图。In some embodiments, the determining the target computation graph in the at least one second computation graph based on the sorting queue includes: responding to the fact that the video memory space required by the first candidate computation graph does not meet the required The preset video memory overhead, based on the target operating data of the first candidate computation graph, adjust the first candidate computation graph to obtain at least one third computation graph; based on the scoring result of the at least one third computation graph Updating the arrangement queue to obtain an updated arrangement queue; in the updated arrangement queue, searching whether the video memory space required by the second candidate calculation graph with the best scoring result satisfies the preset video memory overhead; in response The video memory space required by the second candidate calculation graph does not meet the preset video memory overhead, and the number of searches reaches the preset number threshold, and the calculation graph with the best scoring result in the queue corresponding to the last search is determined to be the target calculation graph picture. In this way, if the calculation graph that meets the preset memory overhead cannot be searched, the calculation graph with the best score in the latest arrangement queue is used as the target calculation graph, so that the searched target calculation graph is the calculation graph with the optimal space cost. picture.
In some embodiments, determining the video memory space required by the preset network model based on the target computation graph includes: determining the video memory space required by the target computation graph to be the video memory space required for training the preset network model. In this way, the video memory space required for training the network model can be optimized without sacrificing the runtime required for training the network model.
An embodiment of the present disclosure provides a video memory optimization apparatus, the apparatus including:
a first generation module, configured to generate a first computation graph based on a preset network model;
a first determination module, configured to determine an association relationship between a video memory peak of the first computation graph and runtime data;
a second generation module, configured to adjust the first computation graph based on the association relationship to generate at least one second computation graph;
a second determination module, configured to determine a target computation graph in the at least one second computation graph based on the video memory peak and the runtime of the at least one second computation graph; and
a third determination module, configured to determine, based on the target computation graph, the video memory space required by the preset network model.
An embodiment of the present disclosure provides a computer storage medium storing computer-executable instructions which, when executed, implement the video memory optimization method described above.
An embodiment of the present disclosure provides a computer device including a memory and a processor, the memory storing computer-executable instructions, and the processor implementing the video memory optimization method described above when running the computer-executable instructions in the memory.
An embodiment of the present disclosure further provides a computer program product including computer-readable code; when the computer-readable code runs in an electronic device, a processor of the electronic device executes the video memory optimization method of any of the above embodiments.
Embodiments of the present disclosure provide a video memory optimization method, apparatus, device, storage medium, and program product. For an acquired preset network model, a first computation graph representing the running process of the network model is first generated, and the association relationship between the video memory peak required to run the first computation graph and the runtime data of each operator is analyzed, so that at least one second computation graph can be generated by optimizing the first computation graph. Then, the target computation graph is searched for among the multiple second computation graphs by jointly considering the video memory peak of each second computation graph and the runtime required to run it; the target computation graph found in this way optimizes the video memory space while also taking runtime into account. Finally, the video memory space required by the preset network model is optimized by means of the target computation graph. Because the search for the target computation graph with the best video memory cost among the generated second computation graphs also factors in the time overhead of each computation graph, the final target computation graph satisfies both the space budget and the time budget, and the video memory space is optimized without sacrificing a large amount of time.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings used in the embodiments are briefly introduced below. The drawings are incorporated into and form part of this specification; they show embodiments consistent with the present disclosure and are used together with the specification to explain its technical solutions. It should be understood that the following drawings show only some embodiments of the present disclosure and therefore should not be regarded as limiting its scope. A person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
FIG. 1 is a schematic flowchart of an implementation of the video memory optimization method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of another implementation of the video memory optimization method provided by an embodiment of the present disclosure;
FIG. 3A is a schematic flowchart of another implementation of the video memory optimization method provided by an embodiment of the present disclosure;
FIG. 3B is a schematic diagram of an application scenario of the video memory optimization method provided by an embodiment of the present disclosure;
FIG. 4A is a schematic flowchart of yet another implementation of the video memory optimization method provided by an embodiment of the present disclosure;
FIG. 4B is a schematic diagram of another application scenario of the video memory optimization method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the curve of video memory occupied by a computation graph provided by an embodiment of the present disclosure;
FIG. 6 is a schematic flowchart of an implementation of the video memory optimization method provided by an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a video memory optimization apparatus according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the specific technical solutions of the invention are described in further detail below with reference to the drawings in the embodiments of the present disclosure. The following embodiments illustrate the present disclosure but do not limit its scope.
In the following description, "some embodiments" describes a subset of all possible embodiments. It should be understood that "some embodiments" may refer to the same subset or to different subsets of all possible embodiments, and that these may be combined with one another where no conflict arises.
In the following description, the terms "first", "second", and "third" merely distinguish similar objects and do not denote a particular ordering of the objects. Where permitted, "first", "second", and "third" may be interchanged in a specific order or sequence, so that the embodiments of the present disclosure described here can be implemented in an order other than the one illustrated or described.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by a person skilled in the technical field of the present disclosure. The terms used herein are only for the purpose of describing the embodiments of the present disclosure and are not intended to limit the present disclosure.
Before the embodiments of the present disclosure are described in further detail, the nouns and terms involved in them are explained; the following explanations apply to these nouns and terms.
(1) Computation graph: a graphical representation of a computation process. A computation graph is a "language" for describing equations; being a graph, it has nodes (for example, variables) and edges (operations, for example simple functions). In deep learning, any neural network model can in essence be represented by a computation graph, and its training process can be divided into three parts: forward propagation, back propagation, and parameter update.
(2) Video memory, also called the frame buffer: stores data that the graphics chip has processed or rendering data that is about to be fetched. Like a computer's main memory, video memory is the component that stores the graphics information to be processed.
An exemplary application of the video memory optimization device provided by the embodiments of the present disclosure is described below. The device may be implemented as any of various types of user terminal with an image acquisition function, such as a notebook computer, a tablet computer, a desktop computer, a camera, or a mobile device (for example, a personal digital assistant, a dedicated messaging device, or a portable game device), and may also be implemented as a server. Exemplary applications in which the device is implemented as a terminal or a server are described below.
The method can be applied to a computer device, and the functions implemented by the method can be realized by a processor in the computer device calling program code. The program code can be stored in a computer storage medium; the computer device therefore includes at least a processor and a storage medium.
An embodiment of the present disclosure provides a video memory optimization method, which is described below with reference to the steps shown in FIG. 1:
Step S101: generate a first computation graph based on a preset network model.
In some embodiments, the preset network model may be any type of network model to be trained, for example a deep neural network model, a residual network model, or any large-scale neural network model to be trained. The first computation graph is a graphical representation of the computation process of the preset network model and consists of connected nodes and edges. The nodes represent the operators that execute tasks in the computation graph, and the edges connect the operators in the order in which the tasks of the network model are executed.
In some possible implementations, the preset network model is input into a deep learning framework (for example, TensorFlow, PyTorch, or Parrots), which produces computation graph information for the network model in JavaScript Object Notation (JSON) format. The JSON computation graph information is read in to arrange an operator queue, and the input data and output data of each operator are matched in turn against the operands, which yields a first computation graph. In this way, one computation graph can be generated from the JSON computation graph information produced by each deep learning framework. In a specific example, the network model is an image recognition network to be trained that includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and so on. First, the image recognition network is input into different deep learning frameworks to extract JSON computation graph information, giving the JSON computation graph information corresponding to each framework; in this information, the input layer, convolutional layer, pooling layer, and fully connected layer are each represented as operators that execute different tasks. Then, following the execution order of these operators, the operators are connected by operation edges to obtain the first computation graph representing the image recognition network.
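As a rough illustration of turning JSON computation graph information into nodes and edges, the sketch below matches each operator's inputs against the outputs of earlier operators. The JSON layout (`ops`, `name`, `inputs`, `outputs`) is a hypothetical schema invented for this example; the source does not specify the format that any framework actually emits.

```python
import json
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (producer operator, consumer operator, tensor)


def build_graph(json_text: str) -> Tuple[List[str], List[Edge]]:
    """Build an operator queue and the edges that connect the operators."""
    info = json.loads(json_text)
    ops = info["ops"]                      # operators in execution order

    # Record which operator produces each tensor (operand).
    producer: Dict[str, str] = {}
    for op in ops:
        for tensor in op["outputs"]:
            producer[tensor] = op["name"]

    # Connect every consumer to the producer of each of its inputs.
    edges: List[Edge] = []
    for op in ops:
        for tensor in op["inputs"]:
            if tensor in producer:
                edges.append((producer[tensor], op["name"], tensor))

    return [op["name"] for op in ops], edges


# Hypothetical graph information for the image recognition example.
EXAMPLE = json.dumps({
    "ops": [
        {"name": "input", "inputs": [], "outputs": ["x0"]},
        {"name": "conv", "inputs": ["x0"], "outputs": ["x1"]},
        {"name": "pool", "inputs": ["x1"], "outputs": ["x2"]},
        {"name": "fc", "inputs": ["x2"], "outputs": ["y"]},
    ]
})
```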
Step S102: determine the association relationship between the video memory peak of the first computation graph and the runtime data.
In some embodiments, the video memory peak of the first computation graph is the peak of the video memory space that the first computation graph requires while running. By traversing the first computation graph, it is possible to determine the video memory space required to run all of its operators in sequence, as well as the video memory peak that occurs over the whole run. The runtime data is the runtime data of the first computation graph, which includes the data produced by each operator while the first computation graph is traversed. In this step, traversing the first computation graph yields the peak video memory it requires and the runtime data of each operator, so that the timing relationship between the moment at which the video memory peak occurs and the moments at which each operator's runtime data is generated and applied can be analyzed. This timing relationship is taken as the association relationship between the video memory peak of the first computation graph and the runtime data. For example, for some operators both the generation time and the application time of the runtime data fall after the video memory peak; for some operators both fall before the video memory peak; and for some operators the generation time falls before the video memory peak while the application time falls after it.
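The three timing cases listed above can be illustrated with a small simulation. This is a simplified sketch under stated assumptions: each piece of runtime data is modelled as a tuple (name, size, generation step, last-use step), it occupies video memory from its generation step through its last use, and the sizes and step indices are made up for illustration.

```python
from typing import Dict, List, Tuple

# (name, size, generation step, last-use step); all values are hypothetical.
Tensor = Tuple[str, int, int, int]


def memory_profile(tensors: List[Tensor], num_steps: int) -> List[int]:
    """Video memory in use at each step of a sequential run."""
    usage = [0] * num_steps
    for _name, size, gen, use in tensors:
        for step in range(gen, use + 1):   # live from generation to last use
            usage[step] += size
    return usage


def classify(tensors: List[Tensor], peak_step: int) -> Dict[str, str]:
    """Timing relationship of each tensor to the video memory peak."""
    relation = {}
    for name, _size, gen, use in tensors:
        if gen < peak_step < use:
            relation[name] = "spans_peak"    # generated before, applied after
        elif use < peak_step:
            relation[name] = "before_peak"   # generated and applied before
        elif gen > peak_step:
            relation[name] = "after_peak"    # generated and applied after
        else:
            relation[name] = "at_peak"       # generated or applied at the peak
    return relation


TENSORS = [("a", 4, 0, 1), ("b", 2, 0, 4), ("c", 8, 2, 2), ("d", 1, 3, 4)]
usage = memory_profile(TENSORS, 5)
peak_step = usage.index(max(usage))          # first step at which the peak occurs
relations = classify(TENSORS, peak_step)
```

In this toy run only `b` is generated before the peak and applied after it, so it is the only data whose producing operator is a candidate for being moved.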
Step S103: adjust the first computation graph based on the association relationship to generate at least one second computation graph.
In some embodiments, at least one optimization scheme for the first computation graph is determined according to the timing relationship between the moment at which the video memory peak occurs in the first computation graph and the moments at which an operator's runtime data is generated and applied; the first computation graph is then adjusted based on each optimization scheme to generate the second computation graph corresponding to that scheme. In some possible implementations, the first computation graph is adjusted using any one of its optimization schemes, and the adjusted first computation graph is taken as a second computation graph; that is, a second computation graph is obtained by adjusting the first computation graph with that optimization scheme. Thus, when multiple optimization schemes have been determined for the first computation graph, adjusting the first computation graph with each of them in turn yields multiple second computation graphs.
In some possible implementations, the first computation graph is adjusted by analyzing the association relationship between the video memory peak and the runtime data. If the association relationship satisfies a preset condition, that is, if the runtime data of an operator in the first computation graph is generated before the video memory peak occurs but is applied only after the video memory peak occurs, the execution time of that operator in the first computation graph is moved to after the video memory peak; this adjusts the first computation graph and yields a second computation graph. In other words, the operators corresponding to target runtime data are selected from the first computation graph, where target runtime data is runtime data whose generation time is before the video memory peak and whose application time is after it; moving the operators corresponding to the target runtime data within the first computation graph yields at least one second computation graph.
In some possible implementations, the preset condition is set according to the order of the generation time of an operator's runtime data, the application time of that runtime data, and the moment at which the video memory peak occurs. For example, the preset condition may be that the operator's runtime data is generated before the video memory peak occurs and applied after the video memory peak occurs. The runtime data of the operators in the first computation graph is then filtered for target runtime data that satisfies this condition. Each piece of target runtime data is treated as one optimization scheme for the first computation graph, and moving the operators corresponding to the target runtime data generates at least one second computation graph.
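A minimal sketch of the adjustment, assuming the computation graph is represented only by its execution order (a list of operator names) and that each target operator is moved to run immediately after the operator at which the peak occurs. Both the list representation and the exact insertion point are assumptions of this example; the source states only that the operator is moved to after the video memory peak.

```python
from typing import List


def move_after_peak(schedule: List[str], target_op: str, peak_op: str) -> List[str]:
    """Return a new execution order with target_op placed right after peak_op."""
    reordered = [op for op in schedule if op != target_op]
    peak_index = reordered.index(peak_op)
    reordered.insert(peak_index + 1, target_op)
    return reordered


def generate_second_schedules(
    schedule: List[str], target_ops: List[str], peak_op: str
) -> List[List[str]]:
    """One adjusted schedule per target operator (one optimization scheme each)."""
    return [move_after_peak(schedule, op, peak_op) for op in target_ops]


first_schedule = ["a", "b", "c", "d", "e"]       # hypothetical operator order
second_schedules = generate_second_schedules(first_schedule, ["a", "b"], peak_op="c")
```

The move is dependency-safe only because target runtime data is, by definition, not applied before the peak; a fuller implementation would also verify that the moved operator's own inputs are still available at its new position.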
Step S104: determine a target computation graph in the at least one second computation graph based on the video memory peak and the runtime of the at least one second computation graph.
In some embodiments, traversing a second computation graph yields the peak of the video memory space it requires, that is, the video memory peak of the second computation graph, and the time required to run it, that is, the runtime of the second computation graph. Taking the runtime of each second computation graph into account, the multiple second computation graphs are searched for a computation graph whose video memory overhead satisfies the preset video memory space, which gives the target computation graph.
In some possible implementations, each second computation graph is run again to obtain its video memory peak and runtime. A configured video memory overhead budget and a configured parameter that weighs runtime against video memory space are obtained and combined with the video memory peak and runtime of the newly generated second computation graph and the runtime of the original computation graph from which it was derived, to determine a score for the second computation graph. According to the score of each second computation graph, the multiple second computation graphs are searched either for a target computation graph whose video memory overhead satisfies the preset video memory space or for the target computation graph with the best video memory overhead.
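The passage names the inputs to the score (the memory budget, a trade-off parameter, the new graph's peak and runtime, and the original graph's runtime) but not the formula. The sketch below shows one hypothetical way to combine them, purely for illustration: `alpha` weighs relative runtime against relative budget overshoot, and a lower score is better. None of this is taken from the source.

```python
def score_graph(
    peak: float,
    runtime: float,
    base_runtime: float,
    budget: float,
    alpha: float = 0.5,
) -> float:
    """Hypothetical score: lower is better.

    alpha weighs the slowdown relative to the original graph against
    the fraction by which the peak exceeds the memory budget.
    """
    slowdown = runtime / base_runtime
    overshoot = max(0.0, peak - budget) / budget
    return alpha * slowdown + (1.0 - alpha) * overshoot


# Hypothetical second graphs: (name, video memory peak, runtime).
graphs = [("g1", 150.0, 1.0), ("g2", 80.0, 1.1), ("g3", 120.0, 1.0)]
ranked = sorted(
    graphs,
    key=lambda g: score_graph(g[1], g[2], base_runtime=1.0, budget=100.0),
)
```

With these made-up numbers, the graph that fits the budget ranks first despite a 10% slowdown, because the other two pay an overshoot penalty.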
Step S105: determine, based on the target computation graph, the video memory space required by the preset network model.
In some embodiments, running the target computation graph makes it possible to estimate the video memory space and the time required to train the preset network model. The runtime of the target computation graph is taken as the time for training the preset network model, and the video memory space required to run the target computation graph is taken as the video memory space required for training the preset network model. Moreover, as the model grows and the batch size increases, the video memory space that the target computation graph can save also grows, so the video memory space for training the network model can be optimized further.
In the embodiments of the present disclosure, for an acquired preset network model, a first computation graph representing the running process of the network model is first generated, and the association relationship between the video memory peak required to run the first computation graph and the runtime data of each operator is analyzed, so that at least one second computation graph can be generated by optimizing the first computation graph. Then, the target computation graph is searched for among the multiple second computation graphs by jointly considering the video memory peak of each second computation graph and the runtime required to run it; the target computation graph found in this way optimizes the video memory space while also taking runtime into account. Finally, the video memory space required by the preset network model is optimized by means of the target computation graph. Because the search for the target computation graph with the best video memory cost among the generated second computation graphs also factors in the time overhead of each computation graph, the final target computation graph satisfies both the space budget and the time budget, and the video memory space is optimized without sacrificing a large amount of time.
In some embodiments, multiple first computation graphs are generated by reading in the JSON computation graph information extracted by different deep learning frameworks; that is, step S101 above can be implemented through the following steps S111 and S112 (not shown in the figures):
Step S111: generate computation graph information in a data exchange format based on the preset network model.
In some embodiments, the preset network model is input into different deep learning frameworks to extract computation graph information for the preset network model in JSON file format. The computation graph information includes the operators that execute different tasks, the operands, and the inputs and outputs of each operator.
Step S112: generate, based on the operator queue in the computation graph information, a first computation graph matching the computation graph information.
In some embodiments, the order in which the operators execute tasks in the preset neural network is analyzed, and the operators are arranged into a queue in that order. The inputs and outputs of each operator are matched in turn against the operands, and the operators are connected by operation edges to form a first computation graph. Based on the JSON computation graph information extracted with different deep learning frameworks, one first computation graph can thus be obtained per framework. By using a deep learning framework to generate a JSON computation graph for the model to be trained, the first computation graph is obtained; it is then adjusted with multiple optimization schemes to generate multiple second computation graphs, from which the computation graph with the best video memory cost can conveniently be selected.
In some embodiments, the association relationship between the video memory peak of the first computation graph and the runtime data is obtained by analyzing the timing relationship between the generation time and application time of the operators' runtime data in the first computation graph and the moment at which the peak occurs; that is, step S102 above can be implemented through the following steps S121 to S124 (not shown in the figures):
Step S121: determine the moment at which the video memory peak occurs in the first computation graph.
In some embodiments, traversing and running the first computation graph makes it possible to determine the moment at which the video memory space it requires reaches its peak, that is, the point within the overall runtime of the first computation graph at which the video memory peak occurs.
Step S122: determine the runtime data of the operators in the first computation graph.
In some embodiments, traversing and running each first computation graph yields the data produced by each operator of the first computation graph while running, the moment at which that data is produced, and the moment at which that data is applied; that is, the runtime data, the generation time of the runtime data, and the application time of the runtime data.
Step S123: determine the generation time of the runtime data and the application time of the runtime data in the first computation graph.
In some embodiments, traversing and running each first computation graph yields the moment at which the video memory peak occurs in the first computation graph, the generation time of the runtime data produced by each operator while running, and the application time of that runtime data.
Step S124: determine the timing relationship between the generation time and the application time, on the one hand, and the moment at which the video memory peak occurs, on the other, to be the association relationship.
In some embodiments, the association relationship between the video memory peak and the runtime data is obtained by analyzing the timing relationship between the generation time and application time of the runtime data and the moment at which the video memory peak occurs. By analyzing how the moment at which an operator generates runtime data and the moment at which that runtime data is applied relate in time to the moment at which the video memory peak is reached, it can be further determined whether the operator needs to be moved in order to lower the peak.
In some embodiments, the timing relationship between the generation time and application time of an operator's running data and the occurrence time of the peak is analyzed to determine whether the running data satisfies a preset condition, and the first computation graph is then adjusted to generate at least one second computation graph. That is, step S103 above can be implemented through the steps shown in FIG. 2. FIG. 2 is a schematic flowchart of another implementation of the video memory optimization method provided by an embodiment of the present disclosure. The following description refers to the steps shown in FIG. 1 and FIG. 2:
Step S201: in the first computation graph, determine target running data whose association relationship satisfies a preset condition.
In some embodiments, among the running data of the operators in the first computation graph, running data whose generation time is before the occurrence time of the video memory peak and whose application time is after the occurrence time of the video memory peak is determined to be the target running data satisfying the preset condition. In some possible implementations, among the running data of the operators in each first computation graph, running data whose generation time is before the occurrence time of the video memory peak is determined first; in response to the application time of such running data being after the occurrence time of the video memory peak, that running data is determined to be the target running data.
In some embodiments, after the first computation graph is run, the running data generated by each operator in the first computation graph during the run is obtained, and this running data is searched for running data whose generation time is before the occurrence time of the video memory peak of the first computation graph. Such running data is produced by an operator before the video memory peak occurs, which means the video memory it occupies is included in the video memory peak. After such running data is found, its application time is determined, and it is judged whether the application time is after the occurrence time of the video memory peak. If it is, the operator produced the running data before the video memory peak occurred while running the first computation graph, but the running data is applied only after the peak. In other words, the running data increases the size of the video memory peak yet is not used before the peak is reached, only afterwards. Such running data is taken as the target running data.
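The two-stage screening described for step S201 can be sketched in Python as follows. The `Record` type, its field names and the sample values are hypothetical; only the condition itself (generated before the peak, applied after the peak) comes from the text.

```python
from typing import NamedTuple


class Record(NamedTuple):
    op: str            # operator that produced the running data (hypothetical name)
    generated_at: int  # step at which the data is generated
    applied_at: int    # step at which the data is applied


def find_target_data(records, peak_at):
    """Return running data generated before the peak and applied after it."""
    # Stage 1: data generated before the peak contributes to the peak.
    before_peak = [r for r in records if r.generated_at < peak_at]
    # Stage 2: of those, keep data that is not needed until after the peak.
    return [r for r in before_peak if r.applied_at > peak_at]


records = [
    Record("conv1", 0, 1),  # produced and consumed before the peak
    Record("mask", 2, 9),   # produced before the peak, consumed after it
    Record("loss", 6, 7),   # produced after the peak
]
targets = find_target_data(records, peak_at=5)
```

Here only the `mask` record qualifies, so the operator that produced it would be the candidate for moving.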
Step S202: based on the target running data, adjust the first computation graph to generate the at least one second computation graph.
In some embodiments, in the first computation graph, the operator corresponding to the target running data whose association relationship satisfies the preset condition is moved to after the peak time, yielding the second computation graph.
In the embodiments of the present disclosure, the first computation graph is screened for target running data whose generation time is before the occurrence time of the video memory peak and whose application time is after the occurrence time of the video memory peak; the operator corresponding to the target running data is then moved within the first computation graph to obtain the second computation graph, so that the peak of the second computation graph can be reduced.
In some embodiments, the second computation graph is generated by moving, within the first computation graph, the operator corresponding to the target running data, and the target computation graph is searched for among the second computation graphs based on information such as their video memory peaks. That is, step S202 in FIG. 2 above can be implemented through the steps shown in FIG. 3A. FIG. 3A is a schematic flowchart of another implementation of the video memory optimization method provided by an embodiment of the present disclosure. The following description refers to the steps shown in FIG. 3A:
Step S301: in the first computation graph, determine the target operator corresponding to the target running data.
In some embodiments, for the first computation graph, the operator that produces the target running data, that is, the target operator, is determined among the multiple operators of the first computation graph.
Step S302: based on the occurrence time of the video memory peak in the first computation graph, adjust the target operator in the first computation graph to generate the at least one second computation graph.
In some embodiments, after the computation graph is traversed and run, the moment at which the first computation graph reaches its video memory peak is obtained. The execution time of the target operator in the first computation graph is adjusted according to this moment, that is, the position of the target operator in the first computation graph is moved to obtain a second computation graph. In this way, for the first computation graph, multiple second computation graphs are obtained by moving target operators according to the occurrence time of the video memory peak in that graph. As shown in FIG. 3B, suppose that analyzing the optimization schemes of the first computation graph 31 determines that it includes two target operators, that is, the first computation graph 31 has two optimization schemes. Moving each target operator separately within the first computation graph 31 generates two optimized computation graphs, namely the second computation graph 32 and the second computation graph 33.
In some possible implementations, the first computation graph is optimized by moving the target operator to after the occurrence time of the video memory peak, generating the second computation graph. That is, step S302 above can be implemented through the following process:
In the first computation graph, the execution time of the target operator is adjusted to after the occurrence time of the video memory peak to generate the second computation graph.
Here, in the first computation graph, the execution time of the target operator is adjusted to after the occurrence time of the video memory peak to generate the second computation graph corresponding to the first computation graph. For the first computation graph, the position of the target operator in that graph is moved. Because the target operator produces the target running data before the peak occurs, and the target running data is used only after the peak occurs, moving the execution time of the target operator to after the occurrence time of the video memory peak lowers the peak of the generated second computation graph. In this way, moving the target operator to after the video memory peak reduces the peak of the newly generated second computation graph, thereby optimizing the video memory space it requires.
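To illustrate why moving a target operator past the peak lowers the peak, the following Python sketch simulates live memory over a linear schedule and then reorders it. The memory model is a simplifying assumption (each output stays live until its last reader has run), and `Op`, `memory_profile` and `move_after_peak` are hypothetical names, not part of the disclosed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    out_size: int       # memory held by this operator's output
    inputs: tuple = ()  # names of operators whose outputs it reads


def memory_profile(schedule):
    """Live memory at each step; an output is freed after its last reader runs."""
    sizes = {op.name: op.out_size for op in schedule}
    last_use = {}
    for step, op in enumerate(schedule):
        last_use[op.name] = step      # unread outputs are freed immediately
        for src in op.inputs:
            last_use[src] = step      # later readers overwrite earlier ones
    live, profile = 0, []
    for step, op in enumerate(schedule):
        live += op.out_size
        profile.append(live)
        live -= sum(sizes[n] for n, last in last_use.items() if last == step)
    return profile


def move_after_peak(schedule, target_name):
    """Reinsert the target operator immediately after the peak operator."""
    profile = memory_profile(schedule)
    peak_step = profile.index(max(profile))
    idx = [op.name for op in schedule].index(target_name)
    first_reader = min(
        (s for s, op in enumerate(schedule) if target_name in op.inputs),
        default=len(schedule),
    )
    if not idx < peak_step < first_reader:
        raise ValueError("output must be generated before and applied after the peak")
    rest = schedule[:idx] + schedule[idx + 1:]
    # In `rest` the peak operator sits at peak_step - 1, so this lands right after it.
    return rest[:peak_step] + [schedule[idx]] + rest[peak_step:]


first_graph = [
    Op("a", 4),
    Op("m", 3),                      # target operator: read only by "d"
    Op("b", 6, ("a",)),
    Op("c", 5, ("b",)),              # peak occurs here
    Op("d", 1, ("c", "m")),
]
second_graph = move_after_peak(first_graph, "m")
```

In this toy schedule the peak drops from 14 to 11 units, which is exactly the size of the output of `m` that no longer has to be held across the peak.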
In the embodiments of the present disclosure, by moving the target operator in a second computation graph to after the occurrence time of the video memory peak, a third computation graph with a lower video memory peak can be generated, which makes it easier to search the at least one third computation graph for a target computation graph whose video memory overhead has been optimized.
In some embodiments, each second computation graph is scored on its video memory space overhead, the second computation graphs are sorted, and the target computation graph is searched for in the sorted queue. That is, step S303 above can be implemented through the steps shown in FIG. 4A. FIG. 4A is a schematic flowchart of yet another implementation of the video memory optimization method provided by an embodiment of the present disclosure. The following description refers to the steps shown in FIG. 3A and FIG. 4A:
Step S401: obtain a preset video memory overhead and a preset trade-off ratio.
In some embodiments, the preset video memory overhead is a set video memory request amount, that is, the size of a video memory space set in advance. The preset trade-off ratio is a parameter, set in advance and less than 1, that weighs the shares of time and space in the score; it balances the running time of a computation graph against the video memory the graph requires.
Step S402: based on the preset video memory overhead and the preset trade-off ratio, score the video memory peak and running time of each second computation graph together with the running time of the corresponding first computation graph, to obtain a scoring result for each second computation graph.
In some embodiments, the preset video memory overhead and the preset trade-off ratio are combined with the video memory peak and running time of the second computation graph, and the running time of its original computation graph (that is, the first computation graph from which the second computation graph was generated) is also taken into account, to evaluate how good the video memory overhead and running time of the second computation graph are, yielding the scoring result of the second computation graph. In this way, each second computation graph is scored by jointly considering its video memory overhead and running time, and a scoring result is obtained for each second computation graph. The scoring result may be a numerical score; a larger score indicates better combined performance of the second computation graph in video memory overhead and running time.
In some possible implementations, the scoring result of a second computation graph is obtained by jointly evaluating its running time and video memory peak. That is, step S402 above can be implemented through the following steps S421 to S423 (not shown in the figures):
Step S421: based on the video memory peak of each second computation graph, the preset video memory overhead and the preset trade-off ratio, determine the video memory score of each second computation graph.
In some embodiments, for any second computation graph, the preset video memory overhead is subtracted from the video memory peak of the second computation graph to obtain a difference. The difference may be positive or negative, depending on whether the video memory peak of the second computation graph is greater or smaller than the preset video memory overhead. Multiplying the difference by the preset trade-off ratio gives the video memory score of the second computation graph.
Step S422: based on the preset trade-off ratio, the running time of each second computation graph and the running time of the corresponding first computation graph, determine the running-time score of each second computation graph.
In some embodiments, a preset standard parameter (for example, set to 1) minus the preset trade-off ratio is used as the ratio for evaluating running time. The running time of the corresponding first computation graph is subtracted from the running time of the second computation graph to obtain a time difference (in general, this time difference is positive). Multiplying the ratio for evaluating running time by the time difference gives the running-time score of the second computation graph.
Step S423: based on the video memory score and the running-time score of each second computation graph, determine the scoring result of each second computation graph.
In some embodiments, for any second computation graph, adding its video memory score and its running-time score gives the scoring result of that second computation graph, so that a combined video memory and running-time score is obtained for each of the at least one second computation graph. In this way, the scoring stage considers both the running time and the video memory peak of the second computation graph, so that the final computation graph optimizes video memory space without sacrificing a large amount of time.
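As a minimal sketch of steps S421 to S423, the scoring arithmetic can be written in Python as below. The function and parameter names are illustrative assumptions, and both differences are taken literally from the step descriptions (video memory peak minus preset overhead, running time of the second graph minus running time of the corresponding first graph). Under that literal reading a smaller total corresponds to less memory over the preset overhead and less slowdown, so the direction in which scores are ranked should be treated as an assumption to confirm rather than something this sketch settles.

```python
def memory_score(peak, preset_overhead, ratio):
    """Step S421: trade-off ratio times (peak minus preset overhead)."""
    return ratio * (peak - preset_overhead)


def runtime_score(second_runtime, first_runtime, ratio, standard=1.0):
    """Step S422: (standard parameter minus ratio) times the runtime difference."""
    return (standard - ratio) * (second_runtime - first_runtime)


def total_score(peak, second_runtime, first_runtime, preset_overhead, ratio):
    """Step S423: sum of the video memory score and the running-time score."""
    return (memory_score(peak, preset_overhead, ratio)
            + runtime_score(second_runtime, first_runtime, ratio))


# Hypothetical numbers: a 30-unit peak against a 32-unit preset overhead,
# and a second graph that runs 2 time units longer than its first graph.
example = total_score(peak=30, second_runtime=12, first_runtime=10,
                      preset_overhead=32, ratio=0.25)
```

With a ratio of 0.25 the memory term is weighted at 0.25 and the running-time term at 0.75, which shows how a ratio below 1 splits the score between space and time.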
Step S403: based on the scoring result of each second computation graph, sort the second computation graphs in the at least one second computation graph to obtain a sorted queue.
In some embodiments, the second computation graphs are sorted by the size of their scores. For example, the second computation graphs are sorted by scoring result from largest to smallest to obtain the sorted queue; alternatively, they are sorted by scoring result from smallest to largest to obtain the sorted queue. As shown in FIG. 3B, if the score of the second computation graph 32 is greater than the score of the second computation graph 33, the two second computation graphs are queued with the second computation graph 32 first and the second computation graph 33 after it.
Step S404: based on the sorted queue, determine the target computation graph in the at least one second computation graph.
In some embodiments, because the sorted queue is ordered by the scores of the second computation graphs, the search follows the order of the queue. First, it checks whether the video memory overhead of the highest-scoring second computation graph satisfies the preset video memory overhead. If it does, that second computation graph is taken as the target computation graph. If it does not, the second computation graph is analyzed for target running data in order to produce optimization schemes for it; the second computation graph is adjusted based on those schemes to generate third computation graphs, which are placed into the sorted queue according to their scoring results, and the search continues by checking whether the video memory overhead of the highest-scoring computation graph in the updated queue satisfies the preset video memory overhead. Finally, if the number of searches reaches its upper limit and no computation graph whose video memory overhead satisfies the preset video memory overhead has been found, the highest-scoring computation graph in the current sorted queue is taken as the target computation graph. In this way, the second computation graphs are scored by jointly considering running time and video memory peak and arranged into a priority queue by scoring result, so that a target computation graph with optimal video memory overhead can be found by searching the queue.
In some possible implementations, the video memory space required by the second computation graphs is searched in the order in which the second computation graphs appear in the sorted queue, in order to find a target computation graph whose video memory overhead satisfies the preset video memory overhead. That is, step S404 can be implemented in the following ways:
Mode one: the video memory space of the second computation graph with the best scoring result in the sorted queue is searched to judge whether its video memory overhead satisfies the preset video memory overhead. This includes the following steps S441 and S442 (not shown in the figures):
Step S441: in the sorted queue, search for the first candidate computation graph with the best scoring result.
In some embodiments, if the sorted queue is obtained by sorting the second computation graphs by score from largest to smallest, the element at the head of the queue is the first candidate computation graph with the best scoring result. Running this first candidate computation graph determines the video memory space it requires.
Step S442: in response to the video memory space required by the found first candidate computation graph satisfying the preset video memory overhead, determine the first candidate computation graph as the target computation graph.
In some embodiments, after the video memory space required by the first candidate computation graph is determined, it is judged whether that video memory space satisfies the preset video memory overhead, that is, whether the first candidate computation graph can run normally within the preset video memory overhead. If it does, the video memory space required by the first candidate computation graph is within the range of the preset video memory overhead, which means the training of the preset network model can be completed with the video memory space corresponding to the first candidate computation graph. The first candidate computation graph is then taken as the target computation graph. In this way, based on the sorted queue, the search first checks whether the video memory space required by the first candidate computation graph with the best scoring result satisfies the preset video memory overhead, which both keeps the number of searches as low as possible and yields a target computation graph with a small video memory overhead.
Mode two: when the video memory overhead of the second computation graph with the best scoring result does not satisfy the preset video memory overhead, the optimization schemes of that second computation graph are analyzed, the sorted queue is updated, the computation graph with the best scoring result is searched for in the updated queue, and it is judged whether the video memory overhead of that computation graph satisfies the preset video memory overhead. This includes the following steps S443 to S446 (not shown in the figures):
Step S443: in response to the video memory space required by the first candidate computation graph not satisfying the preset video memory overhead, adjust the first candidate computation graph based on the target running data of the first candidate computation graph to obtain at least one third computation graph.
In some embodiments, if the video memory space required by the first candidate computation graph does not satisfy the preset video memory overhead, the first candidate computation graph still needs further optimization. On this basis, the target operators corresponding to target running data are identified in the first candidate computation graph, and optimized third computation graphs are generated by moving those target operators within the first candidate computation graph. For example, if the first candidate computation graph is the second computation graph 32 in FIG. 3B, and analysis determines that the second computation graph 32 includes three target operators, moving each target operator separately within the second computation graph 32 generates three third computation graphs, shown in FIG. 4B as the third computation graphs 41, 42 and 43.
Step S444: update the sorted queue based on the scoring result of the at least one third computation graph to obtain an updated sorted queue.
In some embodiments, after multiple third computation graphs are generated from the first candidate computation graph, the first candidate computation graph is first popped from the sorted queue. Then the scoring result of each third computation graph is determined in the manner of steps S401 and S402. Finally, according to the scoring result of each third computation graph together with the scoring results of the second computation graphs in the sorted queue, the at least one third computation graph is inserted into the sorted queue to obtain the updated sorted queue. As shown in FIG. 4B, because the second computation graph 32 has been popped, only the second computation graph 33 remains in the current sorted queue. If the score of the third computation graph 41 is greater than that of the second computation graph 33, the scores of the third computation graphs 42 and 43 are both smaller than that of the second computation graph 33, and the score of the third computation graph 42 is greater than that of the third computation graph 43, then the updated sorted queue shown in FIG. 4B is, in descending order of score: the third computation graph 41, the second computation graph 33, the third computation graph 42 and the third computation graph 43.
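The queue update in the FIG. 4B example can be reproduced with a small priority queue. The sketch below uses Python's `heapq` module, which is a min-heap, so scores are negated to keep the largest score at the head. The numeric scores are invented for illustration and only respect the ordering stated in the example; the helper names are hypothetical.

```python
import heapq


def push(queue, name, score):
    """Insert a graph; negate the score so the largest score is popped first."""
    heapq.heappush(queue, (-score, name))


def pop_best(queue):
    """Remove and return the name of the best-scoring graph."""
    _, name = heapq.heappop(queue)
    return name


def ordered(queue):
    """Names in descending score order, without modifying the queue."""
    return [name for _, name in sorted(queue)]


queue = []
push(queue, "second_graph_32", 0.9)   # invented scores
push(queue, "second_graph_33", 0.5)

popped = pop_best(queue)              # the first candidate leaves the queue
for name, score in [("third_graph_41", 0.7),
                    ("third_graph_42", 0.4),
                    ("third_graph_43", 0.2)]:
    push(queue, name, score)

updated_order = ordered(queue)
```

After the update, `updated_order` lists the third computation graph 41, the second computation graph 33, the third computation graph 42 and the third computation graph 43, matching the order described for FIG. 4B.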
Step S445: in the updated sorted queue, search whether the video memory space required by the second candidate computation graph with the best scoring result satisfies the preset video memory overhead.
In some embodiments, after the video memory space required by the second candidate computation graph in the updated sorted queue is determined, it is judged whether that video memory space satisfies the preset video memory overhead. If it does not (for example, the video memory space required by the second candidate computation graph is greater than the preset video memory overhead), new computation graphs are generated based on the optimization schemes of the second candidate computation graph, and the updated sorted queue is updated again according to the scoring results of the new computation graphs.
Step S446: in response to the video memory space required by the second candidate computation graph satisfying the preset video memory overhead, determine the second candidate computation graph as the target computation graph. In this way, based on the sorted queue, when the video memory space required by the computation graph with the best scoring result does not satisfy the preset video memory overhead, that computation graph is optimized further, and the latest sorted queue is searched to check whether the video memory space required by the best-scoring computation graph satisfies the preset video memory overhead. After multiple searches, the target computation graph found has a better video memory overhead.
Mode three: when the video memory overhead of the computation graph with the best scoring result does not satisfy the preset video memory overhead and the number of searches reaches a set threshold, the computation graph with the best scoring result in the latest sorted queue is taken as the target computation graph. This includes the following step S447 (not shown in the figures):
Step S447: in response to the video memory space required by the second candidate computation graph not satisfying the preset video memory overhead and the number of searches reaching a preset count threshold, determine the computation graph with the best scoring result in the sorted queue corresponding to the last search as the target computation graph.
In some embodiments, the preset count threshold may be set based on the number of computation graphs in the sorted queue; for example, the preset count threshold is set to less than half the number of computation graphs in the sorted queue. If the video memory space required by the second candidate computation graph does not satisfy the preset video memory overhead, and, after the queue is updated based on the second candidate computation graph, the video memory space required by the best-scoring computation graph found in the next search still does not satisfy the preset video memory overhead, then once the number of searches reaches the preset count threshold, the best-scoring computation graph in the latest sorted queue is determined as the target computation graph. In this way, when no computation graph satisfying the preset video memory overhead can be found, the best-scoring computation graph in the latest sorted queue is taken as the target computation graph, so that the target computation graph found is the one with the best space overhead.
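The three modes of step S404 can be read as one best-first search loop, sketched below in Python under stated assumptions. The callables `score`, `required_memory` and `expand` are hypothetical stand-ins for the scoring of step S402, for running a graph to measure its video memory, and for generating new graphs by moving target operators. Treating "satisfies the preset video memory overhead" as "required memory not greater than the preset overhead" is also an assumption, as is ranking larger scores first.

```python
import heapq
from itertools import count


def search_target_graph(second_graphs, score, required_memory, expand,
                        preset_overhead, max_searches):
    """Best-first search over computation graphs, bounded by max_searches."""
    order = count()  # tie-breaker so graph objects are never compared directly
    queue = [(-score(g), next(order), g) for g in second_graphs]
    heapq.heapify(queue)
    candidate = None
    for _ in range(max_searches):
        if not queue:
            break
        candidate = queue[0][2]                       # best-scoring graph
        if required_memory(candidate) <= preset_overhead:
            return candidate                          # modes one and two
        heapq.heappop(queue)                          # pop the failed candidate
        for child in expand(candidate):               # graphs derived from it
            heapq.heappush(queue, (-score(child), next(order), child))
    # Mode three: search limit reached, fall back to the best graph in the queue.
    return queue[0][2] if queue else candidate


# Toy data: name -> (score, required memory, derived graphs). All values invented.
TOY = {
    "g32": (0.9, 40, ["g41", "g42", "g43"]),
    "g33": (0.5, 36, []),
    "g41": (0.7, 30, []),
    "g42": (0.4, 28, []),
    "g43": (0.2, 27, []),
}


def run_toy(preset_overhead, max_searches):
    return search_target_graph(
        ["g32", "g33"],
        score=lambda g: TOY[g][0],
        required_memory=lambda g: TOY[g][1],
        expand=lambda g: TOY[g][2],
        preset_overhead=preset_overhead,
        max_searches=max_searches,
    )
```

With a preset overhead of 32 the toy search rejects `g32`, expands it, and accepts `g41`; with an unreachable overhead of 10 and a limit of two searches it falls back to `g33`, the best-scoring graph left in the queue.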
In the embodiments of the present disclosure, multiple second computation graphs are sorted by jointly considering their running time and video memory peak, and the target computation graph is searched for in the sorted queue, so that the target computation graph found is the one with the best space overhead.
In some embodiments, before the preset network model is trained, the memory space and running time required by the preset network model can be determined by determining its target computation graph. That is, step S105 above can be implemented through the following process:
The video memory space required by the target computation graph is determined as the video memory space required for training the preset network model.
In the embodiments of the present disclosure, when the network model is trained in a real scenario, the target computation graph is run to obtain the video memory space and running time it requires. These are used as the estimated video memory space and running time required for training the preset network model, so that the video memory space required for training the network model can be optimized without sacrificing the running time required for training it.
An exemplary application of the embodiments of the present disclosure in a practical scenario is described below. For a large-scale deep neural network, the description takes as an example optimizing the video memory occupied by the neural network based on its computation graph.
In some embodiments, when residual network 269 (ResNeSt269), which has 100 million network parameters, is trained on ImageNet, video memory usage already approaches the 32 gigabyte (GB) limit of a V100, with training occupying 28 GB. When the model is enlarged further, or the batch size is increased, the video memory occupied by training grows accordingly; eventually it exceeds the video memory capacity of the graphics card and hits the video memory wall, so that the model cannot be trained. FIG. 5 is a schematic diagram of the curve of video memory occupied by a computation graph according to an embodiment of the present disclosure, in which the abscissa represents the execution order of the operators in the computation graph and the ordinate represents the amount of video memory occupied during operator execution. Curve 501 represents the amount of memory requested, and curve 502 represents the amount of memory cached at different moments while the computation graph executes a task. As curve 502 shows, the computation graph reaches its peak point 503 at the end of the feed-forward phase. The peak point 503 exceeds the peak of curve 501, that is, the video memory occupied by the computation graph exceeds the video memory capacity of the graphics card, so model training cannot continue. In this case, video memory optimization becomes particularly important. Optimization based on computation graph analysis is one of many video memory optimization methods. In the related art, methods that analyze and optimize the computation graph often merely move operators in a simple way to achieve a preliminary reduction in video memory usage. Such methods usually serve only as a preliminary optimization, with the emphasis placed on subsequent, further optimization methods; they can reduce video memory usage only slightly and lack a complete optimization framework.
Based on this, the embodiments of the present disclosure provide a video memory optimization method. First, a computation graph is generated from the network model using a conventional deep learning framework (TensorFlow, PyTorch, etc.) or the Parrots framework. Then, through the computation graph, a large training task can be broken down into individual operators (tasks); each operator uses existing data (the operator's inputs) and produces new data (the operator's outputs). The computation graph also gives the space occupied by each operand associated with an operator and the time the operator needs for its computation, so video memory usage can be optimized by analyzing the computation graph. In this way, with the goal of lowering the peak memory usage, and with peak shifting during operator movement taken into account, an optimal computation graph can be obtained. Moreover, an evaluation function is proposed for judging the quality of a computation graph, so that the time cost is considered together with the occupied video memory rather than video memory alone.
The implementation process of the video memory optimization method provided by the embodiments of the present disclosure is shown in FIG. 6, a schematic flowchart of the method. It is described below with reference to the steps shown in FIG. 6:
Step S601: read computation graph information from a JSON file.
In some embodiments, a machine learning framework (TensorFlow, PyTorch, or Parrots) is first used to generate a JSON-format computation graph for the model to be trained, and the graph is stored in a JSON file. The video memory optimization process is then started and reads in the JSON-format computation graph. In the video memory optimization process, a computation graph object (schedule) is defined, and each computation graph object represents a different computation graph.
Step S602: generate a corresponding computation graph object based on the computation graph information read.
In some embodiments, a new computation graph object is generated, the information read from the JSON file is obtained, the operator queue is arranged in order, and the inputs and outputs of each operator are matched to operands in turn.
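Steps S601 and S602 can be illustrated with a minimal Python sketch. The source does not disclose the JSON layout, so the schema used here (the keys "operands", "tasks", "name", "bytes", "inputs", "outputs", and "time_ms") and the names Operator, Schedule, and load_schedule are assumptions made purely for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    inputs: list      # operand names read by the operator
    outputs: list     # operand names written by the operator
    time_ms: float    # compute time of the operator


@dataclass
class Schedule:
    operators: list   # Operator objects in execution order
    sizes: dict       # operand name -> size in bytes


def load_schedule(text):
    """Build a Schedule from JSON text that uses the hypothetical schema below."""
    raw = json.loads(text)
    sizes = {item["name"]: item["bytes"] for item in raw["operands"]}
    operators = []
    for task in raw["tasks"]:              # list order is taken as execution order
        for operand in task["inputs"] + task["outputs"]:
            if operand not in sizes:       # every input/output must match an operand
                raise KeyError(f"{task['name']}: unknown operand {operand!r}")
        operators.append(Operator(task["name"], task["inputs"],
                                  task["outputs"], task["time_ms"]))
    return Schedule(operators, sizes)


# Hypothetical two-operator graph; none of these field names come from the source.
EXAMPLE = """
{
  "operands": [{"name": "x", "bytes": 4},
               {"name": "h", "bytes": 8},
               {"name": "y", "bytes": 2}],
  "tasks": [
    {"name": "conv", "inputs": ["x"], "outputs": ["h"], "time_ms": 1.5},
    {"name": "relu", "inputs": ["h"], "outputs": ["y"], "time_ms": 0.5}
  ]
}
"""

schedule = load_schedule(EXAMPLE)
print([op.name for op in schedule.operators], schedule.sizes)
```

Keeping the operator list ordered and the operand sizes in a separate table is one simple way to let later steps reorder operators without touching the operand data.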
Step S603: analyze whether the current computation graph object satisfies the video memory space overhead.
In some embodiments, analyzing the maximum space overhead and the computation time of the current computation graph object includes:
(1) The total computation time is obtained by summing the computation times of all operators in the computation graph object.
(2) The topology and video memory overhead of the computation graph are obtained by traversing all operators and their inputs and outputs. If the computation graph object already satisfies the space overhead, the process proceeds to step S604 and no optimization is needed. If it does not, the process proceeds to step S605.
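The analysis in step S603 can be sketched as follows. The source does not specify the memory model; this sketch assumes that an operand is allocated when its producing operator runs and released after its last consumer runs, and that graph inputs are live from the first step. The function name analyze, the tuple layout, and the sample graph are hypothetical.

```python
def analyze(ops, sizes):
    """ops: (name, inputs, outputs, time_ms) tuples in execution order."""
    total_time = sum(time_ms for _, _, _, time_ms in ops)
    start, end = {}, {}
    for step, (_, inputs, outputs, _) in enumerate(ops):
        for operand in outputs:            # allocated when produced
            start.setdefault(operand, step)
            end.setdefault(operand, step)
        for operand in inputs:             # graph inputs are live from step 0
            start.setdefault(operand, 0)
            end[operand] = step            # released after the last consumer
    usage = [sum(sizes[name] for name in start
                 if start[name] <= step <= end[name])
             for step in range(len(ops))]
    peak_step = max(range(len(ops)), key=usage.__getitem__)
    return total_time, usage, peak_step


# Hypothetical graph: "side" produces s early, but s is only consumed by "head".
OPS = [
    ("stem", ["x"], ["a"], 1.0),
    ("side", ["x"], ["s"], 2.0),
    ("body", ["a"], ["b"], 1.5),
    ("pool", ["b"], ["c"], 1.0),
    ("head", ["c", "s"], ["y"], 0.5),
]
SIZES = {"x": 1, "a": 6, "s": 4, "b": 2, "c": 1, "y": 1}

total_time, usage, peak_step = analyze(OPS, SIZES)
limit = 10                                 # hypothetical video memory budget
print(total_time, usage, peak_step, usage[peak_step] <= limit)
```

In this sample the peak of 12 units occurs at the third operator and exceeds the budget of 10, so the flow would continue to step S605.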
Step S605: based on the topology of the computation graph object, find positions that can be optimized, and add the new computation graph objects generated from those positions to the priority queue.
In some embodiments, a position that can be optimized is found as follows: first, the time point at which memory reaches its peak is identified; then, with the peak as the dividing point, data that is generated before the peak but not used until after the peak is sought. The position of such data is a position that can be optimized.
In some possible implementations, the operator that generates such data may be moved to after the peak in order to lower the peak. All operators satisfying this condition constitute the optimizable options of the current computation graph object. A series of new computation graph objects is generated from these options, and each new computation graph object is added to the priority queue as an element.
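The candidate selection and operator move described in step S605 might look like the sketch below. It assumes the step at which memory peaks has already been determined by a separate analysis and is passed in as peak_step; the function names and the sample graph are hypothetical, and placing the moved operator immediately after the peak operator is one possible choice, since the source only says "after the peak".

```python
def first_use(ops, operand):
    """Index of the first operator that reads operand, or None if it is never read."""
    for step, (_, inputs, _, _) in enumerate(ops):
        if operand in inputs:
            return step
    return None


def movable_operators(ops, peak_step):
    """Indices of pre-peak operators whose outputs are all first read after the peak."""
    found = []
    for step, (_, _, outputs, _) in enumerate(ops[:peak_step]):
        uses = [first_use(ops, operand) for operand in outputs]
        if outputs and all(u is not None and u > peak_step for u in uses):
            found.append(step)
    return found


def move_after_peak(ops, op_step, peak_step):
    """Return a new operator order with ops[op_step] placed right after the peak operator."""
    new_ops = list(ops)
    op = new_ops.pop(op_step)          # the peak operator shifts to peak_step - 1
    new_ops.insert(peak_step, op)      # so this lands immediately after it
    return new_ops


# Hypothetical graph: (name, inputs, outputs, time_ms) in execution order.
OPS = [
    ("stem", ["x"], ["a"], 1.0),
    ("side", ["x"], ["s"], 2.0),
    ("body", ["a"], ["b"], 1.5),
    ("pool", ["b"], ["c"], 1.0),
    ("head", ["c", "s"], ["y"], 0.5),
]
PEAK_STEP = 2                          # assumed result of a prior memory analysis

candidates = movable_operators(OPS, PEAK_STEP)
new_graphs = [move_after_peak(OPS, step, PEAK_STEP) for step in candidates]
print(candidates, [[op[0] for op in graph] for graph in new_graphs])
```

Because a moved operator keeps its inputs alive for longer, the peak can shift to a different step, which is why each generated graph needs its own peak analysis before it is scored.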
In some possible implementations, the queue is ordered by the scores of the new computation graph objects, and the score Score of each computation graph is determined by the following formula:
Score = MEMORY_FACTOR * (peak_memory - limit) / limit + (1 - MEMORY_FACTOR) * (total_time - origin_time) / origin_time;
where peak_memory is the peak video memory usage of the computation graph, limit is the configured video memory overhead budget, total_time is the execution time of the computation graph, and origin_time is the execution time of the initial computation graph. MEMORY_FACTOR is a parameter that weighs the relative contribution of time and space to the score.
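The scoring formula translates directly into code. The sketch below assumes that a lower score is better, which follows from both terms measuring an excess over a reference value but is not stated explicitly in the source; the default MEMORY_FACTOR of 0.5 and the sample numbers are hypothetical.

```python
def score(peak_memory, limit, total_time, origin_time, memory_factor=0.5):
    """Weighted sum of relative memory excess and relative time excess."""
    memory_term = (peak_memory - limit) / limit
    time_term = (total_time - origin_time) / origin_time
    return memory_factor * memory_term + (1 - memory_factor) * time_term


# Hypothetical numbers: budget of 10 units, initial graph takes 100 ms.
initial = score(peak_memory=12, limit=10, total_time=100, origin_time=100)
optimized = score(peak_memory=9, limit=10, total_time=110, origin_time=100)
print(initial, optimized)
```

With these values the initial graph is 20% over budget with no time penalty, while the optimized graph is 10% under budget but 10% slower, so the two terms of the optimized score cancel and it ranks ahead of the initial graph.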
Step S606: integrate the first element of the priority queue into a new computation graph object, and determine whether this new computation graph object satisfies the space overhead.
In some embodiments, the first element of the priority queue (i.e., the computation graph object with the best score) is popped, and whether it satisfies the space overhead is determined. If it does, the process proceeds to step S607. If it does not, the process returns to step S605 for the next search; if the computation graph objects found still do not satisfy the space overhead and the number of searches reaches the preset upper limit, the process proceeds to step S608.
Step S607: output the computation graph object that satisfies the condition and save it as a JSON file.
Step S608: terminate the search, and take the first element of the current priority queue as the optimal computation graph.
Steps S601 to S608 above give a priority-queue-based backtracking search strategy for finding the optimal computation graph. First, with the moment at which video memory peaks as the boundary, data that is generated before the peak but not used until after the peak is sought, and the positions of the operators corresponding to such data are taken as optimizable positions. Second, a score is computed for each computation graph optimized at such a position, and the graph is added to the priority queue according to its score for searching; peak shifting is also taken into account, that is, peak video memory usage is analyzed every time an optimization option produces a new computation graph. Finally, if no computation graph satisfying the requirement can be found, the best-scoring computation graph found so far is returned.
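The overall strategy can be sketched end to end with Python's heapq module as the priority queue. This is a simplified illustration, not the disclosed implementation: the memory model (operands live from production to last use), the placement of a moved operator directly after the peak operator, the lower-is-better score, the visited set, the fallback to the lowest-scoring graph evaluated so far, and all names and sample values are assumptions. In this model reordering does not change the total time, so the time term of the score is zero.

```python
import heapq
from itertools import count


def analyze(ops, sizes):
    """Return (total_time, peak_memory, peak_index) for an ordered operator list."""
    start, end = {}, {}
    for i, (_, inputs, outputs, _) in enumerate(ops):
        for t in outputs:                 # allocated when produced
            start.setdefault(t, i)
            end.setdefault(t, i)
        for t in inputs:                  # graph inputs are live from step 0
            start.setdefault(t, 0)
            end[t] = i                    # released after the last consumer
    usage = [sum(sizes[t] for t in start if start[t] <= i <= end[t])
             for i in range(len(ops))]
    peak_index = max(range(len(ops)), key=usage.__getitem__)
    return sum(op[3] for op in ops), usage[peak_index], peak_index


def neighbours(ops, peak_index):
    """Yield orders in which one pre-peak producer is moved to just after the peak."""
    for i, (_, _, outputs, _) in enumerate(ops[:peak_index]):
        first_uses = [min((j for j, o in enumerate(ops) if t in o[1]), default=-1)
                      for t in outputs]
        if outputs and all(u > peak_index for u in first_uses):
            moved = list(ops)
            op = moved.pop(i)
            moved.insert(peak_index, op)  # old peak operator now sits at peak_index - 1
            yield moved


def search(ops, sizes, limit, memory_factor=0.5, max_searches=8):
    """Best-first search over reordered graphs; returns (operator order, peak memory)."""
    origin_time, peak, peak_index = analyze(ops, sizes)

    def score(p, t):                      # lower is better in this sketch
        return (memory_factor * (p - limit) / limit
                + (1 - memory_factor) * (t - origin_time) / origin_time)

    if peak <= limit:
        return ops, peak                  # already within budget, nothing to optimize
    best = (score(peak, origin_time), ops, peak)
    tie = count()                         # tie-breaker so the heap never compares lists
    queue, seen = [], {tuple(op[0] for op in ops)}
    current, current_peak_index = ops, peak_index
    for _ in range(max_searches):
        for cand in neighbours(current, current_peak_index):
            key = tuple(op[0] for op in cand)
            if key in seen:
                continue
            seen.add(key)
            t, p, pi = analyze(cand, sizes)   # re-analyze: the peak may have shifted
            heapq.heappush(queue, (score(p, t), next(tie), cand, p, pi))
        if not queue:
            break
        value, _, current, p, current_peak_index = heapq.heappop(queue)
        if value < best[0]:
            best = (value, current, p)
        if p <= limit:
            return current, p             # satisfies the budget
    return best[1], best[2]               # fallback: lowest score evaluated so far


OPS = [
    ("stem", ["x"], ["a"], 1.0),
    ("side", ["x"], ["s"], 2.0),
    ("body", ["a"], ["b"], 1.5),
    ("pool", ["b"], ["c"], 1.0),
    ("head", ["c", "s"], ["y"], 0.5),
]
SIZES = {"x": 1, "a": 6, "s": 4, "b": 2, "c": 1, "y": 1}

order, peak = search(OPS, SIZES, limit=10)
print([op[0] for op in order], peak)
```

With a budget of 10 the search returns the reordered graph whose peak is 9; with an unreachable budget such as 5 it exhausts the candidates and falls back to the same graph as the lowest-scoring one evaluated.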
In the embodiments of the present disclosure, when training a model in a real scenario, a user may first use the video memory optimization method provided by the embodiments of the present disclosure to analyze and optimize the model's computation graph. From the computation graph, the user can get a rough idea of the memory space and time the model requires. In the embodiments of the present disclosure, computation graph optimization is completed before training starts; combined with further subsequent optimizations, the overall memory savings can be substantial.
In a specific example, for the given sample computation graph (pattern.json), the initial computation graph has a peak video memory usage of 3.38 gibibytes (GiB) and takes 129.03 milliseconds (ms), while the optimal computation graph has a peak of 1.72 GiB and takes 136.09 ms; in this case, the video memory optimization rate reaches 49%. As the training model and the batch size grow, the room for optimizing the computation graph also grows, and the effect of the video memory optimization method provided by the embodiments of the present disclosure increases further. In this way, the video memory occupied by large-scale deep learning can be greatly reduced, which in turn substantially lowers the space cost of large-scale training. Even if no computation graph satisfying the configured space overhead is found, the method can still provide the computation graph with the best overall score under the current conditions for the user's reference. Moreover, because computation time cost is taken into account when scoring computation graphs, the final computation graph does not sacrifice a large amount of time to save a small amount of video memory.
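The figures in this example can be checked with a few lines of arithmetic. The sketch below uses only the four numbers quoted above; the variable names are illustrative.

```python
initial_peak_gib, initial_ms = 3.38, 129.03
optimized_peak_gib, optimized_ms = 1.72, 136.09

# Relative reduction in peak video memory and relative increase in running time.
memory_saving = (initial_peak_gib - optimized_peak_gib) / initial_peak_gib
time_overhead = (optimized_ms - initial_ms) / initial_ms

print(f"memory saving: {memory_saving:.1%}, time overhead: {time_overhead:.1%}")
```

The memory saving comes out at roughly 49%, matching the stated optimization rate, while the running time increases by roughly 5.5%, which illustrates the trade-off that the time term of the score is meant to keep in check.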
Those skilled in the art will understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict order of execution and does not limit the implementation process in any way; the specific order of execution of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, the embodiments of the present disclosure further provide a video memory optimization apparatus corresponding to the video memory optimization method. Because the apparatus solves the problem on a principle similar to that of the video memory optimization method described above, the implementation of the apparatus may refer to the implementation of the method.
An embodiment of the present disclosure provides a video memory optimization apparatus. FIG. 7 is a schematic structural diagram of the video memory optimization apparatus according to an embodiment of the present disclosure. As shown in FIG. 7, the video memory optimization apparatus 700 includes:
a first generating part 701, configured to generate a first computation graph based on a preset network model;
a first determining part 702, configured to determine an association relationship between the peak video memory of the first computation graph and running data;
a second generating part 703, configured to adjust the first computation graph based on the association relationship to generate at least one second computation graph;
a second determining part 704, configured to determine a target computation graph among the at least one second computation graph based on the peak video memory and running time of the at least one second computation graph;
a third determining part 705, configured to determine, based on the target computation graph, the video memory space required by the preset network model.
In some embodiments, the first generating part 701 includes:
a first generating sub-part, configured to generate computation graph information in a data exchange format based on the preset network model;
a second generating sub-part, configured to generate, based on the operator queue in the computation graph information, a first computation graph matching the computation graph information.
In some embodiments, the first determining part 702 includes:
a first determining sub-part, configured to determine the occurrence time of the peak video memory in the first computation graph;
a second determining sub-part, configured to determine the running data of the operators in the first computation graph;
a third determining sub-part, configured to determine the generation time of the running data and the application time of the running data in the first computation graph, and to determine, as the association relationship, the timing relationship between the generation time and the application time on the one hand and the occurrence time of the peak video memory on the other.
In some embodiments, the second generating part 703 includes:
a fourth determining sub-part, configured to determine, in the first computation graph, target running data whose association relationship satisfies a preset condition;
a first adjusting sub-part, configured to adjust the first computation graph based on the target running data to generate the at least one second computation graph.
In some embodiments, the fourth determining sub-part includes:
a first determining unit, configured to determine, among the running data of the operators in the first computation graph, running data whose generation time is before the occurrence time of the peak video memory and whose application time is after the occurrence time of the peak video memory as the target running data satisfying the preset condition.
In some embodiments, the first adjusting sub-part includes:
a second determining unit, configured to determine, in the first computation graph, a target operator corresponding to the target running data;
a first adjusting unit, configured to adjust the target operator in the first computation graph based on the occurrence time of the peak video memory in the first computation graph to generate the at least one second computation graph.
In some embodiments, the first adjusting unit is further configured to:
in the first computation graph, adjust the execution time of the target operator to after the occurrence time of the peak video memory to generate the second computation graph.
In some embodiments, the second determining part 704 includes:
a first obtaining sub-part, configured to obtain a preset video memory overhead and a preset trade-off ratio, where the preset trade-off ratio is used to weigh the relative weight of a computation graph's running time and its required video memory;
a first scoring sub-part, configured to score, based on the preset video memory overhead and the preset trade-off ratio, the peak video memory and running time of each second computation graph and the running time of the corresponding first computation graph, to obtain a scoring result for each second computation graph;
a first sorting sub-part, configured to sort the second computation graphs of the at least one second computation graph based on the scoring result of each second computation graph, to obtain a sorted queue;
a fifth determining sub-part, configured to determine the target computation graph among the at least one second computation graph based on the sorted queue.
In some embodiments, the first scoring sub-part includes:
a third determining unit, configured to determine a video memory score for each second computation graph based on the peak video memory of that second computation graph, the preset video memory overhead, and the preset trade-off ratio;
a first scoring unit, configured to determine a running-time score for each second computation graph based on the preset trade-off ratio, the running time of that second computation graph, and the running time of the corresponding first computation graph;
a fourth determining unit, configured to determine the scoring result of each second computation graph based on its video memory score and running-time score.
In some embodiments, the fifth determining sub-part includes:
a first searching unit, configured to search the sorted queue for a first candidate computation graph with the best scoring result;
a fifth determining unit, configured to determine the first candidate computation graph as the target computation graph in response to the video memory space required by the first candidate computation graph found satisfying the preset video memory overhead.
In some embodiments, the fifth determining sub-part includes:
a second adjusting unit, configured to, in response to the video memory space required by the first candidate computation graph not satisfying the preset video memory overhead, adjust the first candidate computation graph based on the target running data of the first candidate computation graph to obtain at least one third computation graph;
a first updating unit, configured to update the sorted queue based on the scoring result of the at least one third computation graph to obtain an updated sorted queue;
a second searching unit, configured to search the updated sorted queue for a second candidate computation graph with the best scoring result and determine whether the video memory space it requires satisfies the preset video memory overhead;
a sixth determining unit, configured to, in response to the video memory space required by the second candidate computation graph not satisfying the preset video memory overhead and the number of searches reaching a preset count threshold, determine the best-scoring computation graph in the sorted queue corresponding to the last search as the target computation graph.
In some embodiments, the third determining part 705 is further configured to:
determine the video memory space required by the target computation graph as the video memory space required for training the preset network model.
In the embodiments of the present disclosure and other embodiments, a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may of course also be a unit, and may be modular or non-modular.
It should be noted that the description of the above apparatus embodiments is similar to the description of the above method embodiments, and the apparatus embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present disclosure, refer to the description of the method embodiments of the present disclosure.
It should be noted that, in the embodiments of the present disclosure, if the above video memory optimization method is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a terminal, a server, or the like) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of the present disclosure are not limited to any specific combination of hardware and software.
An embodiment of the present disclosure further provides a computer program product. The computer program product includes computer-executable instructions which, when executed, implement the video memory optimization method provided by the embodiments of the present disclosure.
An embodiment of the present disclosure further provides a computer storage medium storing computer-executable instructions which, when executed by a processor, implement the video memory optimization method provided by the above embodiments.
An embodiment of the present disclosure provides a computer device. FIG. 8 is a schematic structural diagram of the computer device according to an embodiment of the present disclosure. As shown in FIG. 8, the computer device 800 includes a processor 801, at least one communication bus, a communication interface 802, at least one external communication interface, and a memory 803. The communication interface 802 is configured to implement connection and communication between these components. The communication interface 802 may include a display screen, and the external communication interface may include standard wired and wireless interfaces. The processor 801 is configured to execute a video memory optimization program in the memory to implement the video memory optimization method provided by the above embodiments.
The descriptions of the above video memory optimization apparatus, computer device, and storage medium embodiments are similar to the descriptions of the above method embodiments, with technical descriptions and beneficial effects similar to those of the corresponding method embodiments. For reasons of space, refer to the above method embodiments; the details are not repeated here. For technical details not disclosed in the video memory optimization apparatus, computer device, and storage medium embodiments of the present disclosure, refer to the description of the method embodiments of the present disclosure.
The device involved in the embodiments of the present disclosure may be at least one of a system, a method, and a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.
A computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. Examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punched card or raised structures in a groove with instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, for example at least one of the Internet, a local area network, a wide area network, and a wireless network. The network may include at least one of copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
Computer program instructions for carrying out the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), is personalized using state information of the computer-readable program instructions; the electronic circuit can then execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
It should be understood that reference throughout the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, these particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should also be understood that, in the various embodiments of the present disclosure, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present disclosure in any way. The serial numbers of the above embodiments are for description only and do not represent the relative merits of the embodiments. It should be noted that, herein, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in actual implementation there may be other divisions, for example: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may all be integrated into one processing unit, or each unit may serve separately as a single unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units. Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be completed by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media capable of storing program code, such as a removable storage device, a read-only memory, a magnetic disk, or an optical disc. Alternatively, if the integrated unit of the present disclosure is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present disclosure. The storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Industrial Applicability
Embodiments of the present disclosure provide a video memory optimization method, apparatus, device, storage medium, and program product. The method includes: generating a first computation graph based on a preset network model; determining an association relationship between a video memory peak of the first computation graph and running data; adjusting the first computation graph based on the association relationship to generate at least one second computation graph; determining a target computation graph among the at least one second computation graph based on the video memory peak and the running duration of the at least one second computation graph; and determining, based on the target computation graph, the video memory space required by the preset network model.
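By way of a non-limiting illustration, the peak analysis and graph adjustment summarized above might be sketched as follows. The sketch is not the implementation of the disclosure: the `Op` record, the helper names, the memory model (an operator's output stays resident from the step that generates it until the last step that applies it), and the toy sizes are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical operator record (not taken from the disclosure)."""
    name: str
    inputs: tuple   # names of operators whose output this operator applies
    out_size: int   # video memory held by this operator's output


def lifetimes(schedule):
    """Map each output to its generation step, first application, last application."""
    pos = {op.name: i for i, op in enumerate(schedule)}
    first_use, last_use = {}, dict(pos)
    for i, op in enumerate(schedule):
        for src in op.inputs:
            first_use.setdefault(src, i)   # keeps the earliest application
            last_use[src] = i              # ends up as the latest application
    return pos, first_use, last_use


def peak(schedule):
    """Return (step, amount) of the video memory peak for an execution order."""
    pos, _, last_use = lifetimes(schedule)
    profile = [
        sum(op.out_size for op in schedule
            if pos[op.name] <= t <= last_use[op.name])
        for t in range(len(schedule))
    ]
    t = max(range(len(profile)), key=profile.__getitem__)
    return t, profile[t]


def deferred_variants(schedule):
    """Build one adjusted schedule per output that is generated before the
    peak and first applied after it, by moving its operator to just after
    the peak step. Dependencies stay valid because the operator only moves
    later and still precedes its first consumer."""
    peak_t, _ = peak(schedule)
    pos, first_use, _ = lifetimes(schedule)
    variants = []
    for op in schedule:
        if pos[op.name] < peak_t and first_use.get(op.name, -1) > peak_t:
            moved = [o for o in schedule if o is not op]
            moved.insert(peak_t, op)   # lands immediately after the peak operator
            variants.append(moved)
    return variants


# Toy first computation graph: b is generated early but applied only by e.
schedule = [
    Op("a", (), 4),
    Op("b", (), 8),
    Op("c", ("a",), 6),
    Op("d", ("c",), 2),
    Op("e", ("b", "d"), 1),
]
variants = deferred_variants(schedule)
print(peak(schedule))                  # (2, 18)
print([o.name for o in variants[0]])   # ['a', 'c', 'b', 'd', 'e']
print(peak(variants[0]))               # (3, 16)
```

In this toy case, deferring `b` lowers the peak from 18 to 16 and moves it to a later step, which is why each adjusted graph has to be measured again before one is selected as the target.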

Claims (20)

  1. A video memory optimization method, the method comprising:
    generating a first computation graph based on a preset network model;
    determining an association relationship between a video memory peak of the first computation graph and running data;
    adjusting the first computation graph based on the association relationship to generate at least one second computation graph;
    determining a target computation graph among the at least one second computation graph based on the video memory peak and the running duration of the at least one second computation graph; and
    determining, based on the target computation graph, the video memory space required by the preset network model.
  2. The method according to claim 1, wherein generating the first computation graph based on the preset network model comprises:
    generating computation graph information in a data exchange format based on the preset network model; and
    generating, based on an operator queue in the computation graph information, the first computation graph matching the computation graph information.
  3. The method according to claim 1 or 2, wherein determining the association relationship between the video memory peak of the first computation graph and the running data comprises:
    determining an occurrence time of the video memory peak in the first computation graph;
    determining running data of operators in the first computation graph;
    determining a generation time of the running data and an application time of the running data in the first computation graph; and
    determining the timing relationship of the generation time and the application time relative to the occurrence time of the video memory peak as the association relationship.
  4. The method according to any one of claims 1 to 3, wherein adjusting the first computation graph based on the association relationship to generate the at least one second computation graph comprises:
    determining, in the first computation graph, target running data whose association relationship satisfies a preset condition; and
    adjusting the first computation graph based on the target running data to generate the at least one second computation graph.
  5. The method according to claim 4, wherein determining, in the first computation graph, the target running data whose association relationship satisfies the preset condition comprises:
    determining, among the running data of the operators in the first computation graph, running data whose generation time is before the occurrence time of the video memory peak and whose application time is after the occurrence time of the video memory peak, as the target running data satisfying the preset condition.
  6. The method according to claim 4 or 5, wherein adjusting the first computation graph based on the target running data to generate the at least one second computation graph comprises:
    determining, in the first computation graph, a target operator corresponding to the target running data; and
    adjusting the target operator in the first computation graph based on the occurrence time of the video memory peak in the first computation graph, to generate the at least one second computation graph.
  7. The method according to claim 6, wherein adjusting the target operator in the first computation graph based on the occurrence time of the video memory peak in the first computation graph, to generate the at least one second computation graph, comprises:
    adjusting, in the first computation graph, the execution time of the target operator to be after the occurrence time of the video memory peak, to generate the second computation graph.
  8. The method according to any one of claims 1 to 7, wherein determining the target computation graph among the at least one second computation graph based on the video memory peak and the running duration of the at least one second computation graph comprises:
    acquiring a preset video memory overhead and a preset trade-off ratio, wherein the preset trade-off ratio is used to weigh the running duration of a computation graph against the video memory it requires;
    scoring, based on the preset video memory overhead and the preset trade-off ratio, the video memory peak and running duration of each second computation graph and the running duration of the corresponding first computation graph, to obtain a scoring result for each second computation graph;
    sorting the second computation graphs in the at least one second computation graph based on the scoring result of each second computation graph, to obtain a sorted queue; and
    determining the target computation graph among the at least one second computation graph based on the sorted queue.
  9. The method according to claim 8, wherein scoring, based on the preset video memory overhead and the preset trade-off ratio, the video memory peak and running duration of each second computation graph and the running duration of the corresponding first computation graph, to obtain the scoring result for each second computation graph, comprises:
    determining a video memory score of each second computation graph based on the video memory peak of that second computation graph, the preset video memory overhead, and the preset trade-off ratio;
    determining a running duration score of each second computation graph based on the preset trade-off ratio, the running duration of that second computation graph, and the running duration of the corresponding first computation graph; and
    determining the scoring result of each second computation graph based on the video memory score and the running duration score of that second computation graph.
  10. The method according to claim 8 or 9, wherein determining the target computation graph among the at least one second computation graph based on the sorted queue comprises:
    searching the sorted queue of the at least one second computation graph for a first candidate computation graph with the best scoring result; and
    in response to the video memory space required by the found first candidate computation graph satisfying the preset video memory overhead, determining the first candidate computation graph as the target computation graph.
  11. The method according to claim 10, wherein determining the target computation graph among the at least one second computation graph based on the sorted queue comprises:
    in response to the video memory space required by the first candidate computation graph not satisfying the preset video memory overhead, adjusting the first candidate computation graph based on target running data of the first candidate computation graph to obtain at least one third computation graph;
    updating the sorted queue based on scoring results of the at least one third computation graph to obtain an updated sorted queue;
    searching, in the updated sorted queue, whether the video memory space required by a second candidate computation graph with the best scoring result satisfies the preset video memory overhead; and
    in response to the video memory space required by the second candidate computation graph not satisfying the preset video memory overhead and the number of searches reaching a preset count threshold, determining the computation graph with the best scoring result in the sorted queue corresponding to the last search as the target computation graph.
  12. The method according to any one of claims 1 to 11, wherein determining, based on the target computation graph, the video memory space required by the preset network model comprises:
    determining the video memory space required by the target computation graph as the video memory space required for training the preset network model.
  13. A video memory optimization apparatus, wherein the apparatus comprises:
    a first generation module, configured to generate a first computation graph based on a preset network model;
    a first determination module, configured to determine an association relationship between a video memory peak of the first computation graph and running data;
    a second generation module, configured to adjust the first computation graph based on the association relationship to generate at least one second computation graph;
    a second determination module, configured to determine a target computation graph among the at least one second computation graph based on the video memory peak and the running duration of the at least one second computation graph; and
    a third determination module, configured to determine, based on the target computation graph, the video memory space required by the preset network model.
  14. The apparatus according to claim 13, wherein the first generation part comprises:
    a first generation sub-part, configured to generate computation graph information in a data exchange format based on the preset network model; and
    a second generation sub-part, configured to generate, based on an operator queue in the computation graph information, the first computation graph matching the computation graph information.
  15. The apparatus according to claim 13 or 14, wherein the first determination part comprises:
    a first determination sub-part, configured to determine an occurrence time of the video memory peak in the first computation graph;
    a second determination sub-part, configured to determine running data of operators in the first computation graph; and
    a third determination sub-part, configured to determine a generation time of the running data and an application time of the running data in the first computation graph, and to determine the timing relationship of the generation time and the application time relative to the occurrence time of the video memory peak as the association relationship.
  16. The apparatus according to any one of claims 13 to 15, wherein the second generation part comprises:
    a fourth determination sub-part, configured to determine, in the first computation graph, target running data whose association relationship satisfies a preset condition; and
    a first adjustment sub-part, configured to adjust the first computation graph based on the target running data to generate the at least one second computation graph.
  17. The apparatus according to claim 16, wherein the fourth determination sub-part comprises:
    a first determination unit, configured to determine, among the running data of the operators in the first computation graph, running data whose generation time is before the occurrence time of the video memory peak and whose application time is after the occurrence time of the video memory peak, as the target running data satisfying the preset condition.
  18. A computer storage medium, wherein computer-executable instructions are stored on the computer storage medium, and the computer-executable instructions, when executed, implement the video memory optimization method according to any one of claims 1 to 12.
  19. A computer device, wherein the computer device comprises a memory and a processor, computer-executable instructions are stored on the memory, and the processor, when running the computer-executable instructions on the memory, implements the video memory optimization method according to any one of claims 1 to 12.
  20. A computer program product, the computer program product comprising a computer program or instructions which, when run on an electronic device, cause the electronic device to execute the video memory optimization method according to any one of claims 1 to 12.
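As a non-limiting illustration of the scoring and queue-based search recited in claims 8 to 11, one possible sketch is given below. The claims do not state a scoring formula, so the weighted sum in `score` is an assumption, as are the `expand` and `measure` callbacks, the choice to remove a candidate from the queue once it has been expanded, and the toy graph labels and measurements.

```python
import heapq
from itertools import count


def score(peak, runtime, base_runtime, budget, ratio):
    """Hypothetical score, lower is better. `ratio` in [0, 1] weighs the
    video memory term against the running duration term."""
    memory_score = peak / budget             # peak relative to the preset overhead
    runtime_score = runtime / base_runtime   # slowdown relative to the first graph
    return ratio * memory_score + (1.0 - ratio) * runtime_score


def search(first_graph, expand, measure, budget, ratio, max_searches):
    """Pick a target graph from the graphs derived from `first_graph`.

    expand(graph)  -> list of adjusted graphs          (assumed callback)
    measure(graph) -> (video memory peak, runtime)     (assumed callback)
    """
    _, base_runtime = measure(first_graph)
    queue, tie, fallback = [], count(), first_graph

    def push(graphs):
        for g in graphs:
            peak, runtime = measure(g)
            s = score(peak, runtime, base_runtime, budget, ratio)
            heapq.heappush(queue, (s, next(tie), g))   # tie keeps ordering stable

    push(expand(first_graph))                 # the second computation graphs
    for _ in range(max_searches):
        if not queue:
            break
        _, _, fallback = heapq.heappop(queue)  # best-scoring candidate
        if measure(fallback)[0] <= budget:
            return fallback                    # fits the preset overhead
        push(expand(fallback))                 # third computation graphs
    # Search count exhausted: best graph left in the last queue, if any.
    return queue[0][2] if queue else fallback


# Toy data: graphs are plain labels with made-up measurements.
stats = {"g0": (18, 10.0), "g1": (16, 10.5), "g2": (11, 11.0)}
children = {"g0": ["g1"], "g1": ["g2"], "g2": []}

target = search("g0", children.__getitem__, stats.__getitem__,
                budget=12, ratio=0.5, max_searches=5)
print(target)   # g2
```

A heap stands in for the sorted queue here because only the best-scoring entry is ever inspected; a fully sorted list would serve equally well.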
PCT/CN2022/093101 2021-10-27 2022-05-16 Video memory optimization method and apparatus, device, storage medium and program product WO2023071149A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111254294.7 2021-10-27
CN202111254294.7A CN114003306B (en) 2021-10-27 2021-10-27 Video memory optimization method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023071149A1 true WO2023071149A1 (en) 2023-05-04

Family

ID=79924245

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/093101 WO2023071149A1 (en) 2021-10-27 2022-05-16 Video memory optimization method and apparatus, device, storage medium and program product

Country Status (2)

Country Link
CN (1) CN114003306B (en)
WO (1) WO2023071149A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114003306B (en) * 2021-10-27 2024-03-15 上海商汤科技开发有限公司 Video memory optimization method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309479A (en) * 2020-02-14 2020-06-19 北京百度网讯科技有限公司 Method, device, equipment and medium for realizing task parallel processing
US20200293838A1 (en) * 2019-03-13 2020-09-17 Deepmind Technologies Limited Scheduling computation graphs using neural networks
US20210019184A1 (en) * 2019-07-17 2021-01-21 Google Llc Scheduling operations on a computation graph
CN112767230A (en) * 2021-02-26 2021-05-07 清华大学 GPU graph neural network optimization method and device
CN112948079A (en) * 2021-02-18 2021-06-11 北京百度网讯科技有限公司 Task scheduling method, device, equipment and computer storage medium
CN114003306A (en) * 2021-10-27 2022-02-01 上海商汤科技开发有限公司 Video memory optimization method, device, equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321999B (en) * 2018-03-30 2021-10-01 赛灵思电子科技(北京)有限公司 Neural network computational graph optimization method
US11385875B2 (en) * 2019-01-31 2022-07-12 Google Llc Propagating reduced-precision on computation graphs
CN110908667B (en) * 2019-11-18 2021-11-16 北京迈格威科技有限公司 Method and device for joint compilation of neural network and electronic equipment
CN111158901B (en) * 2019-12-09 2023-09-08 爱芯元智半导体(宁波)有限公司 Optimization method, optimization device, computer equipment and storage medium for calculation graph
CN111338635B (en) * 2020-02-20 2023-09-12 腾讯科技(深圳)有限公司 Graph compiling method, device, equipment and storage medium for calculation graph
CN113449858A (en) * 2020-03-27 2021-09-28 华为技术有限公司 Processing method of neural network model and related equipment
CN113469353A (en) * 2020-03-31 2021-10-01 上海商汤智能科技有限公司 Neural network model optimization method, data processing method and device
CN113296780A (en) * 2020-11-16 2021-08-24 阿里巴巴集团控股有限公司 Processing method, device and equipment of calculation graph
CN112882830A (en) * 2021-02-03 2021-06-01 北京迈格威科技有限公司 Video memory management method, video memory management device, model training device, electronic equipment and storage medium
CN112947933A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Operator execution method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN114003306B (en) 2024-03-15
CN114003306A (en) 2022-02-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22885062

Country of ref document: EP

Kind code of ref document: A1