WO2023071149A1 - Video memory optimization method and apparatus, device, storage medium and program product

Info

Publication number: WO2023071149A1
Application number: PCT/CN2022/093101
Authority: WO
Inventors: 赵成钢; 颜子杰; 张宇帆; 张行程
Original assignee: 上海商汤智能科技有限公司
Priority date: 2021-10-27
Filing date: 2022-05-16
Publication date: 2023-05-04
Also published as: CN114003306A; CN114003306B

Abstract

Embodiments of the present disclosure provide a video memory optimization method and apparatus, a device, a storage medium and a program product. The method comprises: generating a first calculation graph on the basis of a preset network model; determining an association relationship between a video memory peak value of the first calculation graph and operation data; adjusting the first calculation graph on the basis of the association relationship to generate at least one second calculation graph; determining a target calculation graph in the at least one second calculation graph on the basis of the video memory peak value and the operation duration of the at least one second calculation graph; and determining a video memory space required by the preset network model on the basis of the target calculation graph.

Description

A video memory optimization method, device, equipment, storage medium and program product

Cross References to Related Applications

This patent application claims priority to Chinese patent application No. 202111254294.7, filed on October 27, 2021 by Shanghai Shangtang Technology Development Co., Ltd. and entitled "a video memory optimization method, device, equipment and storage medium", the entirety of which is incorporated into this disclosure by reference.

Technical Field

Embodiments of the present disclosure relate to the technical field of data processing, and relate to, but are not limited to, a video memory optimization method, device, equipment, storage medium, and program product.

Background

With the rapid development of deep learning, the training of large models with very large numbers of parameters, and even of super-large models, has gradually come into view. As the models being trained become larger and deeper, video memory overhead inevitably increases. When the model is further enlarged or the batch size is increased, the video memory used for training grows as well and eventually exceeds the video memory capacity of the graphics card, so that the model cannot be trained.

Summary

Embodiments of the present disclosure provide a technical solution for video memory optimization.

The technical solutions of the embodiments of the present disclosure are implemented as follows:

An embodiment of the present disclosure provides a video memory optimization method, the method comprising:

generating a first calculation graph based on a preset network model;

determining an association relationship between the video memory peak of the first calculation graph and running data;

adjusting the first calculation graph based on the association relationship to generate at least one second calculation graph;

determining a target calculation graph in the at least one second calculation graph based on the video memory peak and running time of the at least one second calculation graph;

determining, based on the target calculation graph, the video memory space required by the preset network model.

In some embodiments, generating the first computation graph based on the preset network model includes: generating computation graph information in a data exchange format based on the preset network model; and generating, based on the operator queue in the computation graph information, a first computation graph matching the computation graph information. In this way, a deep learning framework is used to generate computation graph information in JSON format for the model to be trained, from which the first computation graph is obtained; the first computation graph can then be adjusted through various optimization schemes to generate multiple second computation graphs, which makes it easier to screen for the computation graph with the optimal video memory overhead.

In some embodiments, determining the association relationship between the video memory peak of the first computation graph and the running data includes: determining the occurrence time of the video memory peak in the first computation graph; determining the running data of the operators in the first computation graph; determining the generation time of the running data and the application time of the running data in the first computation graph; and determining the timing relationship between the generation time and the application time on the one hand and the occurrence time of the video memory peak on the other as the association relationship. In this way, by analyzing the timing relationship between the time at which an operator generates its running data, the time at which the running data is applied, and the time at which the video memory peak is reached, it can be determined whether the operator needs to be moved to reduce the peak.

In some embodiments, adjusting the first computation graph based on the association relationship to generate at least one second computation graph includes: determining, in the first computation graph, target running data whose association relationship satisfies a preset condition; and adjusting the first computation graph based on the target running data to generate the at least one second computation graph. In this way, a second computation graph is obtained by moving the operator corresponding to the target running data in the first computation graph, so that the video memory peak of the second computation graph can be reduced.

In some embodiments, determining, in the first computation graph, the target running data whose association relationship satisfies the preset condition includes: determining, among the running data of the operators in the first computation graph, the running data whose generation time is before the occurrence time of the video memory peak and whose application time is after the occurrence time of the video memory peak as the target running data satisfying the preset condition.

In some embodiments, adjusting the first computation graph based on the target running data to generate the at least one second computation graph includes: determining, in the first computation graph, the target operator corresponding to the target running data; and adjusting the target operator in the first computation graph based on the occurrence time of the video memory peak in the first computation graph to generate the at least one second computation graph. In this way, multiple second computation graphs can be generated from the first computation graph by moving target operators according to the occurrence time of its video memory peak.

In some embodiments, adjusting the target operator in the first computation graph based on the occurrence time of the video memory peak in the first computation graph to generate the at least one second computation graph includes: in the first computation graph, adjusting the execution time of the target operator to after the occurrence time of the video memory peak, and thereby generating the second computation graph. In this way, by moving the target operator to after the video memory peak, the peak of the newly generated second computation graph can be reduced, thereby optimizing the video memory space required by the second computation graph.

In some embodiments, determining the target computation graph in the at least one second computation graph based on the video memory peak and running time of the at least one second computation graph includes: obtaining a preset video memory overhead and a preset trade-off ratio, where the preset trade-off ratio is used to weigh the running time of a computation graph against the video memory it requires; scoring, based on the preset video memory overhead and the preset trade-off ratio, the video memory peak and running time of each second computation graph together with the running time of the corresponding first computation graph, to obtain a scoring result for each second computation graph; sorting the second computation graphs in the at least one second computation graph based on their scoring results to obtain a sorted queue; and determining the target computation graph in the at least one second computation graph based on the sorted queue. In this way, the target computation graph with the optimal video memory overhead can be found by searching the queue.

In some embodiments, scoring, based on the preset video memory overhead and the preset trade-off ratio, the video memory peak and running time of each second computation graph together with the running time of the corresponding first computation graph, to obtain the scoring result of each second computation graph, includes: determining a video memory score for each second computation graph based on the video memory peak of that second computation graph, the preset video memory overhead, and the preset trade-off ratio; determining a running time score for each second computation graph based on the preset trade-off ratio, the running time of that second computation graph, and the running time of the corresponding first computation graph; and determining the scoring result of each second computation graph based on its video memory score and running time score. In this way, both the running time and the video memory peak of a second computation graph are taken into account when it is scored, so that the final computation graph optimizes the video memory space without sacrificing a large amount of time.

In some embodiments, determining the target computation graph in the at least one second computation graph based on the sorted queue includes: searching the sorted queue for the first candidate computation graph with the best scoring result; and in response to the video memory space required by the first candidate computation graph satisfying the preset video memory overhead, determining the first candidate computation graph as the target computation graph. In this way, the number of searches can be kept as small as possible, and the video memory overhead of the target computation graph found can be reduced.

In some embodiments, determining the target computation graph in the at least one second computation graph based on the sorted queue includes: in response to the video memory space required by the first candidate computation graph not satisfying the preset video memory overhead, adjusting the first candidate computation graph based on its target running data to obtain at least one third computation graph; updating the sorted queue based on the scoring result of the at least one third computation graph to obtain an updated sorted queue; checking, in the updated sorted queue, whether the video memory space required by the second candidate computation graph with the best scoring result satisfies the preset video memory overhead; and in response to the video memory space required by the second candidate computation graph not satisfying the preset video memory overhead and the number of searches reaching a preset threshold, determining the computation graph with the best scoring result in the sorted queue of the last search as the target computation graph. In this way, if no computation graph satisfying the preset video memory overhead can be found, the best-scoring computation graph in the latest sorted queue is used as the target computation graph, so that the target computation graph found is the one with the optimal space overhead.
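The queue search described in the two preceding paragraphs can be sketched roughly as follows. This is a minimal illustration, not the disclosed implementation: the function `search_target` and its parameters `score`, `peak`, `expand`, `budget` and `max_searches` are hypothetical names, a lower score is assumed to be better, and the candidate that is examined is removed from the queue before its adjusted variants are inserted, which is only one possible reading of the text.

```python
import heapq
from itertools import count


def search_target(initial, score, peak, expand, budget, max_searches):
    """Search a score-ordered queue for a graph whose peak fits the budget.

    initial      -- iterable of candidate (second) computation graphs
    score        -- callable: graph -> number, lower is assumed to be better
    peak         -- callable: graph -> video memory peak of the graph
    expand       -- callable: graph -> iterable of adjusted (third) graphs
    budget       -- preset video memory overhead
    max_searches -- preset threshold on the number of searches
    """
    tie = count()  # tie-breaker so that graphs themselves are never compared
    queue = [(score(g), next(tie), g) for g in initial]
    heapq.heapify(queue)
    last = None
    for _ in range(max_searches):
        if not queue:
            break
        last = heapq.heappop(queue)  # best-scoring candidate
        graph = last[2]
        if peak(graph) <= budget:
            return graph  # candidate satisfies the preset overhead
        for child in expand(graph):  # adjust the candidate again
            heapq.heappush(queue, (score(child), next(tie), child))
    # Threshold reached: fall back to the best-scoring graph seen in the
    # last search (the last candidate or the current head of the queue).
    fallback = ([last] if last is not None else []) + queue[:1]
    return min(fallback)[2] if fallback else None


# Toy usage: a "graph" is just its peak value, and each adjustment
# lowers the peak by one unit.
found = search_target([8, 9], score=lambda g: g, peak=lambda g: g,
                      expand=lambda g: [g - 1], budget=5, max_searches=10)
capped = search_target([8, 9], score=lambda g: g, peak=lambda g: g,
                       expand=lambda g: [g - 1], budget=5, max_searches=2)
```

In the toy usage, the first call keeps adjusting until a graph fits the budget, while the second call stops at the search threshold and returns the best-scoring graph available at that point.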

In some embodiments, determining the video memory space required by the preset network model based on the target computation graph includes: determining the video memory space required by the target computation graph as the video memory space required for training the preset network model. In this way, the video memory space required for training the network model can be optimized without sacrificing the running time required for training the network model.

An embodiment of the present disclosure provides a video memory optimization device, the device comprising:

The first generation module is configured to generate a first calculation graph based on a preset network model;

The first determination module is configured to determine the correlation between the peak value of the video memory of the first calculation graph and the running data;

The second generation module is configured to adjust the first calculation graph based on the association relationship, and generate at least one second calculation graph;

The second determination module is configured to determine a target calculation graph in the at least one second calculation graph based on the peak value of the video memory and the running time of the at least one second calculation graph;

The third determination module is configured to determine the video memory space required by the preset network model based on the target computation graph.

An embodiment of the present disclosure provides a computer storage medium, on which computer-executable instructions are stored. After the computer-executable instructions are executed, the above video memory optimization method can be realized.

An embodiment of the present disclosure provides a computer device. The computer device includes a memory and a processor; computer-executable instructions are stored in the memory, and the above video memory optimization method is implemented when the processor runs the computer-executable instructions in the memory.

An embodiment of the present disclosure also provides a computer program product, the computer program product includes computer readable code, and when the computer readable code runs in an electronic device, a processor of the electronic device executes any one of the above-mentioned The video memory optimization method described in the embodiment.

Embodiments of the present disclosure provide a video memory optimization method, device, equipment, storage medium, and program product. For the acquired preset network model, a first computation graph is first generated, and the association relationship between the video memory peak required to run the first computation graph and the running data of each operator is analyzed, so that at least one second computation graph can be generated by optimizing the first computation graph. Then, the video memory peak of each second computation graph and the running time required to run it are considered together, and the target computation graph is searched for among the multiple second computation graphs; in this way, the target computation graph found optimizes the video memory space while also taking the running time into account. Finally, the video memory space required by the preset network model is optimized through the target computation graph. Because the time overhead of each computation graph is included in the evaluation while searching the generated second computation graphs for the one with the optimal video memory overhead, the final target computation graph satisfies both the space budget and the time budget, thereby optimizing the video memory space without sacrificing a large amount of time.

Brief Description of the Drawings

In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings used in the embodiments are briefly introduced below. The accompanying drawings are incorporated into and constitute a part of this specification; they show embodiments consistent with the present disclosure and are used together with the description to illustrate the technical solutions of the embodiments of the present disclosure. It should be understood that the following drawings show only some embodiments of the present disclosure and therefore should not be regarded as limiting its scope. Those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

FIG. 1 is a schematic diagram of an implementation flow of a video memory optimization method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic flow diagram of another implementation of the video memory optimization method provided by the embodiment of the present disclosure;

FIG. 3A is a schematic flow diagram of another implementation of the video memory optimization method provided by the embodiment of the present disclosure;

FIG. 3B is a schematic diagram of an application scenario of a video memory optimization method provided by an embodiment of the present disclosure;

FIG. 4A is a schematic flow diagram of yet another implementation of the video memory optimization method provided by the embodiment of the present disclosure;

FIG. 4B is a schematic diagram of another application scenario of the video memory optimization method provided by the embodiment of the present disclosure;

FIG. 5 is a schematic diagram of the video memory usage curve of a computation graph provided by an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of an implementation flow of a video memory optimization method provided by an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of the structure and composition of a video memory optimization device according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of the composition and structure of a computer device according to an embodiment of the present disclosure.

Detailed Description

In order to make the purpose, technical solutions and advantages of the embodiments of the present disclosure clearer, the specific technical solutions of the invention will be further described in detail below in conjunction with the drawings in the embodiments of the present disclosure. The following examples are used to illustrate the present disclosure, but not to limit the scope of the present disclosure.

In the following description, references to "some embodiments" describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or a different subset of all possible embodiments, and can be combined with each other where there is no conflict.

In the following description, the terms "first\second\third" are only used to distinguish similar objects and do not represent a specific ordering of the objects. It should be understood that, where permitted, the specific order or sequence of "first\second\third" may be interchanged, so that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terms used herein are only for the purpose of describing the embodiments of the present disclosure, and are not intended to limit the present disclosure.

Before the embodiments of the present disclosure are further described in detail, the nouns and terms involved in the embodiments of the present disclosure are described, and the nouns and terms involved in the embodiments of the present disclosure are applicable to the following explanations.

(1) Calculation graph, which is used to graphically represent the calculation process. A computational graph is a "language" for describing equations. Since it is a graph, it has nodes (for example, variables) and edges (operations (for example, simple functions)). In the field of deep learning, a neural network model can essentially be represented by a computational graph, and its training process can be divided into three parts: forward propagation, back propagation, and parameter update.

(2) Video memory, also called frame buffer, is used to store the data processed by the graphics card chip or the rendering data to be extracted. Like computer memory, video memory is a component used to store graphics information to be processed.

An exemplary application of the video memory optimization device provided by the embodiments of the present disclosure is described below. The device may be implemented as various types of user terminals, such as a notebook computer, a tablet computer, a desktop computer, a camera, or a mobile device (for example, a personal digital assistant, a dedicated messaging device, or a portable game device), and may also be implemented as a server. Below, an exemplary application in which the device is implemented as a terminal or a server is described.

The method can be applied to a computer device, and the functions realized by the method can be realized by a processor in the computer device calling program code. The program code can be stored in a computer storage medium. It can be seen that the computer device includes at least a processor and a storage medium.

An embodiment of the present disclosure provides a video memory optimization method, as shown in FIG. 1 , which is described in conjunction with the steps shown in FIG. 1 :

Step S101, generating a first computation graph based on a preset network model.

In some embodiments, the preset network model may be any type of network model to be trained, such as a deep neural network model to be trained, a residual network model to be trained, or any large-scale neural network model to be trained. The first calculation graph is a graphical representation of the calculation process of the preset network model, including connected nodes and edges. Wherein, the nodes represent each operator executing tasks in the computation graph, and the edges are used to connect each operator according to the execution sequence of each task in the network model.

In some possible implementations, the preset network model is input into a deep learning framework (such as TensorFlow, PyTorch, or Parrots), and computation graph information in JavaScript Object Notation (JSON) format is generated for the network model. The operator queue is arranged by reading in the JSON-format computation graph information, and the input data and output data of each operator are matched with the running data in turn, so as to generate the first computation graph. In this way, a computation graph can be generated from the JSON-format computation graph information produced by any of these deep learning frameworks. In a specific example, the network model is an image recognition network to be trained, which includes an input layer, a convolutional layer, a pooling layer, and a fully connected layer. First, the image recognition network is input into different deep learning frameworks to extract computation graph information in JSON format, yielding the JSON-format computation graph information corresponding to each framework; in the computation graph information, the input layer, convolutional layer, pooling layer, and fully connected layer are each represented as operators that execute different tasks. Then, according to the execution order of these operators, the operators are connected by edges to obtain the first computation graph representing the image recognition network.
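As a rough illustration of reading JSON-format computation graph information into an operator queue, consider the sketch below. The JSON layout (the `graph_inputs` and `operators` keys and the per-operator `name`, `inputs` and `outputs` fields) is an assumption made for this example; the text does not specify the schema that any particular framework emits, and `Operator`, `build_graph` and `edges` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    inputs: list   # names of the tensors the operator consumes
    outputs: list  # names of the tensors the operator produces


def build_graph(graph_json):
    """Arrange the operator queue and check that every input is available."""
    info = json.loads(graph_json)
    ops = [Operator(o["name"], o["inputs"], o["outputs"])
           for o in info["operators"]]
    available = set(info.get("graph_inputs", []))
    for op in ops:
        missing = [t for t in op.inputs if t not in available]
        if missing:
            raise ValueError(f"{op.name} uses undefined data: {missing}")
        available.update(op.outputs)
    return ops


def edges(ops):
    """Connect each producing operator to the operators that use its output."""
    producer = {t: op.name for op in ops for t in op.outputs}
    return [(producer[t], op.name)
            for op in ops for t in op.inputs if t in producer]


# Hypothetical graph information for a small image recognition network.
EXAMPLE = json.dumps({
    "graph_inputs": ["image"],
    "operators": [
        {"name": "conv", "inputs": ["image"], "outputs": ["feat"]},
        {"name": "pool", "inputs": ["feat"], "outputs": ["pooled"]},
        {"name": "fc", "inputs": ["pooled"], "outputs": ["logits"]},
    ],
})
graph = build_graph(EXAMPLE)
graph_edges = edges(graph)
```

Here the nodes are the operators in execution order and `graph_edges` holds the producer-to-consumer connections, which mirrors the description of nodes and edges given for the first computation graph.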

Step S102, determining the correlation between the peak value of the video memory and the running data of the first calculation graph.

In some embodiments, the peak value of the video memory of the first computation graph is the peak value of the video memory space required by the first computation graph during operation. By traversing the first computation graph, it is possible to determine the video memory space required to run all the operators in the first computation graph sequentially, as well as the video memory peak value that occurs during the entire running process. The running data refers to the running data of the first computation graph. The running data of the first computation graph includes: data generated by each operator during the process of traversing the first computation graph. In this step, by traversing the first computation graph, the peak value of the video memory required by the first computation graph and the operating data of each operator can be obtained, so as to analyze the time when the video memory peak value appears and the operating data of each operator The timing relationship between the generation time and the application time of the running data; and the timing relationship is used as the association relationship between the video memory peak value of the first calculation graph and the running data. For example, the generation time of the operation data of some operators and the application time of the operation data are both after the occurrence time of the peak value of the video memory, or the generation time of the operation data of some operators and the application time of the operation data are both before the time of the peak value of the video memory. Or, the running data generation time of some operators is before the time when the video memory peak occurs, and the application time of the running data is after the video memory peak time.
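The traversal described above can be illustrated with a small sketch. It rests on simplifying assumptions that are not taken from the text: each operator produces exactly one piece of running data of a known size, that data occupies video memory from the step at which it is generated until the step of its last use, and the "application time" is taken to be the step of the first operator that uses it. The names `Op`, `analyse` and the relation labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    inputs: tuple   # names of the operators whose output this one uses
    out_size: int   # size of the running data it produces (arbitrary units)


def analyse(ops):
    """Return (peak step, peak value, timing relation of each operator)."""
    gen = {op.name: i for i, op in enumerate(ops)}   # generation time
    first_use, last_use = {}, dict(gen)
    for i, op in enumerate(ops):
        for src in op.inputs:
            first_use.setdefault(src, i)             # application time
            last_use[src] = max(last_use[src], i)
    # Video memory in use at each step: all data that is currently alive.
    curve = [sum(op.out_size for op in ops
                 if gen[op.name] <= step <= last_use[op.name])
             for step in range(len(ops))]
    peak = max(curve)
    peak_step = curve.index(peak)
    relation = {}
    for op in ops:
        g = gen[op.name]
        u = first_use.get(op.name, g)
        if g > peak_step:
            relation[op.name] = "after_peak"    # generated and used after it
        elif u < peak_step:
            relation[op.name] = "before_peak"   # generated and used before it
        elif g < peak_step < u:
            relation[op.name] = "spans_peak"    # generated before, used after
        else:
            relation[op.name] = "touches_peak"  # generated or used at the peak
    return peak_step, peak, relation


OPS = [
    Op("x", (), 1),
    Op("w", ("x",), 4),        # produced early, only needed by "out"
    Op("h1", ("x",), 3),
    Op("h2", ("h1",), 3),
    Op("out", ("h2", "w"), 1),
]
peak_step, peak, relation = analyse(OPS)
```

In this toy graph the data produced by `w` is generated before the peak and first applied after it, which is the third case listed in the paragraph above.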

Step S103, based on the association relationship, adjust the first computation graph to generate at least one second computation graph.

In some embodiments, at least one optimization scheme for the first computation graph is determined according to the timing relationship between the occurrence time of the video memory peak in the first computation graph, the generation time of each operator's running data, and the application time of that running data; the first computation graph is then adjusted based on each optimization scheme to generate a corresponding second computation graph. In some possible implementations, the first computation graph is adjusted using any one of its optimization schemes, and the adjusted first computation graph is used as a second computation graph; that is, a second computation graph is obtained by adjusting the first computation graph with an optimization scheme. In this way, when multiple optimization schemes are determined for the first computation graph, multiple second computation graphs can be obtained by adjusting the first computation graph with each optimization scheme in turn.

In some possible implementations, the first computation graph is adjusted by analyzing the association relationship between the video memory peak and the running data. If the association relationship satisfies the preset condition, that is, if the running data of an operator in the first computation graph is generated before the occurrence time of the video memory peak but applied after the occurrence time of the video memory peak, the execution time of that operator in the first computation graph is moved to after the video memory peak; this adjusts the first computation graph and yields a second computation graph. In other words, the operators corresponding to the target running data are screened out in the first computation graph, where the target running data is running data whose generation time is before the occurrence time of the video memory peak and whose application time is after it, and at least one second computation graph is obtained by moving the operators corresponding to the target running data in the first computation graph.

In some possible implementations, the preset condition can be set according to the timing relationship between the time at which an operator's running data is generated, the time at which the running data is applied, and the time at which the video memory peak occurs. For example, the preset condition is set such that the generation time of the operator's running data is before the occurrence time of the video memory peak and the application time of the running data is after the occurrence time of the video memory peak. In this way, among the running data of the operators in the first computation graph, the target running data whose generation time is before the occurrence time of the video memory peak and whose application time is after it is filtered out. Each piece of target running data is used as one optimization scheme for the first computation graph, and at least one second computation graph is generated by moving the operators corresponding to the target running data.
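A minimal sketch of generating second computation graphs by moving operators is given below. It makes assumptions that go beyond the text: a computation graph is represented as an execution order of operators, each operator produces one piece of running data that stays in video memory from its generation until its last use, and a target operator is moved to the position immediately before its first user. `Op`, `memory_curve` and `second_graphs` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    inputs: tuple   # names of the operators whose output this one uses
    out_size: int   # size of the running data it produces (arbitrary units)


def memory_curve(order):
    """Video memory in use at each step of the given execution order."""
    gen = {op.name: i for i, op in enumerate(order)}
    last = dict(gen)
    for i, op in enumerate(order):
        for src in op.inputs:
            last[src] = max(last[src], i)
    return [sum(op.out_size for op in order
                if gen[op.name] <= step <= last[op.name])
            for step in range(len(order))]


def second_graphs(order):
    """One re-ordered graph per operator whose data spans the peak."""
    curve = memory_curve(order)
    peak_step = curve.index(max(curve))
    results = []
    for idx, op in enumerate(order):
        uses = [i for i, user in enumerate(order) if op.name in user.inputs]
        # Preset condition: generated before the peak, applied after it.
        if idx < peak_step and uses and min(uses) > peak_step:
            first_use = min(uses)
            # Move the operator to just before its first user, i.e. to a
            # position after the peak; its own inputs are still produced
            # earlier, so the new order remains executable.
            moved = (order[:idx] + order[idx + 1:first_use]
                     + [op] + order[first_use:])
            results.append(moved)
    return results


FIRST = [
    Op("x", (), 1),
    Op("w", ("x",), 4),        # generated early, first applied by "out"
    Op("h1", ("x",), 3),
    Op("h2", ("h1",), 3),
    Op("out", ("h2", "w"), 1),
]
candidates = second_graphs(FIRST)
old_peak = max(memory_curve(FIRST))
new_peak = max(memory_curve(candidates[0]))
```

In this toy example only `w` satisfies the preset condition, so a single second computation graph is produced, and its peak is lower than that of the first graph. Delaying an operator also keeps its inputs alive longer, so a move does not always lower the peak; each candidate has to be evaluated again.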

Step S104: Determine a target computation graph in the at least one second computation graph based on the peak value of the video memory and the runtime of the at least one second computation graph.

In some embodiments, by traversing a second computation graph, the peak of the video memory space required by the second computation graph (that is, the video memory peak of the second computation graph) and the time required to run it (that is, the running time of the second computation graph) are obtained. The target computation graph can be obtained by taking the running time of each second computation graph into account and searching the multiple second computation graphs for a computation graph whose video memory overhead satisfies the preset video memory space.

In some possible implementations, each second computation graph is re-run to obtain its video memory peak and running time. A preset video memory overhead budget and a parameter that sets the trade-off between running time and video memory space are obtained and combined with the video memory peak and running time of the newly generated second computation graph and the running time of the original computation graph corresponding to it, so as to determine the score of the second computation graph. According to the score of each second computation graph, a target computation graph whose video memory overhead satisfies the preset video memory space, or a target computation graph with the optimal video memory overhead, is searched for among the multiple second computation graphs.
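The scoring step can be illustrated with a hedged sketch. The passage above names the inputs of the score (the video memory budget, the trade-off parameter, the peak and running time of the second computation graph, and the running time of the original graph) but does not give a formula, so the formula below is purely an assumption chosen for illustration: the video memory score penalizes the amount by which the peak exceeds the budget, the running time score is the slowdown relative to the original graph, and a lower total is treated as better. `Candidate`, `score` and `rank` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    peak: float      # video memory peak of the second computation graph
    runtime: float   # running time of the second computation graph


def score(cand, base_runtime, budget, ratio):
    """Assumed scoring rule; lower is better.

    ratio in [0, 1] weighs running time against video memory:
    a larger ratio puts more weight on running time.
    """
    mem_score = (1.0 - ratio) * max(0.0, cand.peak - budget) / budget
    time_score = ratio * cand.runtime / base_runtime
    return mem_score + time_score


def rank(cands, base_runtime, budget, ratio):
    """Sort candidates into a queue, best score first."""
    return sorted(cands, key=lambda c: score(c, base_runtime, budget, ratio))


CANDS = [
    Candidate("A", peak=8.0, runtime=1.0),   # over budget, no slowdown
    Candidate("B", peak=6.0, runtime=1.5),   # within budget, slower
]
memory_first = rank(CANDS, base_runtime=1.0, budget=6.0, ratio=0.2)
time_first = rank(CANDS, base_runtime=1.0, budget=6.0, ratio=0.5)
```

With the smaller trade-off value the within-budget but slower candidate ranks first, and with the larger value the faster candidate does, which is the kind of balance between space and time that the trade-off parameter is meant to control.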

Step S105, based on the target computation graph, determine the display memory space required by the preset network model.

In some embodiments, by running the target computation graph, it is possible to estimate the video memory space required for training the preset network model and the time it takes. The running time of the target computing graph is used as the duration of training the preset network model, and the video memory space required for running the target computing graph is used as the video memory space required for training the preset network model, and as the model increases As well as the increase of the batch size, the optimized video memory space of the target calculation graph will also become larger, so that the video memory space for training the network model can be further optimized.

In the embodiment of the present disclosure, for the obtained preset network model, a first calculation graph representing the operation process of the network model is first generated, and the association between the video memory peak required to run the first calculation graph and the operation data of each operator is analyzed, so that at least one second calculation graph can be generated by optimizing the first calculation graph. Then, the video memory peak of each second calculation graph and the running time required to run it are considered together to search for the target calculation graph among the multiple second calculation graphs; in this way, the target calculation graph found not only optimizes the video memory space but also takes the running time into account. Finally, the video memory space required by the preset network model is optimized through the target calculation graph. In this way, the target calculation graph with the optimal memory overhead is searched for among the generated second calculation graphs, and the time overhead of each calculation graph is also taken into consideration during the search, so that the final target calculation graph satisfies both the space budget and the time budget, thereby optimizing the video memory space without sacrificing a lot of time cost.

In some embodiments, multiple first calculation graphs are generated by reading in JSON-formatted calculation graph information extracted by different deep learning frameworks, that is, the above step S101 can be implemented through the following steps S111 and S112 (not shown):

Step S111, based on the preset network model, generating computation graph information in a data exchange format.

In some embodiments, the preset network model is input into different deep learning frameworks to extract the calculation graph information of the preset network model in JSON file format. The calculation graph information includes the operators performing different tasks, the operands, the input and output of each operator, and so on.

Step S112, based on the operator queue in the computation graph information, generate a first computation graph matching the computation graph information.

In some embodiments, by analyzing the order in which each operator performs tasks in the preset neural network, the operators are arranged into a queue according to that order; the inputs and outputs of each operator are then matched in turn with the operands, so that the operators are connected through operation edges to form a first calculation graph. Based on the JSON-format calculation graph information extracted using different deep learning frameworks, multiple first calculation graphs can be obtained. In this way, by using a deep learning framework to generate JSON-format calculation graph information for the model that needs to be trained, the first calculation graph can be obtained, and the first calculation graph can then be adjusted through various optimization schemes to generate multiple second calculation graphs, making it easy to screen for the calculation graph with the optimal memory cost.
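Steps S111 and S112 can be illustrated with a minimal Python sketch. The JSON layout and the field names (`operators`, `name`, `inputs`, `outputs`) are assumptions made for illustration; the disclosure does not specify the schema that a deep learning framework would export.

```python
import json

# Hypothetical JSON layout; field names are illustrative assumptions.
GRAPH_JSON = """
{
  "operators": [
    {"name": "conv1", "inputs": ["x"],      "outputs": ["a"]},
    {"name": "relu1", "inputs": ["a"],      "outputs": ["b"]},
    {"name": "add1",  "inputs": ["a", "b"], "outputs": ["y"]}
  ]
}
"""

def build_graph(graph_json):
    """Return (queue, edges): operators in execution order and producer->consumer edges."""
    info = json.loads(graph_json)
    queue = [op["name"] for op in info["operators"]]  # operator queue in execution order
    producer = {}                                     # operand -> operator that outputs it
    edges = []
    for op in info["operators"]:
        for operand in op["inputs"]:
            if operand in producer:                   # match an input with an earlier output
                edges.append((producer[operand], op["name"], operand))
        for operand in op["outputs"]:
            producer[operand] = op["name"]
    return queue, edges

queue, edges = build_graph(GRAPH_JSON)
```

Each edge here is a triple (producing operator, consuming operator, operand), which corresponds to connecting operators through operation edges by matching inputs and outputs with operands.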

In some embodiments, by analyzing the timing relationship between the generation time and application time of the operators' operation data in the first calculation graph and the time when the peak value appears, the association between the video memory peak and the operation data in the first calculation graph is obtained; that is, the above step S102 can be realized through the following steps S121 to S124 (not shown in the figure):

Step S121, determining the occurrence time of the video memory peak in the first calculation graph.

In some embodiments, by traversing and running the first computing graph, it is possible to determine the moment when the video memory space required by the first computing graph reaches its peak; that is, the time within the entire running period at which the video memory peak occurs during the running of the first computing graph.

Step S122, determining operation data of operators in the first computation graph.

In some embodiments, by traversing and running each first calculation graph, the data generated by each operator in the first calculation graph during operation, the time when the data is generated, and the time when the data is applied are obtained; that is, the operation data, the generation time of the operation data and the application time of the operation data.

Step S123, determining the generation time of the operation data and the application time of the operation data in the first calculation graph.

In some embodiments, by traversing and running each first computation graph, it is possible to obtain the occurrence time of the video memory peak in the first computation graph, the generation time of the operation data generated by each operator during the operation process, and the application time of the operation data.

Step S124, determining the timing relationship between the generation time, the application time, and the occurrence time of the video memory peak as the association relationship.

In some embodiments, the correlation between the peak video memory and the running data is obtained by analyzing the timing relationship between the generation time of the running data, the application time of the running data, and the occurrence time of the video memory peak. In this way, by analyzing the timing relationship between the time of running data generated by the operator, the time when the running data is applied, and the time when the peak value of the video memory is reached, it can be further determined whether the operator needs to be moved to reduce the peak value.
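Steps S121 to S124 can be sketched as a small simulation. The model below is an assumption made for illustration: each operator occupies one time step, its output stays in video memory from its generation step until the last step that reads it, and sizes are arbitrary units. The names `profile` and `OPS` are hypothetical.

```python
def profile(ops):
    """ops: list of (name, output_size, [names of operators whose output is read])."""
    gen = {name: t for t, (name, _, _) in enumerate(ops)}  # generation time of each output
    last_use = dict(gen)                                   # application time (last read)
    for t, (_, _, inputs) in enumerate(ops):
        for src in inputs:
            last_use[src] = max(last_use[src], t)
    size = {name: nbytes for name, nbytes, _ in ops}
    # Memory in use at step t: every output already generated and still to be applied.
    usage = [sum(size[n] for n in size if gen[n] <= t <= last_use[n])
             for t in range(len(ops))]
    peak_time = max(range(len(ops)), key=usage.__getitem__)
    return gen, last_use, usage, peak_time

OPS = [("a", 4, []), ("b", 2, []), ("c", 6, ["b"]), ("d", 1, ["a", "c"])]
gen, last_use, usage, peak_time = profile(OPS)
```

In this toy graph the peak occurs at step 2, while the output of `a` is generated at step 0 and only applied at step 3, which is exactly the kind of timing relationship the association is meant to expose.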

In some embodiments, by analyzing the timing relationship between the generation time of an operator's operation data, the application time of the operation data, and the peak time, it is determined whether the operation data meets the preset condition, and the first calculation graph is then adjusted to generate at least one second calculation graph; that is, the above step S103 can be realized through the steps shown in FIG. 2, which are described below:

Step S201, in the first calculation graph, determine the target operating data whose association relationship satisfies a preset condition.

In some embodiments, among the operation data of the operators in the first calculation graph, the operation data whose generation time is before the occurrence time of the video memory peak and whose application time is after the occurrence time of the video memory peak is determined to be the target operating data that meets the preset condition. In some possible implementation manners, among the operation data of the operators in each first computation graph, the operation data whose generation time is before the occurrence time of the video memory peak is determined first; in response to the application time of that operation data being after the occurrence time of the video memory peak, the operation data is determined to be the target operating data.

In some embodiments, after running the first calculation graph, the operation data generated by each operator in the first calculation graph during operation is obtained, and from the operation data of these operators, the operation data whose generation time is before the occurrence of the video memory peak of the first calculation graph is found. That is, the operation data is generated by the operator before the video memory peak, which means that the video memory occupied by the operation data is included in the video memory peak. After such operation data is found, its application time is determined, and it is judged whether the application time is after the occurrence time of the video memory peak. If so, the operator generates the operation data before the occurrence time of the video memory peak while running the first calculation graph, but the operation data is applied only after the occurrence time of the video memory peak; in other words, the operation data contributes to the size of the video memory peak, yet it is not used before the peak is reached, only afterwards. Such operation data is used as the target operating data.
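The preset condition of step S201 reduces to a simple comparison of three times. The following Python sketch assumes the generation and application times have already been measured and are given as dictionaries keyed by a hypothetical operator name; the values are illustrative only.

```python
def find_target_data(gen, last_use, peak_time):
    """Return the names whose data is generated before the peak and applied after it."""
    return [name for name in gen if gen[name] < peak_time < last_use[name]]

# Illustrative times (in operator steps); not measured data.
gen = {"a": 0, "b": 1, "c": 2, "d": 3}
last_use = {"a": 3, "b": 2, "c": 3, "d": 3}
targets = find_target_data(gen, last_use, peak_time=2)
```

Only `a` qualifies: it occupies video memory at the peak without being needed until afterwards.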

Step S202, based on the target operation data, adjust the first calculation graph to generate the at least one second calculation graph.

In some embodiments, in the first calculation graph, the operator corresponding to the target operating data whose association relationship satisfies the preset condition is moved to after the peak time, and the second calculation graph is thereby obtained.

In the embodiment of the present disclosure, the target operating data, whose generation time is before the occurrence time of the video memory peak and whose application time is after the occurrence time of the video memory peak, is screened out in the first calculation graph. The operator corresponding to the target operating data is then moved in the first calculation graph to obtain the second calculation graph, so that the video memory peak of the second calculation graph can be reduced.

In some embodiments, the second computation graph is generated by moving the position of the operator corresponding to the target operation data in the first computation graph, and the target computation graph is searched for based on information such as the video memory peak of the second computation graph; that is, step S202 in FIG. 2 above can be realized through the steps shown in FIG. 3A. FIG. 3A is a schematic flow diagram of another implementation of the video memory optimization method provided by the embodiment of the present disclosure. The following description is made in conjunction with the steps shown in FIG. 3A:

Step S301, in the first computation graph, determine the target operator corresponding to the target operation data.

In some embodiments, for the first computation graph, among the multiple operators in the first computation graph, an operator that generates the target operation data, that is, a target operator is determined.

Step S302, based on the occurrence time of the video memory peak in the first computation graph, adjust the target operator in the first computation graph to generate the at least one second computation graph.

In some embodiments, after the first computation graph is traversed and run, the moment when the video memory reaches its peak in the first computation graph is obtained. The execution time of the target operator in the first computation graph is adjusted according to this moment, that is, the position of the target operator in the first computation graph is moved to obtain the second computation graph. In this way, by moving the target operators according to the occurrence time of the video memory peak in the first computation graph, multiple second computation graphs are obtained. As shown in FIG. 3B, if analyzing the optimization schemes of the first calculation graph 31 shows that it includes two target operators, then the first calculation graph 31 has two optimization schemes. Moving either target operator in the first computation graph 31 generates an optimized computation graph, giving two in total: the second computation graph 32 and the second computation graph 33.

In some possible implementations, the first calculation graph is optimized by moving the target operator to after the peak of the video memory to generate the second calculation graph, that is, the above step S302 can be implemented through the following process:

In the first computation graph, the execution time of the target operator is adjusted to after the occurrence time of the video memory peak, and the second computation graph is thereby generated.

Here, in the first computation graph, the execution time of the target operator is adjusted to after the occurrence time of the video memory peak, and a second computation graph corresponding to the first computation graph is generated. For the first calculation graph, the position of the target operator is moved: because the target operator generates the target operating data before the peak time, while the target operating data is applied only after the peak time, moving the execution time of the target operator to after the occurrence time of the video memory peak reduces the peak value of the generated second calculation graph. In this way, by moving the target operator to after the video memory peak, the peak value of the newly generated second computation graph can be reduced, thereby optimizing the video memory space required by the second computation graph.
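The effect of moving a target operator can be checked on a toy operator queue. This is a sketch under simplifying assumptions: one operator per time step, an output held in memory from generation to its last read, and the target reinserted just after the peak step but before its first consumer so that dependencies stay valid. The helper names `peak_memory` and `move_after_peak` are hypothetical.

```python
def peak_memory(ops):
    """ops: list of (name, output_size, inputs). Returns (peak_size, peak_time)."""
    last_use = {name: t for t, (name, _, _) in enumerate(ops)}
    for t, (_, _, inputs) in enumerate(ops):
        for src in inputs:
            last_use[src] = t  # t only increases, so this ends at the last reader
    usage = [sum(size for s, (name, size, _) in enumerate(ops) if s <= t <= last_use[name])
             for t in range(len(ops))]
    peak = max(usage)
    return peak, usage.index(peak)

def move_after_peak(ops, target, peak_time):
    """Reinsert `target` (which runs before peak_time) after the peak, before its first reader."""
    old = [name for name, _, _ in ops].index(target)
    first_reader = min(t for t, (_, _, inputs) in enumerate(ops) if target in inputs)
    moved = ops[:old] + ops[old + 1:]       # removing the target shifts later indices by one
    new = min(peak_time, first_reader - 1)  # index in the shortened list
    moved.insert(new, ops[old])
    return moved

first = [("a", 4, []), ("b", 2, []), ("c", 6, ["b"]), ("d", 1, ["a", "c"])]
peak, peak_time = peak_memory(first)           # peak 12 at step 2
second = move_after_peak(first, "a", peak_time)
new_peak, _ = peak_memory(second)              # peak drops to 11
```

In this example the output of `a` no longer sits in memory while `b` and `c` are both alive, so the peak of the second graph is lower than that of the first.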

In the embodiment of the present disclosure, by moving the target operator in the second computation graph to after the occurrence time of the video memory peak, a third computation graph with a reduced video memory peak can be generated, which makes it convenient to search the at least one third computation graph for a target computation graph with optimized memory overhead.

In some embodiments, the second computation graphs are sorted by scoring the video memory space overhead of each second computation graph, and the target computation graph is searched for according to the sorted queue; that is, the above step S303 can be realized through the steps shown in FIG. 4A. FIG. 4A is a schematic flow diagram of another implementation of the video memory optimization method provided by the embodiment of the present disclosure. The following description is made in conjunction with the steps shown in FIGS. 3A and 4A:

Step S401, acquiring a preset video memory overhead and a preset trade-off ratio.

In some embodiments, the preset video memory overhead is a set video memory application amount, that is, a preset video memory space size. The preset trade-off ratio is a parameter set in advance to weigh the proportion of time and space in the score, and the ratio is less than 1. The preset trade-off ratio is used to weigh the proportion between the running time of the calculation graph and the required video memory.

Step S402, based on the preset video memory overhead and the preset trade-off ratio, score the video memory peak value and running time of each second computing graph together with the running time of the corresponding first computing graph, and obtain the scoring result of each second computation graph.

In some embodiments, the preset video memory overhead and the preset trade-off ratio are combined with the video memory peak and running time of the second computation graph, and the running time of the original computation graph of the second computation graph (that is, the first computation graph from which the second computation graph was generated) is also taken into account, to evaluate the memory overhead and running time of the second computation graph and thereby obtain its scoring result. In this way, by comprehensively considering the memory overhead and running time of each second calculation graph, each second calculation graph is scored, and the scoring result of each second calculation graph is obtained. The scoring result may be a score, and a larger score indicates better overall performance of the second computation graph in terms of memory overhead and running time.

In some possible implementations, the scoring result of the second computing graph is obtained by comprehensively evaluating the running time and the video memory peak of the second computing graph; that is, the above step S402 can be realized through the following steps S421 to S423 (not shown in the figure):

Step S421, based on the video memory peak of each second computing graph, the preset video memory overhead and the preset trade-off ratio, determine the video memory score of each second computing graph.

In some embodiments, for any second computation graph, the video memory peak of the second computation graph is subtracted from the preset video memory overhead to obtain a difference. The difference can be a positive number or a negative number: for example, if the video memory peak of the second computation graph is greater than the preset video memory overhead, the difference is negative; if the video memory peak of the second computation graph is smaller than the preset video memory overhead, the difference is positive. The video memory score of the second computation graph can be obtained by multiplying the difference by the preset trade-off ratio.

Step S422, based on the preset trade-off ratio, the running time of each second computing graph, and the corresponding running time of the first computing graph, determine the running time score of each second computing graph.

In some embodiments, a preset standard parameter (for example, set to 1) minus the preset trade-off ratio is used as the running time evaluation ratio. The running time of the second calculation graph is subtracted from the running time of the corresponding first calculation graph to obtain a time difference (generally, the time difference is a positive number). The running time score of the second calculation graph can be obtained by multiplying the running time evaluation ratio by the time difference.

Step S423, based on the video memory score and the running time score of each second calculation graph, determine the scoring result of each second calculation graph.

In some embodiments, for any second calculation graph, the score of the second calculation graph can be obtained by adding its video memory score and its running time score, so that a comprehensive score of video memory and running time is obtained for each second calculation graph in the at least one second calculation graph. In this way, in the stage of scoring the second computing graphs, the running time and the video memory peak of each second computing graph are comprehensively considered, so that the final computing graph can optimize the video memory space without sacrificing a lot of time cost.
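Steps S421 to S423 can be written out as a few lines of Python. This is a sketch of one reading of the text: the memory difference is taken as preset overhead minus peak, the time difference as first-graph running time minus second-graph running time, and the two quantities are added without any unit normalization, since the text does not describe one. The function and parameter names are hypothetical.

```python
def memory_score(peak, budget, ratio):
    """Step S421: (preset video memory overhead - peak) weighted by the trade-off ratio."""
    return ratio * (budget - peak)

def time_score(t_second, t_first, ratio, standard=1.0):
    """Step S422: (standard - ratio) times the running time difference."""
    return (standard - ratio) * (t_first - t_second)

def score(peak, t_second, t_first, budget, ratio):
    """Step S423: sum of the video memory score and the running time score."""
    return memory_score(peak, budget, ratio) + time_score(t_second, t_first, ratio)

# Illustrative numbers only (arbitrary units).
under_budget = score(peak=11, t_second=10.0, t_first=10.0, budget=12, ratio=0.5)
over_budget = score(peak=14, t_second=12.0, t_first=10.0, budget=12, ratio=0.25)
```

With these conventions a graph that stays under the budget and does not slow down scores higher than one that exceeds the budget and runs longer, which matches the statement that a larger score indicates better overall performance.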

Step S403, sort the second computation graphs in the at least one second computation graph based on the scoring result of each second computation graph, to obtain a sorted queue.

In some embodiments, the second calculation graphs are sorted according to the score of each second calculation graph in the at least one second calculation graph; for example, the second calculation graphs are sorted by scoring result from large to small to obtain the sorted queue, or from small to large to obtain the sorted queue. As shown in FIG. 3B, if the score of the second calculation graph 32 is greater than the score of the second calculation graph 33, the two second calculation graphs are arranged as shown in FIG. 3B, with the second calculation graph 32 in front and the second calculation graph 33 behind.

Step S404, based on the sorted queue, determine the target computation graph in the at least one second computation graph.

In some embodiments, since the sorted queue is arranged based on the scores of the second computing graphs, the search follows the order of the queue: first, check whether the video memory overhead of the second computing graph with the highest score satisfies the preset video memory overhead. If it does, that second computing graph is used as the target computing graph. If not, continue to analyze whether target running data exists in that second computing graph, so as to generate an optimization scheme for it; the second computing graph is then adjusted based on the optimization scheme to generate a third computing graph, the third computing graph is placed in the sorted queue according to its scoring result, and the search continues by checking whether the video memory overhead of the highest-scoring computing graph in the updated queue meets the preset video memory overhead. Finally, when the number of searches reaches the upper limit and no computing graph whose video memory overhead meets the preset video memory overhead has been found, the computing graph with the highest score in the current sorted queue is used as the target computing graph. In this way, by comprehensively considering the running time and the video memory peak of the second computing graphs, the second computing graphs are scored and arranged in a priority queue according to the scoring results, so that the target computing graph with the optimal memory cost can be found through a queue search.

In some possible implementations, the video memory space required by the second computing graphs is checked in the order in which the second computing graphs are arranged in the queue, so as to find a target computing graph whose video memory overhead meets the preset video memory overhead; that is, step S404 can be realized in a number of ways:

Method 1: Check the video memory space of the second calculation graph with the best scoring result in the queue to determine whether its video memory overhead meets the preset video memory overhead, including the following steps S441 and S442 (not shown in the figure):

Step S441, in the sorted queue, search for the first candidate computation graph with the best scoring result.

In some embodiments, if the queue is sorted from largest to smallest based on the scores of the second computation graph, then the element arranged at the head of the queue is the first candidate computation graph with the best scoring result. By running the first candidate computation graph with the best scoring result, the video memory space required by the candidate computation graph can be determined.

Step S442, in response to the found video memory space required by the first candidate computation graph meeting the preset video memory overhead, determine the first candidate computation graph as the target computation graph.

In some embodiments, after the video memory space required by the first candidate computation graph is determined, it is judged whether the video memory space satisfies the preset video memory overhead, that is, whether the first candidate computation graph can be run normally within the preset video memory overhead. If the video memory space required by the first candidate computation graph satisfies the preset video memory overhead, the video memory space required by the first candidate computation graph is within the preset video memory overhead range; that is, the video memory space corresponding to the first candidate computation graph is sufficient to complete the training of the preset network model. The first candidate computation graph is then used as the target computation graph. In this way, based on the queue order, first checking whether the video memory space required by the first candidate computing graph with the best scoring result satisfies the preset video memory overhead not only reduces the number of searches as much as possible, but also makes the video memory overhead of the target computing graph found smaller.

Method 2: When the video memory overhead of the second computing graph with the best scoring result does not meet the preset video memory overhead, update the queue by analyzing the optimization schemes of that second computing graph, continue to search the updated queue for the calculation graph with the best scoring result, and judge whether the video memory overhead of that calculation graph satisfies the preset video memory overhead, including the following steps S443 to S446 (not shown in the figure):

Step S443, in response to the fact that the video memory space required by the first candidate computation graph does not meet the preset video memory overhead, adjust the first candidate computation graph based on the target operating data of the first candidate computation graph to obtain at least one third computation graph.

In some embodiments, if the video memory space required by the first candidate computation graph does not meet the preset video memory overhead, the first candidate computation graph still needs to be further optimized. Based on this, the target operator corresponding to the target operation data is analyzed in the first candidate computation graph, so that an optimized third computation graph is generated by moving the target operator in the first candidate computation graph. If the first candidate computation graph is the second computation graph 32 in FIG. 3B, analyzing the second computation graph 32 determines that it includes three target operators; each target operator is moved in the second computation graph 32 respectively to generate three third calculation graphs which, as shown in FIG. 4B, are the third calculation graphs 41, 42 and 43.

Step S444, updating the sorted queue based on the scoring result of the at least one third computation graph, to obtain an updated sorted queue.

In some embodiments, after a plurality of third computation graphs are generated based on the first candidate computation graph, the first candidate computation graph is first popped from the sorted queue; then the scoring result of each third computation graph is determined in the manner of step S401 and step S402. Finally, according to the scoring result of each third calculation graph and the scoring result of each second calculation graph remaining in the queue, the at least one third calculation graph is inserted into the queue to obtain an updated sorted queue. As shown in FIG. 4B, since the second calculation graph 32 has been popped, only the second calculation graph 33 is left in the current queue. If, among the scoring results of the third calculation graphs 41, 42 and 43, the score of the third calculation graph 41 is greater than that of the second calculation graph 33, the scores of the third calculation graphs 42 and 43 are both smaller than that of the second calculation graph 33, and the score of the third calculation graph 42 is greater than that of the third calculation graph 43, then the updated sorted queue is as shown in FIG. 4B, in descending order of scores: the third calculation graph 41, the second calculation graph 33, the third calculation graph 42 and the third calculation graph 43.
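The queue update of step S444 maps naturally onto a priority queue. The sketch below uses Python's standard `heapq` module; the graph names mirror the reference numerals of FIGS. 3B and 4B, but the scores are invented for illustration and chosen only to reproduce the ordering described above.

```python
import heapq

def push(queue, name, score):
    """Insert a graph; heapq is a min-heap, so the score is negated to pop the best first."""
    heapq.heappush(queue, (-score, name))

queue = []
push(queue, "graph_32", 0.9)  # hypothetical scores
push(queue, "graph_33", 0.6)

first_candidate = heapq.heappop(queue)[1]  # pop the best-scoring graph (graph_32)
for name, s in [("graph_41", 0.8), ("graph_42", 0.5), ("graph_43", 0.3)]:
    push(queue, name, s)                   # insert the third graphs by score

order = [name for _, name in sorted(queue)]  # queue contents, best score first
```

A heap keeps each insertion at logarithmic cost, which is one reasonable way to maintain the sorted queue as third computation graphs are added; the disclosure itself does not prescribe a particular data structure.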

Step S445, in the updated queue, search whether the video memory space required by the second candidate computation graph with the best scoring result satisfies the preset video memory overhead.

In some embodiments, in the updated queue, after the video memory space required by the second candidate computation graph is determined, it is judged whether the video memory space meets the preset video memory overhead. If the video memory space required by the second candidate computation graph does not meet the preset video memory overhead (for example, the video memory space required by the second candidate computation graph is greater than the preset video memory overhead), a new computation graph continues to be generated based on the optimization scheme of the second candidate computation graph, and the updated queue is updated again according to the scoring result of the new computation graph.

Step S446, in response to the video memory space required by the second candidate computation graph meeting the preset video memory overhead, determine the second candidate computation graph as the target computation graph. In this way, based on the queue order, when the video memory space required by the calculation graph with the best scoring result does not meet the preset video memory overhead, the calculation graph continues to be optimized, and the calculation graph with the best score in the latest queue is checked to see whether its video memory space meets the preset video memory overhead, so that after multiple searches, the video memory overhead of the target calculation graph found is better.

Method 3: When the video memory overhead of the calculation graph with the best scoring result does not meet the preset video memory overhead, and the number of searches for the calculation graph reaches the set number threshold, the calculation graph with the best scoring result in the latest queue is used as the target calculation graph, including the following step S447 (not shown in the figure):

Step S447, in response to the fact that the video memory space required by the second candidate computation graph does not meet the preset video memory overhead and the number of searches reaches the preset number threshold, determine the computation graph with the best scoring result in the queue corresponding to the last search as the target computation graph.

In some embodiments, the preset number threshold may be set based on the number of computation graphs in the sorted queue; for example, the preset number threshold is set to less than half of the number of computation graphs in the queue. If the video memory space required by the second candidate computation graph does not meet the preset video memory overhead, and, after the queue is updated based on this second candidate computation graph, the video memory space required by the best-scoring computation graph found in the next search still does not meet the preset video memory overhead, then when the number of searches reaches the preset number threshold, the computation graph with the best score in the latest queue is determined as the target computation graph. In this way, when no computation graph meeting the preset video memory overhead can be found, the computation graph with the best score in the latest queue is used as the target computation graph, so that the target computation graph found is the computation graph with the optimal space cost.

In the embodiment of the present disclosure, by comprehensively considering the running time and the video memory peak of the second calculation graphs, the multiple second calculation graphs are sorted, and the target calculation graph is searched for according to the sorted queue, so that the target calculation graph found is the calculation graph with the optimal space overhead.
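Methods 1 to 3 can be combined into a single best-first search loop. The following is a minimal sketch, not a definitive implementation: `peak`, `score` and `expand` are hypothetical callbacks standing in for running a graph, scoring it (steps S401 and S402) and generating further-optimized graphs from it (step S443), and the lookup tables contain invented values.

```python
import heapq
from itertools import count

def search_target(second_graphs, budget, peak, score, expand, max_searches):
    """Return the first best-scoring graph within budget, else the best graph left in the queue."""
    tie = count()  # tie-breaker so graph objects themselves are never compared
    queue = [(-score(g), next(tie), g) for g in second_graphs]
    heapq.heapify(queue)
    best = None
    for _ in range(max_searches):
        if not queue:
            break
        best = queue[0][2]
        if peak(best) <= budget:   # Methods 1 and 2: the candidate meets the preset overhead
            return best
        heapq.heappop(queue)       # otherwise pop it and insert its optimized variants
        for child in expand(best):
            heapq.heappush(queue, (-score(child), next(tie), child))
    return queue[0][2] if queue else best  # Method 3: search limit reached

# Invented values for illustration.
PEAK = {"g32": 14, "g33": 15, "g41": 11, "g42": 13, "g43": 13}
SCORE = {"g32": 0.9, "g33": 0.6, "g41": 0.8, "g42": 0.5, "g43": 0.3}
CHILDREN = {"g32": ["g41", "g42", "g43"]}

def children(g):
    return CHILDREN.get(g, [])

found = search_target(["g32", "g33"], 12, PEAK.get, SCORE.get, children, max_searches=5)
fallback = search_target(["g32", "g33"], 5, PEAK.get, SCORE.get, children, max_searches=2)
```

With a budget of 12 the search expands `g32` and then accepts `g41`; with a budget of 5 nothing fits, so after two searches the best-scoring graph remaining in the queue (`g33`) is returned.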

In some embodiments, before training the preset network model, the video memory space and running time required by the preset network model can be determined by determining the target calculation graph of the preset network model; that is, the above step S105 can be realized through the following process:

The video memory space required by the target computation graph is determined as the video memory space required for training the preset network model.

In the embodiment of the present disclosure, when training the network model in an actual scenario, the video memory space and the running time required by the target computing graph are obtained by running the target computing graph. The video memory space and the running time required by the target computing graph are used as the estimated video memory space and running time required for training the preset network model, so that the video memory space required for training the network model can be optimized without sacrificing the running time required for training the network model.

In the following, an exemplary application of an embodiment of the present disclosure in an actual application scenario will be described, taking as an example the optimization of the video memory occupied by a large-scale deep neural network based on the calculation graph of that network.

In some embodiments, when performing ImageNet training of the residual network ResNeSt269 (containing 100 million network parameters), the video memory usage already approaches the upper limit of a V100 with 32 gigabytes (GB) of video memory, with training occupying 28 GB. When the model is further enlarged, or the batch size is increased, the video memory usage of model training increases accordingly, until the video memory occupied exceeds the video memory capacity of the graphics card and hits the video memory wall, making the model impossible to train. Figure 5 is a schematic diagram of the video memory usage curve of a calculation graph provided by the embodiment of the present disclosure, where the abscissa indicates the execution sequence of each operator in the calculation graph, and the ordinate indicates the size of the video memory space occupied during operator execution. Curve 501 represents the amount of memory requested, and curve 502 represents the amount of memory cached at different moments during the execution of a task in the calculation graph. It can be seen from curve 502 that the calculation graph reaches the peak point 503 at the end of the feed-forward phase. The peak point 503 exceeds the peak value of curve 501, that is, the video memory occupied by the calculation graph is higher than the video memory capacity of the graphics card, so that model training cannot continue. In this case, memory optimization is particularly important. Video memory optimization based on calculation graph analysis is one of many video memory optimization methods. In related technologies, calculation graph analysis and optimization methods often simply move operators to achieve an initial reduction in video memory usage. Usually this kind of method is only used as a preliminary video memory optimization, with the focus placed on subsequent, further video memory optimization methods. This type of method can only save a small amount of video memory, and lacks a complete optimization system.

Based on this, an embodiment of the present disclosure provides a video memory optimization method. First, a calculation graph is generated for a network model through a conventional deep learning framework (TensorFlow, PyTorch, etc.) or the Parrots framework. Then, through the calculation graph, the large training task can be broken down into individual operators (Task); each operator consumes existing data (operator input) and generates new data (operator output). The calculation graph also shows the space occupied by each operand related to an operator and the time required for the operator to perform its calculation, so that the memory usage can be optimized by analyzing the calculation graph. In this way, with the goal of reducing the peak memory usage, and taking into account the peak shift that occurs when operators are moved, the optimal calculation graph can be obtained. In addition, when evaluating the quality of a calculation graph, an evaluation function is proposed so that video memory usage is not the only factor considered and the corresponding time consumption is also taken into account.

The implementation process of the video memory optimization method provided by the embodiment of the present disclosure is shown in Figure 6, which is a schematic flow diagram of the method. The following description is made in conjunction with the steps shown in Figure 6:

Step S601, read the calculation graph information from the JSON file.

In some embodiments, a machine learning framework (TensorFlow, PyTorch, or Parrots) is first used to generate a calculation graph in JSON format for the model that needs to be trained, and the graph is stored in a JSON file; then, the video memory optimization process is started and the calculation graph in JSON format is read. In the memory optimization process, a calculation graph object (schedule) is defined, and each calculation graph object represents a different calculation graph in the memory optimization process.

Step S602, based on the read calculation graph information, generate a corresponding calculation graph object.

In some embodiments, a new computation graph object is generated, the information read from JSON is obtained, the operator queue is arranged in order, and the input and output of the operator are matched with the operands in sequence.
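Steps S601 and S602 can be illustrated with a minimal Python sketch. The source does not disclose the JSON schema, so the field names used here (`operands`, `operators`, `name`, `time_ms`, `inputs`, `outputs`) and the class names `Operator` and `Schedule` are hypothetical, chosen only to show how an operator queue can be read in order and matched against its operands.

```python
import json
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    time_ms: float
    inputs: list
    outputs: list


@dataclass
class Schedule:
    operators: list      # operators in execution order
    operand_bytes: dict  # operand name -> size in bytes


def load_schedule(text):
    """Build a calculation graph object from JSON text (hypothetical schema)."""
    raw = json.loads(text)
    sizes = {name: int(size) for name, size in raw["operands"].items()}
    operators = []
    for entry in raw["operators"]:  # JSON array order is taken as execution order
        op = Operator(entry["name"], float(entry["time_ms"]),
                      list(entry["inputs"]), list(entry["outputs"]))
        # Match every operator input/output against the operand table.
        for operand in op.inputs + op.outputs:
            if operand not in sizes:
                raise KeyError(f"operator {op.name!r} uses unknown operand {operand!r}")
        operators.append(op)
    return Schedule(operators, sizes)


EXAMPLE = """
{"operands": {"x": 1024, "h": 4096, "y": 1024},
 "operators": [
   {"name": "conv", "time_ms": 1.5, "inputs": ["x"], "outputs": ["h"]},
   {"name": "relu", "time_ms": 0.5, "inputs": ["h"], "outputs": ["y"]}
 ]}
"""
schedule = load_schedule(EXAMPLE)
print([op.name for op in schedule.operators])
```

Validating operand names at load time is a design choice of this sketch; it surfaces a malformed graph before any analysis is attempted.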

Step S603, analyzing whether the current calculation graph object meets the memory space cost.

In some embodiments, analyzing the maximum space overhead of the current calculation graph object and the calculation time spent includes:

(1) By accumulating the calculation time of all operators in the calculation graph object, the total calculation time is obtained.

(2) Obtain the topology structure and memory overhead of the calculation graph by traversing all operators and their inputs and outputs. If the computation graph object already meets the space cost, go to step S604; no optimization is needed. If the computation graph object does not satisfy the space cost, go to step S605.
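The two analysis passes above can be sketched in Python as follows. The memory model is an assumption made for illustration, not something the source specifies: graph inputs are live from the start, each output becomes live when its operator runs, and an operand is released after the last operator that touches it. The function name `analyze` and the tuple layout are likewise hypothetical.

```python
def analyze(ops, sizes):
    """ops: list of (name, time_ms, inputs, outputs) in execution order.

    Returns (total_time, peak_bytes, peak_step) under a simple liveness model.
    """
    # (1) Total calculation time is the sum of all operator times.
    total_time = sum(time_ms for _, time_ms, _, _ in ops)

    # (2) Traverse operators and their inputs/outputs to track live operands.
    last_use = {}
    for step, (_, _, ins, outs) in enumerate(ops):
        for operand in list(ins) + list(outs):
            last_use[operand] = step
    produced = {operand for _, _, _, outs in ops for operand in outs}
    live = {operand for operand in last_use if operand not in produced}

    peak_bytes, peak_step = 0, -1
    for step, (_, _, _, outs) in enumerate(ops):
        live |= set(outs)
        current = sum(sizes[operand] for operand in live)
        if current > peak_bytes:
            peak_bytes, peak_step = current, step
        # Release operands that no later operator touches.
        live = {operand for operand in live if last_use[operand] > step}
    return total_time, peak_bytes, peak_step


OPS = [
    ("a", 1.0, ["x"], ["t1"]),
    ("b", 2.0, ["x"], ["t2"]),
    ("c", 1.5, ["t2"], ["t3"]),
    ("d", 1.0, ["t1", "t3"], ["y"]),
]
SIZES = {"x": 1, "t1": 4, "t2": 2, "t3": 2, "y": 1}

total_time, peak_bytes, peak_step = analyze(OPS, SIZES)
meets_budget = peak_bytes <= 7  # 7 is a hypothetical space budget
print(total_time, peak_bytes, peak_step, meets_budget)
```

In this toy graph the peak occurs while operator `c` runs, because `t1` is still held for the later operator `d`.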

Step S605, based on the topological structure of the computation graph object, search for a position that can be optimized, and add the new computation graph object generated based on the position into the priority queue.

In some embodiments, the implementation process of finding a position that can be optimized is as follows: first, find the time point at which the memory usage reaches its peak; then, using the peak as the dividing point, find the data that was generated before the peak but is used after the peak. The position of this data is a position that can be optimized.

In some possible implementation manners, the operator that generates such data may be moved to after the peak in order to reduce the peak. All operators satisfying this condition are regarded as optimization schemes for the current computation graph object. A series of new computation graph objects is generated by applying the optimization schemes of the current computation graph object, and the generated new computation graph objects are added as elements to the priority queue.
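A minimal Python sketch of this candidate generation follows. It assumes the same simple liveness model as an illustration (outputs live from their producing operator until their last use) and only moves an operator whose outputs are all first consumed after the peak, so the reordered queue stays valid. The names `peak_profile` and `candidates` are hypothetical.

```python
def peak_profile(ops, sizes):
    """ops: list of (name, inputs, outputs). Returns (peak_bytes, peak_step)."""
    last_use = {}
    for step, (_, ins, outs) in enumerate(ops):
        for operand in list(ins) + list(outs):
            last_use[operand] = step
    produced = {operand for _, _, outs in ops for operand in outs}
    live = {operand for operand in last_use if operand not in produced}
    peak_bytes, peak_step = 0, -1
    for step, (_, _, outs) in enumerate(ops):
        live |= set(outs)
        current = sum(sizes[operand] for operand in live)
        if current > peak_bytes:
            peak_bytes, peak_step = current, step
        live = {operand for operand in live if last_use[operand] > step}
    return peak_bytes, peak_step


def candidates(ops, sizes):
    """Return new operator orders, each moving one operator to just after the peak."""
    _, peak_step = peak_profile(ops, sizes)
    result = []
    for i in range(peak_step):  # operators that run strictly before the peak
        _, _, outs = ops[i]
        consumers = [j for j, (_, ins, _) in enumerate(ops)
                     if any(operand in ins for operand in outs)]
        # Keep only data generated before the peak and first used after it.
        if not consumers or min(consumers) <= peak_step:
            continue
        moved = ops[:i] + ops[i + 1:peak_step + 1] + [ops[i]] + ops[peak_step + 1:]
        result.append(moved)
    return result


OPS = [
    ("a", ["x"], ["t1"]),
    ("b", ["x"], ["t2"]),
    ("c", ["t2"], ["t3"]),
    ("d", ["t1", "t3"], ["y"]),
]
SIZES = {"x": 1, "t1": 4, "t2": 2, "t3": 2, "y": 1}

new_orders = candidates(OPS, SIZES)
for order in new_orders:
    # Re-analyze every candidate, since moving an operator can shift the peak.
    print([name for name, _, _ in order], peak_profile(order, SIZES))
```

Here only operator `a` qualifies: its output `t1` is produced before the peak and first used by `d` after it, and delaying `a` lowers the peak in this toy graph.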

In some possible implementations, the queue is formed based on the scores of new computation graph objects, and the following formula is used to determine the score Score of each computation graph:

Score=MEMORY_FACTOR*(peak_memory-limit)/limit+(1-MEMORY_FACTOR)*(total_time-origin_time)/origin_time;

Among them, peak_memory indicates the peak memory usage of the calculation graph, limit indicates the memory overhead budget we set, total_time indicates the execution time corresponding to the calculation graph, and origin_time indicates the execution time corresponding to the initial calculation graph. MEMORY_FACTOR is a parameter that weighs the proportion of time and space in the score.
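The formula can be written directly as a small Python function. The default `memory_factor=0.5` and the 2.0 GiB budget in the usage lines are hypothetical values chosen for illustration; the sketch also assumes that a lower score is better, since both terms measure relative overshoot of the memory budget and of the original running time.

```python
def score(peak_memory, limit, total_time, origin_time, memory_factor=0.5):
    """Weighted sum of relative memory overshoot and relative time overhead."""
    if limit <= 0 or origin_time <= 0:
        raise ValueError("limit and origin_time must be positive")
    if not 0.0 <= memory_factor <= 1.0:
        raise ValueError("memory_factor must lie in [0, 1]")
    memory_term = (peak_memory - limit) / limit
    time_term = (total_time - origin_time) / origin_time
    return memory_factor * memory_term + (1.0 - memory_factor) * time_term


# Hypothetical budget of 2.0 GiB; times in milliseconds.
initial = score(peak_memory=3.38, limit=2.0, total_time=129.03, origin_time=129.03)
optimized = score(peak_memory=1.72, limit=2.0, total_time=136.09, origin_time=129.03)
print(initial, optimized)
```

Normalizing both terms by `limit` and `origin_time` makes them dimensionless, so memory and time can be traded off with a single weight.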

Step S606, integrating the first element of the priority queue into a new computation graph object, and judging whether the new computation graph object meets the space overhead.

In some embodiments, the first element of the priority queue (that is, the computation graph object with the best score) is popped out, and it is judged whether it satisfies the space cost, and if so, proceeds to step S607. If not, return to step S605 to continue the next search. If the searched computation graph object still does not meet the space cost and the number of searches reaches the preset upper limit, go to step S608.

Step S607, take the computation graph object satisfying the condition as an output and save it as a JSON file.

Step S608, the search is terminated, and the first element of the current priority queue is used as the optimal calculation graph.

The above steps S601 to S608 provide a priority-queue-based backtracking search strategy for finding the optimal calculation graph. First, taking the moment when the video memory reaches its peak as the dividing point, search for data that was generated before the peak but is used after the peak, and take the position of the operator corresponding to such data as a position that can be optimized. Secondly, calculate the score of each calculation graph optimized at such a position, and add it to the priority queue for searching according to its score. At the same time, peak shifting is taken into account: every time a new calculation graph is generated with an optimization scheme, its peak video memory usage is analyzed again. Finally, if no calculation graph that meets the requirements is found, the calculation graph with the best score found so far is returned.
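The overall search loop can be sketched in Python with the standard `heapq` module. This is one possible reading of steps S603 to S608, not the definitive procedure: the callables `peak_of`, `score_of` and `expand` are hypothetical hooks, a lower score is assumed to be better, and when the search budget is exhausted the sketch returns the graph popped in the final search (the best-scoring element of the queue at that time).

```python
import heapq
import itertools


def search(initial, limit, max_searches, peak_of, score_of, expand):
    """Return (graph, satisfied) after at most max_searches queue pops."""
    if peak_of(initial) <= limit:          # S603/S604: already within budget
        return initial, True
    tie = itertools.count()                # tie-breaker so graphs are never compared
    queue = []
    current = initial
    for _ in range(max_searches):
        for cand in expand(current):       # S605: enqueue new candidate graphs
            heapq.heappush(queue, (score_of(cand), next(tie), cand))
        if not queue:
            return current, False          # nothing left to try
        _, _, current = heapq.heappop(queue)   # S606: pop the best score
        if peak_of(current) <= limit:
            return current, True           # S607: budget satisfied
    return current, False                  # S608: search budget exhausted


# Toy model: a "graph" is just its peak memory, and each move frees 1 or 2 units.
def toy_expand(graph):
    return [graph - 1, graph - 2] if graph > 0 else []


found = search(10, limit=3, max_searches=10,
               peak_of=lambda g: g, score_of=lambda g: g, expand=toy_expand)
gave_up = search(10, limit=3, max_searches=2,
                 peak_of=lambda g: g, score_of=lambda g: g, expand=toy_expand)
print(found, gave_up)
```

Because unexpanded candidates stay in the heap, the search can backtrack to an earlier branch whenever a newer candidate scores worse.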

In the embodiment of the present disclosure, when a user trains a model in practice, the calculation graph of the model may be analyzed and optimized using the video memory optimization method provided by the embodiment of the present disclosure. Based on the calculation graph, the user can obtain a general estimate of the video memory space and time the model requires. In the embodiment of the present disclosure, the calculation graph optimization is completed before training starts; combined with further subsequent optimizations, the overall video memory saving can be considerable.

In a specific example, for the given calculation graph sample (pattern.json), the initial calculation graph has a peak memory usage of 3.38 gibibytes (GiB) and takes 129.03 milliseconds (ms), while the optimal calculation graph has a peak memory usage of 1.72 GiB and takes 136.09 ms. In this case, the video memory optimization rate reaches 49%. As the training model grows and the batch size increases, the optimization space of the calculation graph also increases, and the video memory optimization method provided by the embodiments of the present disclosure becomes more effective. In this way, the memory occupied by large-scale deep learning can be greatly reduced, and the cost of large-scale training can be greatly reduced in terms of space overhead. Even if a calculation graph that meets the set space overhead cannot be found, this method can still give the calculation graph with the best comprehensive score under the current conditions for user reference. In addition, calculation time cost is taken into account in the calculation graph scoring process, so that the final calculation graph does not sacrifice a large amount of time to save a small amount of video memory.
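The figures quoted for pattern.json can be checked with a few lines of Python. The variable names are illustrative; the four input values are taken from the example above.

```python
initial_peak_gib, optimized_peak_gib = 3.38, 1.72
initial_ms, optimized_ms = 129.03, 136.09

# Relative reduction in peak video memory.
memory_saving = (initial_peak_gib - optimized_peak_gib) / initial_peak_gib
# Relative increase in running time paid for that reduction.
time_overhead = (optimized_ms - initial_ms) / initial_ms

print(f"memory saving: {memory_saving:.1%}")   # memory saving: 49.1%
print(f"time overhead: {time_overhead:.1%}")   # time overhead: 5.5%
```

The arithmetic shows the trade-off in this sample: roughly half of the peak video memory is saved for a running-time increase of about 5.5%.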

Those skilled in the art can understand that, in the above method of the specific implementation, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process. The specific execution order of each step should be determined by its function and possible internal logic.

Based on the same inventive concept, the embodiment of the present disclosure also provides a video memory optimization device corresponding to the video memory optimization method. Since the problem-solving principle of the device in the embodiment of the present disclosure is similar to that of the above-mentioned video memory optimization method, the implementation of the device may refer to the implementation of the method.

An embodiment of the present disclosure provides a video memory optimization device. FIG. 7 is a schematic diagram of the structural composition of the video memory optimization device according to an embodiment of the present disclosure. As shown in FIG. 7 , the video memory optimization device 700 includes:

The first generation part 701 is configured to generate a first calculation graph based on a preset network model;

The first determination part 702 is configured to determine the correlation between the peak value of the video memory of the first calculation graph and the running data;

The second generation part 703 is configured to adjust the first calculation graph based on the association relationship, and generate at least one second calculation graph;

The second determining part 704 is configured to determine a target calculation graph in the at least one second calculation graph based on the peak value of the video memory and the running time of the at least one second calculation graph;

The third determining part 705 is configured to determine the video memory space required by the preset network model based on the target computation graph.

In some embodiments, the first generating part 701 includes:

The first generation subpart is configured to generate calculation graph information in a data exchange format based on the preset network model;

The second generation subpart is configured to generate a first computation graph matched by the computation graph information based on the operator queue in the computation graph information.

In some embodiments, the first determining part 702 includes:

The first determination subpart is configured to determine the occurrence moment of the video memory peak in the first calculation graph;

The second determination subpart is configured to determine the operation data of the operator in the first calculation graph;

The third determination subpart is configured to determine the generation time of the operation data and the application time of the operation data in the first calculation graph, and to determine the timing relationship between the generation time and the application time, on the one hand, and the occurrence moment of the video memory peak, on the other, as the association relationship.

In some embodiments, the second generating part 703 includes:

The fourth determination subpart is configured to determine, in the first calculation graph, the target operating data whose association relationship satisfies a preset condition;

The first adjustment subpart is configured to adjust the first calculation graph based on the target operation data to generate the at least one second calculation graph.

In some embodiments, the fourth determining subsection includes:

The first determining unit is configured to determine, among the operation data of the operators in the first calculation graph, the operation data whose generation time is before the occurrence time of the video memory peak and whose application time is after the occurrence time of the video memory peak as the target operation data satisfying the preset condition.

In some embodiments, the first adjustment subsection includes:

The second determination unit is configured to determine a target operator corresponding to the target operation data in the first calculation graph;

The first adjustment unit is configured to adjust the target operator in the first computation graph based on the occurrence time of the video memory peak in the first computation graph to generate the at least one second computation graph.

In some embodiments, the first adjustment unit is further configured to:

In some embodiments, the second determining part 704 includes:

The first acquisition subpart is configured to acquire a preset video memory overhead and a preset trade-off ratio; wherein the preset trade-off ratio is used to weigh the ratio between the running time of the calculation graph and the required video memory;

The first scoring subpart is configured to score the video memory peak value and running time of each second computing graph, together with the running time of the corresponding first computing graph, based on the preset video memory overhead and the preset trade-off ratio, to obtain the scoring result of each of the second calculation graphs;

The first sorting subpart is configured to sort the second calculation graphs in the at least one second calculation graph based on the scoring results of each second calculation graph to obtain a sorting queue;

The fifth determining subpart is configured to determine the target computation graph in the at least one second computation graph based on the sort queue.

In some embodiments, the first scoring subsection includes:

The third determining unit is configured to determine the video memory score of each second computing graph based on the video memory peak value of each second computing graph, the preset video memory overhead, and the preset trade-off ratio;

The first scoring unit is configured to determine the runtime score of each second computing graph based on the preset trade-off ratio, the running time of each second computing graph, and the running time of the corresponding first computing graph;

The fourth determination unit is configured to determine the scoring result of each second calculation graph based on the video memory score and the runtime score of each second calculation graph.

In some embodiments, the fifth determining subsection includes:

The first search unit is configured to search for the first candidate computation graph with the best scoring result in the arrangement queue;

The fifth determining unit is configured to determine the first candidate computation graph as the target computation graph in response to the searched video memory space required by the first candidate computation graph meeting the preset video memory overhead.

In some embodiments, the fifth determining subsection includes:

The second adjustment unit is configured to, in response to the video memory space required by the first candidate computation graph not satisfying the preset video memory overhead, adjust the first candidate computation graph based on the target running data of the first candidate computation graph, to obtain at least one third calculation graph;

The first updating unit is configured to update the permutation queue based on the scoring result of the at least one third computation graph, to obtain an updated permutation queue;

The second search unit is configured to, in the updated queue, search whether the video memory space required by the second candidate computation graph with the best scoring result satisfies the preset video memory overhead;

The sixth determining unit is configured to, in response to the video memory space required by the second candidate computation graph not meeting the preset video memory overhead and the number of searches reaching a preset count threshold, determine the computation graph with the best scoring result in the queue corresponding to the last search as the target computation graph.

In some embodiments, the third determining part 705 is further configured to:

In the embodiments of the present disclosure and other embodiments, a "part" may be a part of a circuit, a part of a processor, a part of a program or software, and so on; it may of course also be a unit, and it may be modular or non-modular.

It should be noted that the description of the above device embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment. For technical details not disclosed in the device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure for understanding.

It should be noted that, in the embodiments of the present disclosure, if the above video memory optimization method is implemented in the form of software function modules and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of the embodiments of the present disclosure, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a terminal, a server, etc.) to execute all or part of the methods described in various embodiments of the present disclosure. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read Only Memory, ROM), a magnetic disk, or an optical disk. As such, the disclosed embodiments are not limited to any specific combination of hardware and software.

The embodiment of the present disclosure further provides a computer program product, the computer program product includes computer-executable instructions, and after the computer-executable instructions are executed, the video memory optimization method provided in the embodiments of the present disclosure can be implemented.

The embodiments of the present disclosure further provide a computer storage medium, where computer executable instructions are stored on the computer storage medium, and when the computer executable instructions are executed by a processor, the video memory optimization method provided in the foregoing embodiments is implemented.

An embodiment of the present disclosure provides a computer device. FIG. 8 is a schematic diagram of the composition and structure of a computer device in an embodiment of the present disclosure. As shown in FIG. 8, the computer device 800 includes: a processor 801, at least one communication bus, a communication interface 802, at least one external communication interface, and a memory 803. The communication interface 802 is configured to realize connection and communication between these components. The communication interface 802 may include a display screen, and the external communication interface may include a standard wired interface and a wireless interface. The processor 801 is configured to execute a video memory optimization program in the memory, so as to implement the video memory optimization method provided in the foregoing embodiments.

The above descriptions of the video memory optimization device, computer equipment, and storage medium embodiments are similar to the descriptions of the above-mentioned method embodiments, and have technical descriptions and beneficial effects similar to those of the corresponding method embodiments, so they are not repeated here. For the technical details not disclosed in the embodiments of the video memory optimization device, computer equipment, and storage medium of the present disclosure, please refer to the description of the method embodiments of the present disclosure for understanding.

The device involved in the embodiments of the present disclosure may be at least one of a system, a method, and a computer program product. A computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.

A computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. A computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. Examples of computer-readable storage media (a non-exhaustive list) include: portable computer disks, hard disks, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM) or flash memory, Static Random-Access Memory (SRAM), Portable Compact Disc Read-Only Memory (CD-ROM), Digital Video Discs (DVDs), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing. As used herein, computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.

The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device over at least one of a network, such as the Internet, a local area network, a wide area network, and a wireless network. The network may include at least one of copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and edge servers. A network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.

Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. Computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it may be connected to an external computer (for example, via the Internet using an Internet Service Provider). In some embodiments, electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be customized by using state information of the computer-readable program instructions; such electronic circuits can execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.

It should be understood that reference throughout the specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that in various embodiments of the present disclosure, the sequence numbers of the above-mentioned processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation process of the embodiments of the present disclosure. The serial numbers of the above-mentioned embodiments of the present disclosure are for description only, and do not represent the advantages and disadvantages of the embodiments. It should be noted that, in this document, the terms "comprising", "including" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus comprising a set of elements includes not only those elements, but also other elements not expressly listed, or elements inherent in the process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods; for example, multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.

The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed to multiple network units; part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may be used as a single unit, or two or more units may be integrated into one unit; the above-mentioned integrated unit can be realized in the form of hardware or in the form of hardware plus software functional units. Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments can be completed by hardware related to program instructions, and the aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the foregoing method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as removable storage devices, read-only memories, magnetic disks or optical disks. Alternatively, if the above-mentioned integrated units of the present disclosure are implemented in the form of software function modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of the embodiments of the present disclosure, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in various embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as removable storage devices, ROMs, magnetic disks or optical disks. The above is only a specific implementation of the present disclosure, but the scope of protection of the present disclosure is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope of the present disclosure should fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure should be determined by the protection scope of the claims.

Industrial Applicability

Embodiments of the present disclosure provide a video memory optimization method, apparatus, device, storage medium, and program product, wherein the method includes: generating a first calculation graph based on a preset network model; determining an association relationship between a video memory peak value of the first calculation graph and operation data; adjusting the first calculation graph based on the association relationship to generate at least one second calculation graph; determining a target calculation graph among the at least one second calculation graph based on the video memory peak value and the running time of the at least one second calculation graph; and determining, based on the target calculation graph, the video memory space required by the preset network model.

Claims

A video memory optimization method, said method comprising:

generating a first calculation graph based on a preset network model;

determining an association relationship between a video memory peak value of the first calculation graph and operation data;

adjusting the first calculation graph based on the association relationship to generate at least one second calculation graph;

determining a target calculation graph among the at least one second calculation graph based on the video memory peak value and the running time of the at least one second calculation graph; and

determining, based on the target calculation graph, the video memory space required by the preset network model.
The method according to claim 1, wherein said generating a first calculation graph based on a preset network model comprises:

generating calculation graph information in a data exchange format based on the preset network model; and

generating, based on an operator queue in the calculation graph information, a first calculation graph matching the calculation graph information.
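As an illustration of the step above, the following minimal sketch rebuilds a graph from an operator queue stored in a data exchange format. It assumes JSON as the format; the field names `name`, `inputs`, and `outputs`, the sample operators, and the function `build_graph` are hypothetical and do not come from the disclosure.

```python
import json

# Hypothetical JSON layout: operators listed in execution order, each naming
# the tensors it consumes and produces. Field names are illustrative only.
GRAPH_JSON = """
[
  {"name": "conv1", "inputs": ["x"],      "outputs": ["a"]},
  {"name": "relu1", "inputs": ["a"],      "outputs": ["b"]},
  {"name": "add1",  "inputs": ["a", "b"], "outputs": ["y"]}
]
"""

def build_graph(graph_json):
    """Return (operator order, edges) reconstructed from the operator queue."""
    operators = json.loads(graph_json)
    producer = {}  # tensor name -> operator that generates it
    edges = []     # (producer operator, consumer operator, tensor name)
    for op in operators:
        for tensor in op["inputs"]:
            if tensor in producer:  # graph inputs such as "x" have no producer
                edges.append((producer[tensor], op["name"], tensor))
        for tensor in op["outputs"]:
            producer[tensor] = op["name"]
    order = [op["name"] for op in operators]
    return order, edges

order, edges = build_graph(GRAPH_JSON)
```

Walking the queue in order is enough here because an operator can only consume tensors that an earlier operator in the queue has already produced.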
The method according to claim 1 or 2, wherein said determining an association relationship between a video memory peak value of the first calculation graph and operation data comprises:

determining the occurrence time of the video memory peak value in the first calculation graph;

determining the operation data of the operators in the first calculation graph;

determining the generation time of the operation data and the application time of the operation data in the first calculation graph; and

determining the timing relationship among the generation time, the application time, and the occurrence time of the video memory peak value as the association relationship.
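The timing analysis in this claim can be sketched as follows. The sketch assumes that time is measured in operator steps, that each piece of operation data occupies memory from the step that generates it until the last step that reads it, and that data sizes are known in advance. The function `analyze_memory` and the sample operators are hypothetical.

```python
def analyze_memory(ops, sizes):
    """ops: list of (name, inputs, outputs) in execution order.
    sizes: bytes occupied by each tensor produced by an operator.
    Returns (peak_step, peak_bytes, lifetimes)."""
    generated, applied = {}, {}
    for step, (_, inputs, outputs) in enumerate(ops):
        for t in inputs:
            if t in generated:
                applied[t] = step  # keep the last step that reads the tensor
        for t in outputs:
            generated[t] = step
    # Assumption: a tensor is live from its generation step to its last use.
    lifetimes = {t: (g, applied.get(t, g)) for t, g in generated.items()}
    usage = [
        sum(sizes[t] for t, (g, a) in lifetimes.items() if g <= step <= a)
        for step in range(len(ops))
    ]
    peak_bytes = max(usage)
    peak_step = usage.index(peak_bytes)  # first step at which the peak occurs
    return peak_step, peak_bytes, lifetimes

ops = [
    ("op0", [], ["a"]),
    ("op1", ["a"], ["b"]),
    ("op2", ["b"], ["c"]),
    ("op3", ["a", "c"], ["d"]),
]
sizes = {"a": 4, "b": 8, "c": 2, "d": 1}
peak_step, peak_bytes, lifetimes = analyze_memory(ops, sizes)
```

In this example the peak occurs at step 2, and `a` is generated at step 0 and last applied at step 3, so its lifetime spans the peak.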
The method according to any one of claims 1 to 3, wherein said adjusting the first calculation graph based on the association relationship to generate at least one second calculation graph comprises:

determining, in the first calculation graph, target operation data whose association relationship satisfies a preset condition; and

adjusting the first calculation graph based on the target operation data to generate the at least one second calculation graph.
The method according to claim 4, wherein said determining, in the first calculation graph, target operation data whose association relationship satisfies a preset condition comprises:

determining, among the operation data of the operators in the first calculation graph, the operation data whose generation time is before the occurrence time of the video memory peak value and whose application time is after the occurrence time of the video memory peak value as the target operation data satisfying the preset condition.
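The preset condition of this claim reduces to a simple filter. The sketch below assumes that generation and application times are available as integer operator steps; the function `select_target_data` and the sample values are hypothetical.

```python
def select_target_data(lifetimes, peak_step):
    """lifetimes: data name -> (generation step, application step).
    Return the data generated before the peak and applied after it."""
    return [
        name
        for name, (generated, applied) in lifetimes.items()
        if generated < peak_step < applied
    ]

# Hypothetical lifetimes for three pieces of operation data; the peak is at step 2.
lifetimes = {"a": (0, 3), "b": (1, 2), "c": (2, 3)}
targets = select_target_data(lifetimes, peak_step=2)
```

Only `a` qualifies: it sits in video memory throughout the peak although it is not needed until afterwards, which makes it a candidate for adjustment.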
The method according to claim 4 or 5, wherein said adjusting the first calculation graph based on the target operation data to generate the at least one second calculation graph comprises:

determining, in the first calculation graph, a target operator corresponding to the target operation data; and

adjusting the target operator in the first calculation graph based on the occurrence time of the video memory peak value in the first calculation graph, to generate the at least one second calculation graph.
The method according to claim 6, wherein said adjusting the target operator in the first calculation graph based on the occurrence time of the video memory peak value in the first calculation graph, to generate the at least one second calculation graph, comprises:

adjusting, in the first calculation graph, the execution time of the target operator to after the occurrence time of the video memory peak value, to generate the second calculation graph.
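One possible reading of this adjustment is a reordering of the operator queue, sketched below. It assumes the graph is a list of operators in execution order, that the peak is identified by the index of the operator at which it occurs, and that a move is rejected when it would break a data dependency. The function `move_after_peak` and the sample operators are hypothetical; the disclosure may realize the adjustment differently.

```python
def move_after_peak(ops, target, peak_step):
    """ops: list of (name, inputs, outputs) in execution order.
    Return a new order with `target` placed right after the peak operator,
    or None if the move is unnecessary or would break a data dependency."""
    index = next(i for i, op in enumerate(ops) if op[0] == target)
    if index >= peak_step:
        return None  # already at or after the peak; nothing to postpone
    moved = (ops[:index] + ops[index + 1:peak_step + 1]
             + [ops[index]] + ops[peak_step + 1:])
    produced_by_graph = {t for _, _, outputs in ops for t in outputs}
    available = set()
    for _, inputs, outputs in moved:
        # Every graph-produced input must exist before the operator runs.
        if any(t in produced_by_graph and t not in available for t in inputs):
            return None
        available.update(outputs)
    return moved

ops = [
    ("load_w", [], ["w"]),
    ("conv", ["x"], ["a"]),
    ("pool", ["a"], ["b"]),
    ("fc", ["b", "w"], ["y"]),
]
second_graph = move_after_peak(ops, "load_w", peak_step=2)
names = [op[0] for op in second_graph]
```

Here `w` is only consumed by `fc`, so postponing `load_w` until after the peak at `pool` keeps `w` out of video memory while the peak occurs. Postponing `conv` is rejected because `pool` needs its output.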
The method according to any one of claims 1 to 7, wherein said determining a target calculation graph among the at least one second calculation graph based on the video memory peak value and the running time of the at least one second calculation graph comprises:

obtaining a preset video memory overhead and a preset trade-off ratio, wherein the preset trade-off ratio is used to weigh the running time of a calculation graph against the video memory it requires;

scoring, based on the preset video memory overhead and the preset trade-off ratio, the video memory peak value and running time of each second calculation graph together with the running time of the corresponding first calculation graph, to obtain a scoring result of each second calculation graph;

sorting the second calculation graphs in the at least one second calculation graph based on the scoring result of each second calculation graph, to obtain a sorted queue; and

determining the target calculation graph among the at least one second calculation graph based on the sorted queue.
The method according to claim 8, wherein said scoring, based on the preset video memory overhead and the preset trade-off ratio, the video memory peak value and running time of each second calculation graph together with the running time of the corresponding first calculation graph, to obtain a scoring result of each second calculation graph, comprises:

determining a video memory score of each second calculation graph based on the video memory peak value of each second calculation graph, the preset video memory overhead, and the preset trade-off ratio;

determining a running time score of each second calculation graph based on the preset trade-off ratio, the running time of each second calculation graph, and the running time of the corresponding first calculation graph; and

determining the scoring result of each second calculation graph based on the video memory score and the running time score of each second calculation graph.
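The claim does not state a scoring formula, so the following sketch uses an assumed one purely for illustration: the video memory score penalizes the part of the peak that exceeds the preset overhead, the running time score penalizes the relative slowdown against the first calculation graph, and the trade-off ratio weights the two. Lower scores are treated as better. All function names, the formula, and the numbers are hypothetical.

```python
def memory_score(peak, budget, ratio):
    # Assumed formula: penalize only the part of the peak above the budget.
    return ratio * max(0.0, peak - budget) / budget

def runtime_score(runtime, base_runtime, ratio):
    # Assumed formula: penalize the slowdown relative to the first graph.
    return (1.0 - ratio) * (runtime - base_runtime) / base_runtime

def score(peak, runtime, base_runtime, budget, ratio):
    # Combined scoring result; lower is better in this sketch.
    return (memory_score(peak, budget, ratio)
            + runtime_score(runtime, base_runtime, ratio))

budget, ratio, base_runtime = 100.0, 0.5, 10.0
candidates = {
    "graph_a": (120.0, 10.5),  # (video memory peak, running time)
    "graph_b": (90.0, 12.0),
}
scores = {
    name: score(peak, runtime, base_runtime, budget, ratio)
    for name, (peak, runtime) in candidates.items()
}
ranking = sorted(scores, key=scores.get)  # sorted queue, best first
```

With these numbers `graph_b` ranks first: it fits within the budget, and its slowdown costs less than the memory overrun of `graph_a` under an even trade-off ratio.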
The method according to claim 8 or 9, wherein said determining the target calculation graph among the at least one second calculation graph based on the sorted queue comprises:

searching the sorted queue of the at least one second calculation graph for a first candidate calculation graph with the best scoring result; and

in response to the video memory space required by the first candidate calculation graph satisfying the preset video memory overhead, determining the first candidate calculation graph as the target calculation graph.
The method according to claim 10, wherein said determining the target calculation graph among the at least one second calculation graph based on the sorted queue further comprises:

in response to the video memory space required by the first candidate calculation graph not satisfying the preset video memory overhead, adjusting the first candidate calculation graph based on the target operation data of the first candidate calculation graph, to obtain at least one third calculation graph;

updating the sorted queue based on the scoring result of the at least one third calculation graph, to obtain an updated sorted queue;

determining, in the updated sorted queue, whether the video memory space required by a second candidate calculation graph with the best scoring result satisfies the preset video memory overhead; and

in response to the video memory space required by the second candidate calculation graph not satisfying the preset video memory overhead and the number of searches reaching a preset threshold, determining the calculation graph with the best scoring result in the sorted queue corresponding to the last search as the target calculation graph.
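The search described in claims 10 and 11 can be sketched as a bounded best-first loop. The sketch assumes that a lower score is better, that the best candidate is removed from the queue once it has been adjusted, and that scoring, peak measurement, and adjustment are supplied as callbacks. The function `search_target` and the toy graphs, represented as `(peak, running time)` tuples, are hypothetical.

```python
import heapq
from itertools import count

def search_target(candidates, budget, max_searches, score, peak_of, adjust):
    """Return the first best-scoring graph that fits the budget, or the best
    graph seen in the last search once `max_searches` is reached."""
    tie = count()  # tie-breaker so the heap never compares graphs directly
    queue = [(score(g), next(tie), g) for g in candidates]
    heapq.heapify(queue)
    best = None
    for _ in range(max_searches):
        if not queue:
            break
        _, _, best = queue[0]  # best-scoring graph in the current queue
        if peak_of(best) <= budget:
            return best
        heapq.heappop(queue)   # assumption: replace it with its adjustments
        for g in adjust(best):
            heapq.heappush(queue, (score(g), next(tie), g))
    return best

# Toy setup: each adjustment trades 20 units of memory for 1 unit of time.
score = lambda g: g[0] + 5 * g[1]
peak_of = lambda g: g[0]
adjust = lambda g: [(g[0] - 20, g[1] + 1)]
start = [(130, 10), (150, 9)]

target = search_target(start, 100, 5, score, peak_of, adjust)
fallback = search_target(start, 100, 2, score, peak_of, adjust)
```

With five searches allowed, the loop reaches a graph that fits the budget. With only two, it stops early and returns the best graph in the queue at the last search, even though that graph still exceeds the budget.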
The method according to any one of claims 1 to 11, wherein said determining, based on the target calculation graph, the video memory space required by the preset network model comprises:

determining the video memory space required by the target calculation graph as the video memory space required for training the preset network model.
A video memory optimization apparatus, wherein the apparatus comprises:

a first generation module, configured to generate a first calculation graph based on a preset network model;

a first determination module, configured to determine an association relationship between a video memory peak value of the first calculation graph and operation data;

a second generation module, configured to adjust the first calculation graph based on the association relationship to generate at least one second calculation graph;

a second determination module, configured to determine a target calculation graph among the at least one second calculation graph based on the video memory peak value and the running time of the at least one second calculation graph; and

a third determination module, configured to determine, based on the target calculation graph, the video memory space required by the preset network model.
The apparatus according to claim 13, wherein the first generation module comprises:

a first generation subpart, configured to generate calculation graph information in a data exchange format based on the preset network model; and

a second generation subpart, configured to generate, based on an operator queue in the calculation graph information, a first calculation graph matching the calculation graph information.
The apparatus according to claim 13 or 14, wherein the first determination module comprises:

a first determination subpart, configured to determine the occurrence time of the video memory peak value in the first calculation graph;

a second determination subpart, configured to determine the operation data of the operators in the first calculation graph; and

a third determination subpart, configured to determine the generation time of the operation data and the application time of the operation data in the first calculation graph, and to determine the timing relationship among the generation time, the application time, and the occurrence time of the video memory peak value as the association relationship.
The apparatus according to any one of claims 13 to 15, wherein the second generation module comprises:

a fourth determination subpart, configured to determine, in the first calculation graph, target operation data whose association relationship satisfies a preset condition; and

a first adjustment subpart, configured to adjust the first calculation graph based on the target operation data to generate the at least one second calculation graph.
The apparatus according to claim 16, wherein the fourth determination subpart comprises:

a first determination unit, configured to determine, among the operation data of the operators in the first calculation graph, the operation data whose generation time is before the occurrence time of the video memory peak value and whose application time is after the occurrence time of the video memory peak value as the target operation data satisfying the preset condition.
A computer storage medium, wherein computer-executable instructions are stored on the computer storage medium, and the computer-executable instructions, when executed, implement the video memory optimization method according to any one of claims 1 to 12.
A computer device, wherein the computer device comprises a memory and a processor, the memory stores computer-executable instructions, and the processor, when running the computer-executable instructions stored in the memory, implements the video memory optimization method according to any one of claims 1 to 12.
A computer program product, comprising a computer program or instructions which, when run on an electronic device, cause the electronic device to execute the video memory optimization method according to any one of claims 1 to 12.