CN116974772B

CN116974772B - Resource optimization and carbon emission reduction method and equipment for large language model

Info

Publication number: CN116974772B
Application number: CN202311224175.6A
Authority: CN
Inventors: 闫月君; 王朝阳; 刘文宇
Original assignee: Alibaba Cloud Computing Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2023-09-21
Filing date: 2023-09-21
Publication date: 2024-02-27
Anticipated expiration: 2043-09-21
Also published as: CN116974772A

Abstract

The embodiment of the application provides a method and equipment for optimizing resources and reducing carbon emission aiming at a large language model. Aiming at the pre-training process of the large language model, based on the carbon emission intensity dynamically changed in the selected time interval, determining a pre-training start-stop time node in the time interval and GPU resource quantity regulation parameter values in the pre-training period for the large language model, so that the pre-training carbon emission estimated total quantity corresponding to the large language model in the time interval reaches the specified requirement. In this way, the GPU resource quantity regulation parameter values under the pre-training start-stop time node and each time node in the pre-training period can be regulated and controlled reasonably based on the carbon emission intensity in the selected time interval, so that the GPU resource quantity used for the pre-training of the large language model is regulated and controlled dynamically, the pre-training task is further segmented reasonably to the relevant time nodes in the selected time interval, and the carbon emission quantity generated by the pre-training of the large language model can be reduced effectively.

Description

Resource optimization and carbon emission reduction method and equipment for large language model

Technical Field

The application relates to the technical field of computers, in particular to a method and equipment for optimizing resources and reducing carbon emission aiming at a large language model.

Background

With the release of various types of generative AI, large language models (Large Language Model, LLM) are becoming research hotspots and beginning to be widely used in various fields. In recent years, the parameter scale of a large language model is continuously increased, and the model effect is greatly improved, however, a larger model means larger calculation resource requirements, and further means larger energy requirements and more carbon emission.

At present, aiming at the problems of energy consumption and carbon emission optimization of a large language model, the existing research scheme mostly reduces the calculation energy consumption by means of power consumption limitation of a chip, optimization of a neural network model structure or selection of a processor with lower energy consumption and higher efficiency, but the existing research scheme does not deeply research a low-carbon optimization regulation strategy when the large language model is pre-trained on a GPU.

Therefore, there is a need to provide a better carbon abatement scheme for large language models.

Disclosure of Invention

Aspects of the present application provide a method and apparatus for optimizing resources and reducing carbon emissions for a large language model, so as to better reduce carbon emissions generated by pretraining the large language model by optimizing and controlling GPU resources.

The embodiment of the application provides a resource optimization and carbon emission reduction method for a large language model, which comprises the following steps:

Selecting a time interval for developing pre-training for a large language model to be processed;

determining GPU resource quantity regulation parameter values under pre-training start-stop time nodes in the time interval and all time nodes in the pre-training period for the large language model based on the carbon emission intensity dynamically changed in the time interval, so that the pre-training carbon emission estimated total quantity corresponding to the large language model in the time interval reaches a specified requirement;

and dynamically regulating and controlling the GPU resource quantity used for pre-training the large language model in the time interval according to the pre-training start-stop time node and the GPU resource quantity regulating and controlling parameter values.

Embodiments of the present application also provide a computing device including a memory and a processor;

the memory is used for storing one or more computer instructions;

the processor is coupled to the memory for executing the one or more computer instructions for performing the resource optimization and carbon emission reduction methods described above for the large language model.

Embodiments also provide a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the foregoing resource optimization and carbon emission reduction method for large language models.

In the embodiment of the application, for a pre-training process of a large language model, determining a pre-training start-stop time node in a time interval and GPU resource quantity regulation parameter values under each time node in the pre-training period for the large language model based on the carbon emission intensity dynamically changed in the selected time interval, so that the pre-training carbon emission estimated total quantity corresponding to the large language model in the time interval reaches a specified requirement; based on the above, the GPU resource amount used for pre-training the large language model can be dynamically regulated and controlled in the time interval according to the pre-training start-stop time node and the regulating parameter values of each GPU resource amount determined in the foregoing. In this way, the GPU resource quantity regulation parameter values under the pre-training start-stop time node and each time node in the pre-training period can be regulated and controlled reasonably based on the carbon emission intensity in the selected time interval, so that the GPU resource quantity used for the pre-training of the large language model is regulated and controlled dynamically, the pre-training task is further segmented reasonably to the relevant time nodes in the selected time interval, and the carbon emission quantity generated by the pre-training of the large language model can be reduced effectively.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 is a schematic flow chart of a method for optimizing resources and reducing carbon emissions for a large language model according to an exemplary embodiment of the present application;

FIG. 2 is a logic diagram of a method for resource optimization and carbon emission reduction for large language models according to an exemplary embodiment of the present application;

fig. 3 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application.

Detailed Description

For the purposes, technical solutions and advantages of the present application, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

FIG. 1 is a flow chart of a method for resource optimization and carbon emission reduction for large language models, which may be performed by a data processing apparatus, which may be implemented as a combination of software and/or hardware, which may be integrated in a computing device, according to an exemplary embodiment of the present application.

The resource optimization and carbon emission reduction method for the large language model provided by the embodiment can be applied to various scenes in which carbon emission reduction is required to be carried out on the large language model, and the embodiment is not limited to the application scenes. In general, the large language model needs to be deployed in a computing system including numerous servers, where in practical application, the computing system may be a cloud data center, in this case, a computing device for executing the resource optimization and carbon emission reduction method for the large language model in this embodiment may be a resource scheduling device deployed in the data center, and the resource optimization and carbon emission reduction method for the large language model provided in this embodiment may be a newly added function module in the resource scheduling device deployed in the data center. Of course, the computing system in this embodiment may also be other types of systems, such as an enterprise room, etc. The computing device in this embodiment may be a device in the computing system that has global GPU resource management rights. In addition, the computing device in this embodiment may be a single server, or may be a server cluster, and the physical implementation form of the computing device is not limited in this embodiment.

The inventors found during the course of the study that carbon emissions from large language models were mainly produced in three phases: firstly, carbon emission in the equipment manufacturing stage, secondly, carbon emission in the model pre-training stage, and thirdly, carbon emission in the model application and reasoning stage. Wherein the carbon emissions of the device manufacturing stage are contained in hardware devices such as servers, and the carbon emissions of the device manufacturing stage are determined. The energy consumption of the model application and reasoning stage is then closely related to the size and access requirements of the user. Because the problem needs of users need to respond in time, the carbon emission is difficult to realize optimal control. While the pre-training phase of large language models requires greater computational resources, it typically has a longer training period (varying from weeks to months) and is more time intensive than the application phase. Therefore, in this embodiment, it is proposed to explore how to reduce the carbon emissions generated by the pretraining of the large language model in the pretraining stage of the large language model.

FIG. 2 is a logic diagram of a method for resource optimization and carbon emission reduction for a large language model according to an exemplary embodiment of the present application. Referring to FIG. 2, GPU resources may be used in the present embodiment to carry pre-training tasks for large language models. The GPU resources may be assembled on a server, and in this embodiment, the method provided in this embodiment may be executed by a computing device to centrally manage all GPU resources in the system.

Referring to fig. 1, the method for optimizing resources and reducing carbon emissions for a large language model provided in this embodiment may include:

step 100, selecting a time interval for developing pre-training for a large language model to be processed;

step 101, determining GPU resource quantity regulation parameter values under pre-training start-stop time nodes in a time interval and all time nodes in a pre-training period for a large language model based on the carbon emission intensity dynamically changed in the time interval, so that the pre-training carbon emission estimated total quantity corresponding to the large language model in the time interval reaches a specified requirement;

and 102, dynamically regulating and controlling the GPU resource amount for pre-training the large language model in a time interval according to the pre-training start-stop time node and each GPU resource amount regulating and controlling parameter value.

In practical applications, a time delay threshold is usually specified for the pre-training task of the large language model, and in step 100, the time interval selected for developing the pre-training for the large language model should be greater than the preset time delay threshold, and of course, more filtering conditions may be added to select a suitable time interval for the large language model to be processed, where the filtering conditions are not exhaustive.

On this basis, resource optimization and carbon emission reduction schemes for large language models may be determined over a selected time interval.

With continued reference to fig. 1, in step 101, GPU resource amount adjustment parameter values for the pre-training start-stop time nodes within the time interval and for each time node during the pre-training may be determined for the large language model. It should be appreciated that step 101 is a parameter setting step that is completed prior to performing the GPU resource conditioning operation. That is, before step 102 is performed, parameters used in performing the GPU resource adjustment operation are determined based on step 101, so that the GPU resource adjustment operation in step 102 may be performed according to the parameters determined in advance in step 102.

In this embodiment, the unit corresponding to the time node defined in the time interval is not limited, and the unit corresponding to the time node in this embodiment may be days, weeks, or hours, which is not meant to be exhaustive.

In this embodiment, it is proposed that, based on the carbon emission intensity dynamically changed in the selected time interval, GPU resource amount control parameter values under the pre-training start-stop time node in the selected time interval and each time node during the pre-training period may be determined for the large language model, and the GPU resource amount control parameter values under the pre-training start-stop time node and each time node during the pre-training period determined for the large language model need to enable the pre-training carbon emission estimated total amount corresponding to the large language model in the selected time interval to reach the specified requirement. The pre-training carbon emission estimated total amount refers to the carbon emission estimated total amount required to be generated after the pre-training of the large language model is completed. The specified requirements may be that the pre-trained carbon emissions estimate be the lowest in total, etc.

Here, the carbon emission intensity refers to an amount of carbon emission required to produce one unit of energy, for example, in the present embodiment, the GPU resource consumes electric energy, and thus, the carbon emission intensity herein may refer to an amount of carbon emission required to produce one degree of electric energy. In practice, a wide variety of energy sources are used to produce energy, including but not limited to light, wind, water, coal, and the like. The carbon emission intensity corresponding to each of the different types of energy sources is different, and the energy source proportion relationship is also dynamically changed, which results in that the carbon emission intensity corresponding to the electric energy used by the GPU resource at different time nodes is also dynamically changed. In this embodiment, the carbon emission intensity at each time node in the selected time interval may be determined according to the energy usage of the electric energy docked by the computing system in which the large language model is located at different time nodes. Alternatively, a weighted average may be calculated for the carbon emission intensity of each type of energy source used at a single time node as the carbon emission intensity at the corresponding time node in step 101. In addition, in the present embodiment, the carbon emission intensity at each time node used in step 101 is a predicted value, and the present embodiment is not limited to the prediction scheme adopted, and any scheme that can predict the carbon emission intensity of a future time node occurring now or in the future may be adopted in the present embodiment.

The GPU resource amount regulation parameter value in the embodiment is used for controlling the GPU resource amount used for performing the large language model pretraining under the corresponding time node. That is, it can be used as a basis for regulating the amount of GPU resources and dynamically changes in a selected time interval. The pre-training start-stop time node in the embodiment is used as a control basis for the start-stop state of the large language model pre-training task. The parameters used to characterize the pre-training start-stop time node in this embodiment may be varied, as will be exemplified later. The GPU resource amount adjusting parameters that can be used in the present embodiment will also be described later, and are not limited herein.

After the parameter determination process in step 101 is completed, the execution of the pre-training task for the large language model may be started. Referring to fig. 1, in step 102, relevant parameter values may be determined in step 101, and a task start-stop state of a large language model pre-training task may be controlled in a selected time interval, and an amount of GPU resources used for pre-training at each time node during the pre-training period may be controlled. It should be appreciated that in practical applications, the amount of GPU resources for pre-training will not be provided during non-pre-training.

In this way, the GPU resource quantity used for the large language model pre-training at different time nodes can be dynamically regulated and controlled in the selected time interval by combining the pre-training start-stop time nodes and the GPU resource quantity regulating and controlling parameter values under all time nodes in the pre-training period, and the regulating and controlling can enable the pre-training carbon emission estimated total quantity corresponding to the large language model in the selected time interval to reach the specified requirement.

To sum up, in this embodiment, for a pre-training process of a large language model, based on a carbon emission intensity dynamically changed in a selected time interval, GPU resource amount regulation parameter values at a pre-training start-stop time node in the time interval and at each time node in a pre-training period are determined for the large language model, so that a pre-training carbon emission estimated total amount corresponding to the large language model in the time interval reaches a specified requirement; based on the above, the GPU resource amount used for pre-training the large language model can be dynamically regulated and controlled in the time interval according to the pre-training start-stop time node and the regulating parameter values of each GPU resource amount determined in the foregoing. In this way, the GPU resource quantity regulation parameter values under the pre-training start-stop time node and each time node in the pre-training period can be regulated and controlled reasonably based on the carbon emission intensity in the selected time interval, so that the GPU resource quantity used for the pre-training of the large language model is regulated and controlled dynamically, the pre-training task is further segmented reasonably to the relevant time nodes in the selected time interval, and the carbon emission quantity generated by the pre-training of the large language model can be reduced effectively.

In the above or the following embodiments, in step 101, the problem of determining GPU resource amount control parameter values at the pre-training start-stop time nodes in the time interval and each time node during the pre-training for the large language model may be converted into a solution problem for the objective function by constructing the objective function.

In step 101, the pre-training carbon emission estimated total amount corresponding to the large language model in the selected time interval can reach the specified requirement, and an objective function is constructed, wherein the product of the carbon emission intensity and the energy consumption total amount in a single time node in the time interval is used for representing the carbon emission estimated amount in the corresponding time node in the objective function, and the energy consumption total amount in the single time node is used for using the GPU resource amount regulation parameter and the parameter for representing the pre-training start-stop time node as the influence factors; and solving an objective function to determine a pre-training start-stop time node and each GPU resource quantity regulation parameter value.

As mentioned above, the carbon emission intensity at a single time node in the present embodiment may refer to the carbon emission amount required to generate one degree of electric energy, and the total amount of energy consumption at a single time node in the objective function may refer to the degree of electric energy consumed by the large language model at a single time node. Thus, the product of the two can represent the predicted amount of carbon emissions at a single time node. Thus, in the objective function, the carbon emission pre-estimates at each time node during the pre-training period may be summed to characterize the corresponding pre-trained carbon emission pre-estimated total amount of the large language model over the selected time interval.

In addition, parameters to be solved in the objective function may include parameters for characterizing the pre-training start-stop time nodes and GPU resource amount regulation parameters. The two types of parameters are used as the influence factors of the total energy consumption under a single time node in the objective function, so that the parameter values corresponding to the parameters to be solved can be obtained by solving the objective function.

Two optional GPU resource amount adjusting parameters are provided below, and the resource optimization and carbon emission reduction schemes for the large language model provided in this embodiment are described with respect to the two GPU resource amount adjusting parameters respectively.

1. The GPU resource amount adjustment parameters in this embodiment may include GPU parallelism.

The GPU parallelism may refer to the number of GPUs running at a single time node.

In this case, one exemplary scheme for constructing the objective function is: the method comprises the steps that a first energy consumption value caused by a single GPU in an operation state under a single time node and a second energy consumption value caused by the single GPU in an idle state under the single time node can be obtained; characterizing energy consumption generated on the GPU under the corresponding time node in the time interval based on the GPU parallelism under the single time node, the first energy consumption value and the second energy consumption value; and characterizing the total energy consumption under the corresponding time node by combining the parameters for characterizing the pre-training start-stop time node and the energy consumption generated on the GPU under the characterized single time node so as to construct an objective function.

In the exemplary scheme, the method comprises the steps of representing the total energy consumption under a single time node based on the GPU parallelism and parameters for representing the pre-training start-stop time node, so as to embody the influence of the GPU parallelism and the pre-training start-stop time node on the total energy consumption in an objective function. Wherein a first power consumption value caused by a single GPU in an operational state under a single time node and a second power consumption value caused by a single GPU in an idle state under a single time node are introduced.

Based on this, in this exemplary scenario, an exemplary expression of an objective function may be:

wherein,represents the carbon emission intensity of the t-time node, +.>Representing the GPU parallelism at the t-time node,representing a first energy consumption value,/->Representing the self-energy consumption value of the GPU under a single time node when the GPU is in an operating state, +.>Then representing the energy consumption value of the single time node caused by other parts on the server where the GPU is located when the GPU is in an operating state; />Indicating the number of idle GPUs at the t-time node, N being the total number of GPUs, and accordingly,representing a second energy consumption value,/->Representing the self-energy consumption value of the GPU in an idle state under a single time node, +. >Then representing the power consumption value of a single time node caused by other parts of the GPU on the server where it is located while the GPU is in an idle state; />And->For characterizing the parameters of the pre-training start-stop time node, wherein +.>Indicating whether the t-time node is during pre-training, < ->Indicating whether the t-time node is caused to generate a server startup event due to the start of pre-training, and +.>The method comprises the steps of representing a starting-up energy consumption value caused by a starting-up event of a server; t represents the task delay corresponding to the pre-trained start-stop time node.

Above mentionedAnd->Empirical values may be used for different types of GPU +.>It is possible to take on values that are different,different values are also possible. In addition, a->And->The ratio is usually +.>In practical application, the ++can also be used>Take the value as the empirical value by setting +.>To calculate +.>. In this case, the ++can be set as needed according to the influence factors such as the GPU model>Is a value of (a). For example, a->The value is 3, etc., and is not limited herein.

In addition, in the practical application,the value of (2) may be 0 or 1, (-)>A1 indicates that the t-time node is in the pre-training period, < >>A value of 0 indicates that the node is not during pre-training at time t; />Has a value of 0 or 1,/o>1 indicates that a server power-on event occurs at the node at time t due to pre-training initiation,/- >A0 indicates that a server power-on event has not occurred at time t.

It should be understood that the above-mentioned expression of the objective function is only exemplary, and other expressions may be used in the present embodiment to reflect the influence of the GPU parallelism and the pre-training start-stop time node on the total energy consumption. For example, only considerations may be taken into account in the characterization of the first energy consumption valueWithout being considered->. For another example, only +_may be considered in the characterization of the second energy consumption value>Without being considered->. For another example, in addition to the foregoing self power consumption value, the power consumption value of other parts on the server, and the power on power consumption value, the power consumption value of other dimensions that may be affected by the GPU parallelism or the pre-training start-stop time node may be increased, so as to synthesize the power consumption values of multiple dimensions to represent the total amount of power consumption under a single time node. Also for example, there is no need to use +.>The task start-stop state is characterized under each time node, T is not required to be used as the upper limit value of T, and a plurality of values of T can be directly limited, namely, each time node directly in the pre-training period is used as the value of T in the objective function. No further examples are presented here.

It can be known that, in this embodiment, the energy consumption values of various dimensions that may be caused by performing the pre-training of the large language model under a single time node may be considered in the objective function, and the GPU parallelism or the influence of the pre-training start-stop time node on the energy consumption values of the corresponding dimensions may be reflected, so as to characterize the total energy consumption under the single time node. In this way, the objective function is solved by combining the carbon emission intensity dynamically changed in the selected time interval, so that the GPU parallelism degree in the time interval of pre-training start-stop time nodes and each time node in the pre-training period can be determined for the large language model.

On the basis, the GPU parallelism can be dynamically regulated and controlled in a selected time interval according to the solved starting and ending time nodes of the pre-training and the GPU parallelism under each time node in the pre-training period, so that the GPU resource regulation and control can be realized.

2. The GPU resource amount adjustment parameters in this embodiment may include the frequency and voltage of the GPU.

The GPU frequency may refer to a GPU master frequency, i.e., a clock frequency. Clock frequency refers to the frequency at which a clock generator of a processor generates pulses that are used to synchronize the operation of processor components and are used as an indicator of processor speed. Measured in hertz (Hz). In general, a higher frequency means a faster computing speed of the processor. GPU voltage refers to the voltage allocated to a single GPU at a single time node. For ease of computation, in this embodiment, at the same time node, all GPUs in the computing system may be run on the same frequency and voltage by default.

In this case, one exemplary scheme for constructing the objective function is: characterizing energy consumption generated on the GPU under a single time node based on the GPU frequency and voltage under the single time node; and characterizing the total energy consumption under the corresponding time node by combining the parameters for characterizing the pre-training start-stop time node and the energy consumption generated on the GPU under the characterized single time node so as to construct an objective function.

Wherein optionally, in this exemplary scenario, a relationship between the energy consumption generated on the GPU at a single time node and the GPU frequency and voltage may be determined according to a dynamic voltage and frequency scaling principle (DVFS), and in accordance with this relationship, in the objective function, the energy consumption generated on the GPU at a single time node is characterized based on the GPU frequency and voltage at a single time node. Under the dynamic voltage and frequency adjustment principle, the energy consumption generated on the GPU under a single time node is influenced by the frequency and the voltage of the GPU at the same time, and has a proportional relation with the frequency and the voltage of the GPU.

In the exemplary scheme, the total energy consumption under a single time node is represented based on GPU frequency and voltage and parameters for representing the pre-training start-stop time node, so that the influence of the GPU frequency and voltage and the pre-training start-stop time node on the total energy consumption is represented in an objective function.

and is also provided with；/>

Wherein,represents the GPU frequency at the t-time node, +.>Representing GPU voltage at a t-time node, and N represents the total number of GPUs; />Representing the self-energy consumption value of a single GPU at the t-time node, >Representing the power consumption value caused by the single GPU at the t-time node at the other part of the server where it is located,/->Representation->And->The proportional relation between the two; />And->For characterizing the parameters of the pre-training start-stop time node, wherein +.>Indicating whether the t-time node is during pre-training, < ->Indicating whether the t-time node is caused to generate a server startup event due to the start of pre-training, and +.>The method comprises the steps of representing a starting-up energy consumption value caused by a starting-up event of a server; t represents the task delay corresponding to the pre-training start-stop time node, < >>Can be regarded as a constant.

Also, in the practical application,the value of (2) may be 0 or 1, (-)>A1 indicates that the t-time node is in the pre-training period, < >>A value of 0 indicates that the node is not during pre-training at time t; />Has a value of 0 or 1,/o>1 indicates that a server power-on event occurs at the node at time t due to pre-training initiation,/->A0 indicates that a server power-on event has not occurred at time t.

In this exemplary expression, the relationship between the energy consumption generated on the GPU at the t-time node and the GPU frequency and voltage is characterized as follows, according to the dynamic voltage and frequency adjustment principle pairBased on->And->Proportional relation between them, it is possible to add +. >Characterized by:

this causes the objective function to change to have the GPU frequency and voltage and the parameters used to characterize the pre-training start-stop time nodes as parameters to be solved.

It should be understood that the above-described expressions of objective functions are exemplary only, the presentOther expressions may be used in embodiments to reflect the impact of GPU frequency and voltage and pre-training start-stop time nodes on the total amount of energy consumption. For example, consider onlyWithout being considered->. For another example, in addition to the foregoing self energy consumption value, the energy consumption value generated by other parts on the server, and the power on energy consumption value, more energy consumption values of other dimensions that may be affected by GPU frequency, GPU voltage, or pre-training start-stop time nodes may be added, so as to synthesize energy consumption values of multiple dimensions to characterize the total energy consumption under a single time node. Also for example, there is no need to use +.>The task start-stop state is characterized under each time node, T is not required to be used as the upper limit value of T, and a plurality of values of T can be directly limited, namely, each time node directly in the pre-training period is used as the value of T in the objective function. No further examples are presented here.

It can be known that, in this embodiment, the energy consumption values of various dimensions that may be caused by performing the pre-training of the large language model under a single time node may be considered in the objective function, and the influence of the GPU frequency, the GPU voltage, or the pre-training start-stop time node on the energy consumption values of the corresponding dimensions is reflected, so as to characterize the total energy consumption under the single time node. In this way, the objective function is solved by combining the dynamically changed carbon emission intensity in the selected time interval, so that the GPU frequency and voltage at the pre-training start-stop time node in the time interval and each time node in the pre-training period can be determined for the large language model.

On the basis, the GPU frequency and the voltage can be dynamically regulated and controlled in a selected time interval according to the solved starting and ending time nodes of the pre-training and the GPU frequency and the voltage under each time node in the pre-training period, so that the regulation and control of GPU resources are realized.

In the present embodiment, the two exemplary GPU resource amount adjustment parameters are provided, and it should be understood that other types of GPU resource amount adjustment parameters may be used in the present embodiment to dynamically adjust the amount of GPU resource provided to the large language model in a selected time interval.

Further, in this embodiment, in order to improve accuracy of GPU parallelism at the pre-training start-stop time node and each time node during the pre-training, which are solved for the large language model in this exemplary solution, constraint conditions may also be set for the objective function.

The constraint conditions set for the objective function in this embodiment may include, but are not limited to, the task delay corresponding to the pre-training start-stop time node not exceeding a preset delay threshold; and the amount of tasks that can be completed in the time interval is not less than the total amount of the designated tasks, etc.

The task delay corresponding to the pre-training start-stop time node may be T in an exemplary expression of applying the objective function, and the preset delay threshold refers to the maximum delay allowed by the large language model. By the constraint condition that the task time delay corresponding to the pre-training start-stop time node does not exceed the preset time delay threshold, the pre-training start-stop time node can be more reasonably determined, and the large language model is ensured to be completed on time.

In addition, the inventors have found during the course of the study that there is a linear functional relationship between the individual GPU's own power consumption value at a single time node and the amount of tasks that it can perform.

In the case where the GPU resource amount regulation parameters include GPU parallelism, such a linear function relationship may be characterized as:the task amount that can be completed after GPU resource control in this case can be characterized as follows:

in this way, the constraint condition that the task amount which can be completed after GPU resource regulation is performed according to the GPU parallelism at each time node as a large language model is not lower than the total amount of the designated tasks can be characterized as follows:

wherein,representing the total amount of tasks specified previously. In this case, the constraint condition is related to the GPU parallelism, so that the GPU parallelism can be constrained to be valued based on the constraint condition, so that the final value of the GPU parallelism is more reasonable.

In the case where the GPU resource amount adjustment parameters include GPU parallelism, this linear function relationship may be characterized as:and->The task amount that can be completed after GPU resource control in this case can be characterized as follows:

in this way, the constraint condition that the amount of tasks which can be completed after GPU resource regulation is performed according to the large language model with GPU frequency and voltage at each time node is not lower than the total amount of designated tasks can be characterized as follows:

In this case, the constraint condition is related to the frequency of the GPU, so that the GPU frequency can be constrained to be valued based on the constraint condition, so that the final value of the GPU frequency is more reasonable.

Of course, in addition to the two exemplary constraints described above, in this embodiment, other constraints may be set for the objective function.

For example, in the case where the GPU resource amount adjustment parameters include GPU parallelism, the GPU parallelism at each time node during the pre-training may also be set to be not higher than the total number of GPUs. The method can be specifically characterized as follows:，/>when the number of the organic light emitting diode is 0,the constraint is 0.

For another example, in the case where the GPU resource amount adjustment parameter includes GPU parallelism, it may also be set that the own energy consumption value of the single GPU at a single time node during pre-training is within the upper and lower limits of the energy consumption supported by the GPU. The method can be specifically characterized as follows:wherein->When 0, the drug is added>The constraint is 0.

For another example, in the case where the GPU resource amount adjustment parameter includes GPU parallelism, the GPU frequency at a single time node may also be set to be within the range of the upper and lower frequency limits supported by the GPU, and the voltage at the single time node is set to be within the range of the upper and lower voltage limits supported by the GPU. The method can be specifically characterized as follows: ；/>。

In addition, no matter what kind of parameters are adopted by the GPU resource quantity regulation parameters, constraint conditions set for the objective function can further comprise that the pre-training start-stop time nodes are consistent with start-stop time of a server where the GPU is located, and task start-stop occurs only once in a time interval. The constraint can be characterized specifically as:

and, in addition, the method comprises the steps of,

wherein,it can be indicated if a server shutdown event has occurred at the t-time node due to the end of the pretraining, +.>The value of (2) may be 0 or 1, wherein, < ->1 indicates that a server shutdown event occurred at the t-time node,/>A 0 indicates that a server shutdown event did not occur at the time t. By this constraint, only one pre-training start-stop and only one server start-stop can be constrained to occur within the selected time interval, and both remain consistent. The constraint condition can reasonably constrain the pre-training start-stop time nodes.

It should be noted that the constraint conditions in the present embodiment are not limited thereto, and other constraint conditions may be set without being limited thereto, and no further examples are given herein.

In this way, in this embodiment, the objective function may be solved based on the constraint condition, so as to obtain a solution result, where the solution result includes parameter values corresponding to parameters for characterizing the start-stop time node of the pre-training and GPU resource amount regulation parameter values under each time node during the pre-training. Through the constraint conditions, GPU resource quantity regulation and control parameter values under the pre-training start-stop time nodes in the time interval and each time node in the pre-training period can be determined more reasonably for the large language model, so that GPU resource quantity regulation and control can be performed more reasonably, and better carbon emission reduction effect can be obtained.

It should be noted that, in some of the above embodiments and the flows described in the drawings, a plurality of operations appearing in a specific order are included, but it should be clearly understood that the operations may be performed out of the order in which they appear herein or performed in parallel, the sequence numbers of the operations, such as 101, 102, etc., are merely used to distinguish between the various operations, and the sequence numbers themselves do not represent any execution order. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different application ends, energy consumption values, and the like, and do not represent the sequence, and are not limited to the "first" and "second" being different types.

Fig. 3 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application. As shown in fig. 3, the computing device includes: a memory 30 and a processor 31.

A processor 31 coupled to the memory 30 for executing the computer program in the memory 30 for:

and dynamically regulating and controlling the GPU resource quantity used for pre-training the large language model in the time interval according to the pre-training start-stop time node and the regulating and controlling parameter values of each GPU resource quantity.

In an alternative embodiment, processor 31 is configured to determine GPU resource amount adjustment parameter values for the large language model at pre-training start-stop time nodes in the time interval and at various time nodes during the pre-training based on the dynamically changing carbon emission intensity in the time interval:

constructing an objective function by taking the pre-training carbon emission estimated total amount corresponding to the large language model in the time interval as a target, wherein the product of the carbon emission intensity and the energy consumption total amount in a single time node in the time interval is used for representing the carbon emission estimated amount in a corresponding time node in the objective function, and the energy consumption total amount in the single time node is used for taking GPU resource amount regulation parameters and parameters for representing pre-training start-stop time nodes as influence factors;

And solving the objective function to determine the pre-training start-stop time node and the value of each GPU resource quantity regulation parameter.

In an alternative embodiment, the GPU resource amount adjustment parameters include GPU parallelism, and when the processor 31 targets that the total pre-trained carbon emission estimated amount corresponding to the large language model in the time interval reaches the specified requirement, the processor may be configured to:

acquiring a first energy consumption value caused by a single GPU in an operation state under a single time node and a second energy consumption value caused by the single GPU in an idle state under the single time node;

characterizing energy consumption generated on the GPU under the corresponding time node in the time interval based on the GPU parallelism under the single time node, the first energy consumption value and the second energy consumption value;

and combining parameters for representing the pre-training start-stop time nodes and the energy consumption generated on the GPU under the represented single time node, and representing the total energy consumption under the corresponding time node to construct the objective function.

In an alternative embodiment, the objective function may be characterized as:

wherein,represents the carbon emission intensity of the t-time node, +. >Representing the GPU parallelism at the t-time node,representing said first energy consumption value, < >>Representing the self-energy consumption value of the GPU under a single time node when the GPU is in an operating state, +.>Then representing the power consumption value of the other part on the server where the GPU is located at a single time node when the GPU is in an operational state; />Indicating the number of idle GPUs at the t time node, N being the total number of GPUs, correspondingly, < ->Representing said second energy consumption value, +.>Representing the self-energy consumption value of the GPU in an idle state under a single time node, +.>Then representing the power consumption value of the other part on the server where the GPU is located at a single time node when the GPU is in an idle state; />Andfor characterizing the parameters of the pre-training start-stop time node, wherein +.>Indicating whether the t-time node is atDuring pre-training, the person is strapped with->Indicating whether the t-time node is caused to generate a server startup event due to the start of pre-training, and +.>The method comprises the steps of representing a starting-up energy consumption value caused by a starting-up event of a server; t represents the task delay corresponding to the pre-training start-stop time node.

In an alternative embodiment, the GPU resource amount adjustment parameters include frequency and voltage of the GPU, and the processor 31 may be configured to, when targeting the pre-training estimated carbon emission amount corresponding to the large language model in the time interval to reach the specified requirement, construct an objective function:

Characterizing energy consumption generated on the GPU under a single time node based on the GPU frequency and voltage under the single time node;

In an alternative embodiment, the objective function is characterized by:

and is also provided with；/>

Wherein,represents the carbon emission intensity of the t-time node, +.>Representing under a t-time nodeGPU frequency->Representing GPU voltage at a t-time node, and N represents the total number of GPUs; />Representing the self-energy consumption value of a single GPU at the t-time node,>then the energy consumption value representing the other parts of the single GPU on the server where it is located at the t-time node,/->Representation->And->The proportional relation between the two; />And->For characterizing the parameters of the pre-training start-stop time node, wherein +.>Indicating whether the t-time node is during pre-training, < ->Indicating whether a server power-on event occurred at the t-time node due to a pre-training start-up,the method comprises the steps of representing a starting-up energy consumption value caused by a starting-up event of a server; t represents the task delay corresponding to the pre-training start-stop time node.

In an alternative embodiment, the processor 31, when solving the objective function, may be configured to:

configuring constraint conditions for the objective function, wherein the constraint conditions comprise that task time delay corresponding to a pre-training start-stop time node does not exceed a preset time delay threshold; the amount of tasks that can be completed in the selected time interval is not less than the total amount of the designated tasks;

and solving the objective function based on the constraint condition to obtain a solving result, wherein the solving result comprises parameter values corresponding to parameters for representing the pre-training start-stop time nodes and GPU resource quantity regulation parameter values under each time node in the pre-training period.

In an optional embodiment, the constraint condition further includes that start-stop time nodes of the pre-training and start-stop time nodes are consistent with start-stop time of a server where the GPU is located, and start-stop of the task only occurs once in the time interval.

In an alternative embodiment of the present invention,has a value of 0 or 1,/o>A1 indicates that the t-time node is in the pre-training period, < >>A value of 0 indicates that the node is not during pre-training at time t; />Has a value of 0 or 1,/o>1 indicates that a server power-on event occurs at the node at time t due to pre-training initiation,/->A0 indicates that a server power-on event has not occurred at time t.

Further, as shown in fig. 3, the computing device further includes: communication component 32, power component 33, and the like. Only some of the components are schematically shown in fig. 3, which does not mean that the computing device only includes the components shown in fig. 3.

It should be noted that, for the technical details of the embodiments of the computing device, reference may be made to the related descriptions of the embodiments of the method described above, which are not repeated herein for the sake of brevity, but should not cause a loss of protection scope of the present application.

Accordingly, embodiments of the present application also provide a computer-readable storage medium storing a computer program that, when executed, is capable of implementing the steps of the method embodiments described above that may be performed by a computing device.

The memory of FIG. 3 described above is used to store a computer program and may be configured to store various other data to support operations on a computing platform. Examples of such data include instructions for any application or method operating on a computing platform, contact data, phonebook data, messages, pictures, videos, and the like. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

The communication assembly of fig. 3 is configured to facilitate wired or wireless communication between the device in which the communication assembly is located and other devices. The device where the communication component is located can access a wireless network based on a communication standard, such as a mobile communication network of WiFi,2G, 3G, 4G/LTE, 5G, etc., or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further comprises a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.

The power supply assembly in fig. 3 provides power for various components of the device in which the power supply assembly is located. The power components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the devices in which the power components are located.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. that fall within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims

1. A method of resource optimization and carbon emission reduction for large language models, comprising:

constructing an objective function by taking the pre-training carbon emission estimated total quantity corresponding to the large language model in the time interval as an objective, wherein the product of the carbon emission intensity and the total energy consumption quantity in a single time node in the time interval is used for representing the carbon emission estimated quantity in a corresponding time node in the objective function, the total energy consumption quantity in a single time node is used for taking GPU resource quantity regulation parameters and parameters for representing pre-training start-stop time nodes as influence factors, and the carbon emission intensity refers to the carbon emission quantity required to be generated for producing one unit of energy;

Solving the objective function to determine a pre-training start-stop time node in the time interval and GPU resource quantity regulation parameter values during pre-training for the large language model;

2. The method of claim 1, wherein the GPU resource amount regulation parameters include GPU parallelism, and constructing an objective function with the aim that the total pre-trained carbon emission estimated amount corresponding to the large language model in the time interval reaches a specified requirement includes:

characterizing energy consumption caused by the GPU under the corresponding time node in the time interval based on the GPU parallelism under the single time node, the first energy consumption value and the second energy consumption value;

and combining parameters for representing the pre-training start-stop time nodes and the energy consumption caused by the GPU under the represented single time node, and representing the total energy consumption under the corresponding time node to construct the objective function.

3. The method of claim 2, the objective function characterized by:

wherein,represents the carbon emission intensity of the t-time node, +.>Representing the GPU parallelism at the t-time node,representing said first energy consumption value, < >>Representing the self-energy consumption value of the GPU under a single time node when the GPU is in an operating state, +.>Representing the energy consumption value of a single time node caused by other parts of the GPU on the server where the GPU is located when the GPU is in an operating state; />Indicating the number of idle GPUs at the t-time node, N being the total number of GPUs, and accordingly,representing said second energy consumption value, +.>Representing the self-energy consumption value of the GPU in an idle state under a single time node, +.>Representing the power consumption value of a single time node caused by other parts of the GPU on the server where the GPU is located when the GPU is in an idle state; />And->For characterizing the parameters of the pre-training start-stop time node, wherein +.>Indicating whether the t-time node is during pre-training, < ->Indicating whether a server power-on event occurred at the t-time node due to the start of pre-training,the method comprises the steps of representing a starting-up energy consumption value caused by a starting-up event of a server; t represents the task delay corresponding to the pre-training start-stop time node.

4. The method of claim 1, wherein the GPU resource amount regulation parameters include frequency and voltage of a GPU, and constructing an objective function with the goal that the pre-trained carbon emission estimated total amount corresponding to the large language model in the time interval reaches a specified requirement includes:

5. The method of claim 4, the objective function characterized by:

and is also provided with；/>

Wherein,represents the carbon emission intensity of the t-time node, +.>Represents the GPU frequency at the t-time node, +.>Representing GPU voltage at a t-time node, and N represents the total number of GPUs; />Representing the self-energy consumption value of a single GPU at the t-time node,>then it is indicated that the single GPU is taken at the t-time node at which it is locatedEnergy consumption value caused by other parts on the server,/->Representation->And->The proportional relation between the two; />And->For characterizing the parameters of the pre-training start-stop time node, wherein +. >Indicating whether the t-time node is during pre-training, < ->Indicating whether a server power-on event occurred at the t-time node due to a pre-training start-up,the method comprises the steps of representing a starting-up energy consumption value caused by a starting-up event of a server; t represents the task delay corresponding to the pre-training start-stop time node.

6. The method of claim 1, solving the objective function, comprising:

configuring constraint conditions for the objective function, wherein the constraint conditions comprise that task time delay corresponding to a pre-training start-stop time node does not exceed a preset time delay threshold; and the amount of tasks which can be completed in the time interval is not less than the total amount of the designated tasks;

7. The method of claim 6, the constraint further comprising pre-training start-stop time nodes to be consistent with start-stop opportunities of a server where the GPU is located, and task start-stops only occur once within the time interval.

8. The method according to claim 3 or 5, Has a value of 0 or 1,/o>A1 indicates that the t-time node is in the pre-training period, < >>A value of 0 indicates that the node is not during pre-training at time t; />Has a value of 0 or 1,/o>1 indicates that a server power-on event occurs at the node at time t due to pre-training initiation,/->A0 indicates that a server power-on event has not occurred at time t.

9. A computing device comprising a memory and a processor;

the memory is used for storing one or more computer instructions;

the processor is coupled to the memory for executing the one or more computer instructions for performing the resource optimization and carbon emission reduction method for a large language model of any one of claims 1-8.

10. A computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the resource optimization and carbon emission reduction method for large language models of any one of claims 1-8.