CN115408150A - Computing power measurement method, apparatus and related device

Info

Publication number
CN115408150A
Authority
CN
China
Prior art keywords
processing system
data processing
devices
operating state
xue
Prior art date
Legal status
Granted
Application number
CN202210915475.8A
Other languages
Chinese (zh)
Other versions
CN115408150B (en)
Inventor
王飞
宋秉华
崔金
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN115408150A
Application granted
Publication of CN115408150B
Legal status: Active

Classifications

    • G06F9/5044: Allocation of resources to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering hardware capabilities
    • G06F9/505: Allocation of resources to service a request, the resource being a machine, considering the load
    • G06F9/5072: Partitioning or combining of resources; grid computing
    • G06F9/5094: Allocation of resources where the allocation takes into account power or heat criteria
    • G06N3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N3/08: Neural networks; learning methods
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A computing power measurement method, suitable for use in a data processing system, comprising: obtaining constraints that affect the operation of the data processing system and a first operating state of the data processing system, the first operating state being indicative of the operating conditions of a plurality of devices included in the data processing system, the plurality of devices including at least one of computing devices, storage devices, or network devices; and determining the extensible fused resource utilization (XUE) of the plurality of devices according to a dynamic measurement mode and the acquired first operating state, where the dynamic measurement mode indicates the manner of determining the XUE of the plurality of devices by using the constraints. Because the computing power of the data processing system is measured according to the operating state of the data processing system and the constraints that affect its operation, the determined XUE can reflect the real performance of the data processing system under the influence of those constraints, which improves the practical value of the computing power result (such as the XUE).

Description

Computing power measurement method, apparatus and related device
This application claims priority to the Chinese patent application entitled "A method for adaptive measurement of data center computing power", filed with the China National Intellectual Property Administration on June 15, 2022, under application number 202210682182.X, which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of computer technologies, and in particular, to a computing power measurement method, an apparatus, and a related device.
Background
With the evolution of data centers from traditional data centers to green data centers, computing power data centers, and converged data centers, the computing power metrics of data centers are changing. For example, a traditional data center usually measures static hardware computing performance using operations per second (OPS) and floating-point operations per second (FLOPS). A green data center usually evaluates its energy utilization efficiency using power usage effectiveness (PUE), where the PUE is the ratio between the total energy consumption of the data center and the energy consumption actually used by the information technology (IT) equipment in the data center.
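The PUE ratio above can be illustrated in a couple of lines of code; the daily energy figures below are invented purely for illustration and are not taken from this application.

```python
# Illustrative PUE calculation with made-up daily energy figures.
total_facility_energy_kwh = 18000.0  # everything the facility draws, including cooling, lighting, etc.
it_equipment_energy_kwh = 15000.0    # energy actually consumed by the IT equipment

pue = total_facility_energy_kwh / it_equipment_energy_kwh
print(f"PUE = {pue:.2f}")            # 1.20; the closer to 1.0, the more energy-efficient the facility
```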
However, in an actual application scenario, because the number and load of the applications running in the data center (or another measured object) change constantly, a computing power result obtained for the data center based on the PUE approach often cannot truly reflect the resource utilization of the data center, which leads to low accuracy in evaluating the data center's resource utilization.
Disclosure of Invention
The application provides a computing power measurement method, so that the determined computing power result can reflect the real performance of the data processing system more comprehensively, thereby improving the practical value of the computing power metric for the data processing system. Corresponding apparatuses, computing devices, computer-readable storage media, and computer program products are also provided.
In a first aspect, the present application provides a computing power measurement method, which is applicable to a data processing system. Specifically, constraints that affect the operation of the data processing system are obtained, such as constraints on energy consumption, load, cost, and security level, together with a first operating state of the data processing system, where the first operating state indicates the operating conditions of a plurality of devices included in the data processing system, and the plurality of devices includes at least one of a computing device, a storage device, or a network device. Then, the XUE of the plurality of devices is determined according to a dynamic measurement mode and the acquired first operating state, where the dynamic measurement mode indicates the manner of determining the XUE of the plurality of devices by using the constraints, for example, measurement using an AI model, or measurement using a formula-based algorithm.
Because the computing power of the data processing system is measured according to its operating state and the constraints that affect its operation, the determined XUE can reflect the real performance of the data processing system under the influence of those constraints, which improves the practical value of the computing power result (such as the XUE).
Furthermore, when the data processing system is limited by multiple constraints, the computing power measurement apparatus measures the computing power of the data processing system based on all of those constraints, which avoids the limitation and one-sidedness of measuring computing power along a single dimension, more comprehensively reflects the real computing power of the data processing system, and improves the reliability of the computing power result.
In a possible embodiment, the acquired first operating state is an operating state after an automated operation and maintenance policy is executed on a plurality of devices in the data processing system, and before the first operating state of the data processing system is acquired, a second operating state of the data processing system may also be acquired, and according to the second operating state and a constraint condition, the automated operation and maintenance policy is executed on the plurality of devices in the data processing system, where the automated operation and maintenance policy is used to instruct operations to be executed on at least one of the plurality of devices, such as adjusting an operating parameter of a device; therefore, when measuring the XUE value of the data processing system, specifically, when the first operation state and the second operation state satisfy the measurement condition, the XUE of the multiple devices can be determined according to the dynamic measurement mode and the first operation state. Therefore, when the condition that the measurement condition is met is determined, namely the operation state of the data processing system is determined to meet the constraint condition, the computing power of the data processing system can be measured, and the XUE value obtained through measurement can reflect the real computing power of the data processing system under the constraint condition.
In a possible implementation manner, when the automated operation and maintenance strategy is executed on the multiple devices, the automated operation and maintenance strategy output by the AI model is obtained by using the AI model and performing inference according to the second operation state and the constraint condition, so as to execute the automated operation and maintenance strategy on the multiple devices. In this way, the AI model can be utilized to implement automated operation and maintenance of multiple devices in the data processing system, thereby enabling the data processing system to satisfy the constraint conditions during operation.
In a possible embodiment, when determining the XUE of the multiple devices, the method may specifically calculate, by using a reinforcement learning algorithm, whether the AI model converges according to the first operating state and the second operating state, and when the AI model converges, determine the XUE of the multiple devices output by the AI model according to the first operating state. Convergence of the AI model indicates that, under the adjustment of the automated operation and maintenance policy output by the AI model, the operating state of the data processing system is stable and satisfies the constraints; at that point, the measurement result obtained for the computing power of the data processing system, namely the XUE, can reflect its real computing power.
In one possible implementation, the reinforcement learning algorithm includes a Q learning algorithm.
In one possible embodiment, the AI model is constructed by a deep reinforcement learning algorithm, or an automatic reinforcement learning algorithm.
In one possible embodiment, the constraints include at least one of load constraints, cost constraints, energy consumption constraints, and security level constraints of the data processing system. In this way, the computing power measurement method is applicable to data processing systems under a variety of constraints, which improves its applicability across computing power measurement scenarios.
In a possible implementation manner, when the automation operation and maintenance policy is executed on the plurality of devices, specifically, the target action may be selected from an action space according to the automation operation and maintenance policy, where the action space includes at least one of an operation setting type action, a load scheduling type action, or an operation and maintenance management type action, so that the operation parameters of the plurality of devices may be adjusted according to the target action. Therefore, automatic operation and maintenance of the data processing system can be achieved, and the running state of the data processing system can gradually meet the constraint condition through multiple times of operation and maintenance of the data processing system.
In one possible implementation, the operational state of the data processing system includes one or more of a computational state, a storage state, and a network state.
In a possible implementation manner, when obtaining the constraint condition that affects the operation of the data processing system, the configuration interface may be specifically output, and in response to a configuration operation of a user on the configuration interface for the constraint condition, at least one constraint condition is obtained. Therefore, the user can realize the user-defined configuration of the constraint conditions, and the flexibility of measuring the computing power of the data processing system by the user is improved.
In one possible embodiment, the data processing system includes a data center, an availability zone (AZ), or a region.
In one possible embodiment, the data processing system is used to perform at least one type of task among big data, AI, and HPC tasks.
In a second aspect, an embodiment of the present application provides a computing power measurement method, including: obtaining a plurality of constraints for constraining the operation of a data processing system, the data processing system comprising a plurality of computing devices; collecting the operating state of the data processing system; and when the operating state of the data processing system satisfies the plurality of constraints, generating the extensible fused resource utilization (XUE) by using an artificial intelligence (AI) model, where the XUE is generated by the AI model according to the plurality of constraints and the operating state of the data processing system, and the XUE is used to evaluate the performance of the data processing system. Because the data processing system is constrained by multiple conditions in an actual application scenario, evaluating the data processing system by using its multiple constraints when evaluating the performance of an object such as a data center can reflect the real performance of the data processing system more comprehensively and improve the practicality of the performance evaluation. In addition, using an AI model to evaluate the performance of the data processing system breaks the traditional, rigid approach of evaluating system performance with a manually defined mathematical formula, so the adaptive measurement capability and the evaluation-factor extensibility for the performance of the data processing system can be improved.
In one possible embodiment, the plurality of constraints includes any plurality of the following: a load constraint, a cost constraint, an energy consumption constraint, and a security level constraint of the data processing system.
In one possible implementation, the operating state of the data processing system includes one or more of a computing state, a storage state, and a network state.
In one possible embodiment, the AI model is constructed by a deep reinforcement learning algorithm, or an automatic reinforcement learning algorithm.
In one possible embodiment, the generating the extensible fused resource utilization (XUE) by using the artificial intelligence (AI) model when the operating state of the data processing system satisfies the plurality of constraints includes: determining, by using a reinforcement learning algorithm, whether the AI model meets a convergence condition according to the operating state of the data processing system; and when the AI model meets the convergence condition, determining the XUE output by the AI model.
In one possible embodiment, the determining whether the AI model satisfies the convergence condition using a Q learning algorithm according to the operation state of the data processing system includes: acquiring a strategy output by the AI model according to the operation state of the data processing system at a first moment and the multiple constraint conditions, wherein the strategy is used for indicating an action of adjusting the operation parameters of the data processing system; adjusting the operating parameters of the data processing system according to the strategy; acquiring the operating state of the data processing system at a second moment corresponding to the adjusted operating parameters, wherein the second moment is later than the first moment; and calculating whether the AI model meets a convergence condition or not according to the running state of the data processing system at the first moment and the running state of the data processing system at the second moment by using the Q learning algorithm.
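As a hedged sketch of the Q-learning style convergence check described above: this application does not spell out the update rule, so the tabular Q form, the learning rate, the discount factor, and the TD-error-based convergence test below are all assumptions made for illustration.

```python
# Minimal tabular Q-learning step and a hypothetical convergence test.
import numpy as np

def q_update(Q, s1, a, reward, s2, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s1,a) += alpha * (r + gamma * max_a' Q(s2,a') - Q(s1,a))."""
    td_target = reward + gamma * np.max(Q[s2])
    td_error = td_target - Q[s1, a]
    Q[s1, a] += alpha * td_error
    return abs(td_error)

def has_converged(td_errors, window=20, tol=1e-3):
    """Assumed convergence condition: the last `window` TD errors are all below `tol`."""
    recent = td_errors[-window:]
    return len(recent) == window and max(recent) < tol

# Usage: Q = np.zeros((num_states, num_actions)); after each round, append the
# returned TD error to a list and stop once has_converged(errors) is True.
```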
In one possible embodiment, the adjusting the operation parameter of the data processing system according to the policy includes: selecting a target action from an action space according to the strategy, wherein the action space comprises one or more of operation setting actions, load scheduling actions and operation and maintenance management actions; and adjusting the operating parameters of the data processing system according to the target action.
In one possible implementation, the obtaining of the plurality of constraints includes: outputting a configuration interface; and obtaining the plurality of constraints in response to the configuration operation of the user on the configuration interface for the constraints.
In one possible embodiment, the data processing system comprises a data center, an availability zone, or a region.
In one possible embodiment, the data processing system is configured to perform at least one type of task among big data, artificial intelligence (AI), and high performance computing (HPC) tasks.
In a third aspect, the present application provides a computing power measurement apparatus, where the apparatus includes modules for performing the computing power measurement method in the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, the present application provides a computing power measurement apparatus, where the apparatus includes modules for performing the computing power measurement method in the second aspect or any one of the possible implementations of the second aspect.
In a fifth aspect, the present application further provides a computing device comprising: a processor and a memory; the memory is used for storing computer instructions, and the processor is used for executing the operation steps of the computational method in any implementation method of the first aspect or executing the operation steps of the computational method in any implementation method of the second aspect or the second aspect according to the computer instructions stored in the memory. It should be noted that the memory may be integrated into the processor or may be independent from the processor. The computing device may also include a bus. Wherein, the processor is connected with the memory through a bus. The memory may include a readable memory and a random access memory, among others.
In a sixth aspect, the present application provides a chip, which includes a processor, and the processor is configured to perform the operation steps of the computational method according to any one of the implementations of the first aspect or the first aspect, or perform the operation steps of the computational method according to any one of the implementations of the second aspect or the second aspect.
In a seventh aspect, the present application provides a data processing system, where the data processing system includes multiple devices, where the multiple devices include at least one of a computing device, a storage device, and a network device, and the data processing system is configured to perform the operation steps of the computational method according to any one of the first aspect and the first implementation manner, or perform the operation steps of the computational method according to any one of the second aspect and the second implementation manner.
In an eighth aspect, the present application provides a computer-readable storage medium having stored therein instructions, which, when run on a computing device, cause the computing device to perform the operational steps of the computational method according to any one of the above first aspect or any one of the above implementations of the first aspect, or to perform the operational steps of the computational method according to any one of the above second aspect or any one of the above implementations of the second aspect.
In a ninth aspect, the present application provides a computer program product comprising instructions which, when run on a computing device, cause the computing device to perform the operational steps of the computational method of any of the implementations of the first aspect or the first aspect described above, or of the computational method of any of the implementations of the second aspect or the second aspect described above.
The present application can further combine to provide more implementations on the basis of the implementations provided by the above aspects.
Drawings
FIG. 1 is a block diagram of an exemplary data processing system provided herein;
FIG. 2 is a schematic flow chart of a computing power measurement method provided herein;
FIG. 3 is a schematic diagram of an exemplary configuration interface provided herein;
FIG. 4 is a schematic diagram illustrating an AI model training process provided herein;
FIG. 5 is a schematic diagram of an algorithm framework of the AI model provided herein;
fig. 6 is a schematic structural diagram of a computing power measurement apparatus provided in the present application;
fig. 7 is a schematic diagram of a hardware structure of a computing device provided in the present application.
Detailed Description
In order to solve the problem of low accuracy of the resource utilization evaluation result of a data center, the present application provides a computing power measurement method, which determines an extensible fused resource utilization (XUE) in a dynamic measurement mode by using the constraints that affect the operation of a data processing system (such as a data center), and then uses the XUE to measure the computing power of the data processing system.
The technical solutions in the present application will be described below with reference to the drawings in the embodiments of the present application.
Referring to fig. 1, a system architecture diagram of a data processing system is provided for the present application. As shown in FIG. 1, data processing system 100 comprises a plurality of devices including at least one of a computing device, a storage device, or a network device. In fig. 1, the data processing system includes a computing device 1, a storage device 2, and a network device 3, and in practical applications, the data processing system 100 may include any number of devices or any type of devices. Illustratively, data processing system 100 may be a data center including a plurality of devices, or may be an Availability Zone (AZ) including a plurality of devices, or may be a partition (region) including a plurality of devices, or may be another type of cluster, which is not limited in this embodiment. Moreover, the computing power provided by the devices in data processing system 100 may support data processing system 100 providing one or more cloud services; alternatively, when data processing system 100 is deployed on a user side, one or more services may be provided on the user side, and so on. In practice, the data processing system 100 may be used to implement at least one type of task from big data, artificial Intelligence (AI), and High Performance Computing (HPC).
Data processing system 100 may communicate with client 200, so that a user (e.g., an administrator) can configure data processing system 100 via client 200, for example, to configure at least one constraint of the data processing system, such as an upper load limit. In practice, client 200 may be, for example, an application running on a user-side device, or a web browser provided externally by data processing system 100.
In a practical application scenario, data processing system 100 is typically constrained by one or more conditions. For example, the energy consumption of the IT devices in data processing system 100 may be required not to exceed an upper energy consumption limit, the load on data processing system 100 may be required not to exceed a maximum load, data processing system 100 may be required to meet a certain security level, or a daily cost upper limit may be set for data processing system 100. Moreover, these constraints typically affect the resource allocation and operation of data processing system 100, and thereby affect the computing power output (or effective computing power) of data processing system 100. In this case, if the computing power of data processing system 100 is measured in the PUE manner, the computed result often cannot truly reflect the resource utilization of data processing system 100, which leads to low accuracy in evaluating its resource utilization.
Based on this, the embodiment of the present application provides a computing power measurement method, which is performed by the computing power measurement apparatus 101 in fig. 1. Specifically, the computing power measurement apparatus 101 determines, in a dynamic measurement mode, an extensible fused resource utilization (XUE) for measuring the computing power of data processing system 100 according to at least one constraint affecting the operation of data processing system 100 and the operating state of data processing system 100, where the dynamic measurement mode indicates the manner of determining the XUE corresponding to a plurality of devices in data processing system 100 by using the at least one constraint. Further, the computing power measurement apparatus 101 may also feed back the XUE to client 200 so as to present it to the user through client 200.
Because the computing power measurement apparatus 101 measures the computing power of data processing system 100 according to the operating state of data processing system 100 and the constraints affecting its operation, the determined XUE reflects the real performance of data processing system 100 under the influence of those constraints, which improves the practical value of the computing power result (namely the XUE). Further, when data processing system 100 is limited by multiple constraints, the computing power measurement apparatus 101 measures the computing power of data processing system 100 based on all of those constraints, which avoids the limitation and one-sidedness of measuring computing power along a single dimension, more comprehensively reflects the real computing power of data processing system 100, and improves the reliability of the computing power result.
Moreover, when the computing power is specifically measured using an AI model, the traditional, rigid approach of measuring computing power with a manually defined formula can be broken, and the closed, static limitations of formula-based measurement of the computing power of data processing system 100 are removed, so that the adaptive measurement capability and the measurement-factor extensibility for the computing power of data processing system 100 can be improved.
Illustratively, the computing power measurement apparatus may be implemented by software, for example, by at least one of a virtual machine, a container, a computing engine, and the like. In this case, the computing power measurement apparatus 101 may run on computing device 1 in data processing system 100, as shown in fig. 1, or may run on another computing device separately deployed within data processing system 100.
Alternatively, the computing power measurement apparatus may be implemented by a computing device including a processor, and the computing device may be, for example, computing device 1 in fig. 1, or a computing device separately deployed in data processing system 100, which is not limited in this embodiment. The processor in the computing device may be a CPU, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), a system on chip (SoC), a software-defined infrastructure (SDI) chip, an artificial intelligence (AI) chip, a data processing unit (DPU), or any combination thereof. Moreover, the computing power measurement apparatus 101 may include one or more processors of one or more types, and the number and types of processors may be set according to the service requirements of the actual application, which is not limited in this embodiment.
In actual deployment, the computing power measurement apparatus may be deployed in data processing system 100, such as the computing power measurement apparatus 101 shown in fig. 1; it may be deployed as software on a computing device in data processing system 100, or may be an independently deployed hardware device in data processing system 100. Alternatively, the computing power measurement apparatus may be deployed outside data processing system 100 and perform the computing power measurement of data processing system 100 through data interaction with data processing system 100.
It should be noted that the data processing system 100 shown in FIG. 1 is only used as an example, and in practical applications, the data processing system 100 is not limited to the example shown in FIG. 1. For example, in other possible implementations, the data processing system 100 may include a greater number of devices or may include hardware devices with other functions, which is not limited by the embodiment.
For the sake of understanding, the following describes an embodiment of the computation method provided in the present application with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic flow chart of a computational method provided in the present application, which may be applied to the data processing system 100 shown in fig. 1, or may be applied to other applicable systems. For convenience of explanation, the present embodiment is exemplified by being applied to the application scenario shown in fig. 1.
The computing power measurement method shown in fig. 2 may be executed by the computation workload device 101 in fig. 1, and based on the data processing system 100 shown in fig. 1, the method may specifically include the following steps:
s201: the computation workload device 101 obtains at least one constraint condition sent by the client 200, where the at least one constraint condition is used to constrain the operation of the data processing system 100, and the data processing system 100 includes a plurality of devices, and the plurality of devices includes at least one of a computing device, a storage device, or a network device.
In this embodiment, the constraint condition refers to a factor that affects the operation of the data processing system 100, and also a factor that affects the computational power metric of the data processing system 100, and includes different types of constraint conditions such as energy consumption, load, security level, and cost, or may be other types of constraint conditions, which is not limited in this embodiment.
Wherein, the constraint condition of the data processing system 100 in terms of energy consumption (hereinafter referred to as the energy consumption constraint condition) is used to constrain that the energy consumption of the data processing system 100 can not exceed the upper limit of energy consumption. For example, the energy consumption constraint may be 15000 kilowatt-hours, which is used to constrain the daily power consumption of data processing system 100 to not exceed 15000 kilowatt-hours. Accordingly, with limited power consumption, the devices within data processing system 100 are typically limited in operation, which may affect the computational output of data processing system 100. For example, some devices of data processing system 100 may remain operating at a lower power level due to power consumption limitations, resulting in lower computational power (relative to computational power to remain operating at a higher power level) being output by the devices.
A constraint on the load of the data processing system 100 (hereinafter simply referred to as a load constraint) for constraining the load of the data processing system 100 not to exceed an upper limit of the load. The load of the data processing system 100 may be, for example, tasks, processes, threads, and the like executed by the data processing system 100, so that the number of tasks, processes, or threads executed by the data processing system 100 in a period of time may not exceed a preset upper limit by using a load constraint condition. Accordingly, data processing system 100 may be limited in the amount of load that may result in a limited amount of computing power being used to process the load, thereby affecting the computing power output of data processing system 100.
The constraints on the security level of data processing system 100 (hereinafter referred to as security level constraints) are used to constrain the security level that data processing system 100 achieves at runtime. Illustratively, the security levels may be divided into five levels, level one to level five, and different security levels correspond to different configurations of data processing system 100. Generally, the higher the security level, the more constraints are placed on the resource allocation of the data processing system, and thus the greater the constraint on its computing power output. For example, when the security level of data processing system 100 is level two, data processing system 100 may not need to encrypt or decrypt communication data, while when the security level of data processing system 100 is level three (or higher), data processing system 100 may need to encrypt data before transmission and decrypt received data. In that case, data processing system 100 consumes a portion of its computing power on encrypting and decrypting the communicated data, so the computing power it can provide is lower.
The data processing system's cost constraints (hereinafter referred to as cost constraints) are used to constrain that the costs incurred by the data processing system 100 during operation cannot exceed the upper cost limit. For example, the fee generated by the data processing system 100 during operation may be, for example, an electric fee. Accordingly, the operation of multiple devices in data processing system 100 may be cost-limited, which may result in limitations in device operating power or the number of devices that can be operated, and may affect the computing power that data processing system 100 can output.
It should be noted that the above constraints are only used as a few exemplary illustrations, and in practical applications, the data processing system 100 may include any one or more of the above constraints, or may include other types of constraints, which is not limited by the embodiment.
In one possible embodiment, the constraints corresponding to data processing system 100 may be configurable by a user. Specifically, the computation power device 101 may provide a Graphical User Interface (GUI) to the user, where the GUI may also be referred to as a configuration interface. The user may access the configuration interface through the client 200 to configure the constraints. Accordingly, the computation power device 101 may obtain at least one constraint configured by the user according to the configuration operation of the user on the configuration interface for the constraint. For example, computational workload device 101 may present a configuration interface, as shown in fig. 3, to the user via client 200, where the configuration interface may include constraints for 4 aspects of energy consumption, load, security level, and cost of data processing system 100, such that the user may select two or more constraints from the 4 constraints. Moreover, for each constraint condition selected by the user, the computation workload device 101 may further present one or more candidate items corresponding to the constraint condition on the configuration interface, so that the user may select the candidate item on the configuration interface to complete the configuration of the constraint condition. Or, for the constraint condition selected by the user, the user may also input specific information of the constraint condition on the configuration interface, so as to implement configuration of the constraint condition. For example, for an energy consumption constraint, a user may input an energy consumption upper limit in an input box corresponding to the energy consumption constraint, so as to configure the data processing system 100 that the energy consumption does not exceed the energy consumption upper limit.
In practical application, the computation power device 101 may also obtain the at least one constraint condition in other manners, for example, a user may import a configuration file including the at least one constraint condition to the computation power device 101, so as to implement configuration of the data processing system 100, which is not limited in this embodiment.
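As a purely illustrative sketch of what such an imported constraint configuration might contain; the field names and values below are hypothetical and are not defined by this application.

```python
# Hypothetical constraint configuration covering the four constraint types above.
constraints = {
    "energy_consumption": {"daily_kwh_upper_limit": 15000},   # energy consumption constraint
    "load":               {"max_concurrent_jobs": 2000},      # load constraint
    "security_level":     {"level": 3},                       # security level constraint
    "cost":               {"daily_cost_upper_limit": 50000},  # cost constraint
}
```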
To this end, in this embodiment, the computation workload device 101 may measure, in a dynamic measurement mode, the computing power of data processing system 100 under the constraints according to the at least one constraint. The dynamic measurement mode indicates the manner in which the computing power of data processing system 100 is measured by using the constraints; for example, the computing power may be measured using an AI model, or using a manually defined formula, which is not limited in this embodiment. For ease of understanding, the following describes in detail the process of measuring the computing power of data processing system 100 using the AI model.
Illustratively, the AI model may be, for example, a deep learning model constructed based on a deep reinforcement learning algorithm or an automatic reinforcement learning algorithm, or may be another type of AI model, which is not limited in this embodiment.
S202: the calculation capacity device 101 samples the data processing system 100 to obtain the operation state S of the data processing system 100 at time t t The operating state S t Which indicates the operating conditions of a number of devices in data processing system 100 at time t.
S203: calculation of force quantity device 101 versus running state S t Is digitally processed to facilitate an operation state S of data processing system 100 t Is converted intoInput data for AI model.
The operating status of the data processing system 100 is used to indicate the operating status of a plurality of devices included in the data processing system 100. The operating state of data processing system 100 may include, for example, one or more of a computing state, a storage state, and a network state of a plurality of devices in data processing system 100.
The computing state may be used to embody the computing power of the plurality of devices in data processing system 100. For example, the computing state of data processing system 100 may be characterized by one or more of the utilization of at least one type of processor in data processing system 100, OPS, and FLOPS. A processor in data processing system 100 may be a processor with general-purpose computing power or a processor with heterogeneous computing power. A processor with general-purpose computing power may be, for example, a CPU, including a CPU of the X86 architecture, the ARM architecture, the RISC-V architecture, or the like. A processor with heterogeneous computing power may be, for example, a GPU, an FPGA, or an ASIC, where the ASIC may be a TPU, an NPU, or another type of processor. When the computing state is digitized, for example, vectorization may be performed on the computing state to obtain a vectorized expression corresponding to the computing state. Illustratively, as shown in fig. 4, the digitized computing state of data processing system 100 may be represented by a vector c = (c_1, c_2, ..., c_7), where the vector c may include 7 components c_1 to c_7, and each component indicates a parameter of one type of processor, such as at least one of specification, utilization, OPS, or FLOPS, or another parameter.
The storage state may be used to embody the storage capability of data processing system 100. For example, the storage state of data processing system 100 may be characterized by one or more of the storage capacity and the read/write speed (read speed, write speed) of data processing system 100. When the storage state is digitized, for example, vectorization may be performed on the storage state to obtain a vectorized expression corresponding to the storage state. Illustratively, as shown in fig. 4, the digitized storage state of data processing system 100 may be represented by a vector d = (d_1, d_2), where d_1 indicates the storage capacity of data processing system 100 and d_2 indicates the read/write speed of data processing system 100. In other embodiments, the vector d may also include more components, such as a component d_3 for indicating cache utilization or a component d_4 for indicating available storage space, which is not limited in this embodiment.
The network state may be used to embody the data transmission capability of data processing system 100. For example, the network state of data processing system 100 may be characterized by one or more of the bandwidth, delay (e.g., data transmission latency), packet loss rate, and jitter rate of data processing system 100. When the network state is digitized, for example, vectorization may be performed on the network state to obtain a vectorized expression corresponding to the network state. Illustratively, as shown in fig. 4, the digitized network state of data processing system 100 may be represented by a vector n = (n_1, n_2, n_3, n_4), where n_1 indicates the bandwidth, n_2 indicates the delay, n_3 indicates the packet loss rate, and n_4 indicates the jitter rate. In other embodiments, the vector n may also include more components, such as a component n_5 for indicating the throughput of data processing system 100, which is not limited in this embodiment.
After obtaining the vectorized expressions corresponding to the computing state, the storage state, and the network state, the computation workload device 101 may further compute the product of a preset weight vector w and (c, d, n), so as to obtain a tensor result S corresponding to the state space; as shown in fig. 4, S is the AI model input data obtained by digitizing the operating state of data processing system 100.
It should be noted that the above implementation manners for the computation state, the storage state, and the network state are only used as some exemplary illustrations, and various operation states of the data processing system 100 may be represented in other applicable manners in a practical application scenario. Alternatively, the operational state of data processing system 100 may include other types of states such as load states for characterizing the load conditions (e.g., number of tasks, number of threads, number of processes, etc.) of multiple devices.
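The digitization described above can be sketched as follows; the component values, the uniform weight vector, and the simple element-wise weighting are illustrative assumptions rather than the exact computation used by this application.

```python
# Sketch: vectorize the computing, storage, and network states and weight them
# into the state tensor S_t that is fed to the AI model.
import numpy as np

c = np.array([0.65, 0.40, 0.72, 0.10, 0.55, 0.30, 0.80])  # c_1..c_7: per-processor-type features (utilization, OPS, FLOPS, ...)
d = np.array([0.58, 0.35])                                 # d_1: storage capacity in use, d_2: read/write speed (normalized)
n = np.array([0.70, 0.12, 0.01, 0.03])                     # n_1..n_4: bandwidth, delay, packet loss rate, jitter rate

state = np.concatenate([c, d, n])       # (c, d, n)
w = np.ones_like(state) / state.size    # preset weight vector w (assumed uniform here)
S_t = w * state                         # state tensor S_t, the AI model input
```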
In practical applications, the computation workload device 101 may periodically collect the operating status of the data processing system 100, so as to continuously monitor whether the operating status of the data processing system 100 satisfies the constraint condition configured by the user. Alternatively, the computation power device 101 may also collect the operating status of the data processing system 100 in the presence of a trigger event, for example, a user triggers the computation power device 101 to collect the operating status of the data processing system 100.
In practical applications, the computation workload device 101 may be configured with an auto sampler, and the auto sampler may be implemented by software or hardware. Specifically, when the computation workload device 101 is implemented by software, the auto sampler may be implemented by software; when the computation workload means 101 is implemented by hardware, the auto sampler may be implemented by hardware. Also, an auto-sampler may be capable of automatically acquiring one or more operating states of data processing system 100.
In a specific implementation, the computation workload device 101 may sample the operating state of data processing system 100 periodically or based on a trigger event by using the auto sampler, so as to obtain the operating state S_t of data processing system 100 at time t. The sampled data forms an observation snapshot of data processing system 100 at the current time, forming an observation space. After the user finishes configuring the constraints, the operating state of data processing system 100 may not yet satisfy the constraints; for example, the energy consumption generated by data processing system 100 in its current operating state may exceed the upper energy consumption limit indicated by the energy consumption constraint. Therefore, the computation workload device 101 may sample the operating state of data processing system 100 and adjust the plurality of devices in data processing system 100 by using the constraints configured by the user, so that the operating state of data processing system 100 finally satisfies the constraints, as described in detail in the following steps.
S204: the calculation capacity device 101 calculates the running state S after digital processing t Inputting the data into an AI model to obtain an automatic operation and maintenance strategy pi output by the AI model t The automated operation and maintenance strategy is pi t For instructing operations to be performed on at least one device among a plurality of devices included in data processing system 100, in particular, adjusting an operating parameter of the device.
As shown in fig. 5, the AI model may include a plurality of networks, such as an actor network and a critic network, and the actor network and the critic network may share an input layer and hidden layers. The input layer is configured to receive the operating state of data processing system 100 that is input into the AI model, and the hidden layers are configured to perform corresponding operations on the input data, such as convolution operations, to extract features from the input data. The actor network and the critic network each have their own output layer. The output layer of the actor network is used to output the corresponding automated operation and maintenance policy π_t according to the features extracted by the hidden layers, and it can also output, through a normalization (softmax) layer, a value in the range [0, 1], namely the XUE used to measure the computing power of data processing system 100. The output layer of the critic network is used to output, according to the features extracted by the hidden layers, the instant reward value R_t corresponding to the action A_t indicated by the automated operation and maintenance policy π_t.
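A minimal sketch of the shared-trunk actor-critic structure described above, written in PyTorch. The layer sizes, the use of fully connected layers instead of convolutions, the sigmoid head for the XUE value, and all names are assumptions; the state dimension 13 simply matches the 7 + 2 + 4 state components from fig. 4.

```python
# Actor and critic sharing an input layer and hidden layers, as in fig. 5.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim: int = 13, action_dim: int = 10, hidden_dim: int = 64):
        super().__init__()
        # Shared input + hidden layers (feature extraction).
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)  # actor: policy pi_t over the action space
        self.xue_head = nn.Linear(hidden_dim, 1)              # actor: scalar squashed to [0, 1] as the XUE
        self.value_head = nn.Linear(hidden_dim, 1)            # critic: instant reward / value estimate

    def forward(self, state: torch.Tensor):
        features = self.trunk(state)
        policy = torch.softmax(self.policy_head(features), dim=-1)
        xue = torch.sigmoid(self.xue_head(features))
        value = self.value_head(features)
        return policy, xue, value

# Usage: feed the digitized state tensor S_t and read out pi_t, the XUE, and the critic value.
model = ActorCritic()
policy, xue, value = model(torch.randn(1, 13))
```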
S205: the dynamics calculation device 101 selects the strategy pi from the action space t Matching actions A t
Wherein a plurality of actions may be included in the action space, different actions being used to adjust different operating parameters of at least one of the plurality of devices included in the data processing system 100.
Illustratively, the action space may include one or more of an operation setting class action, a load scheduling class action, and an operation and maintenance management class action, and each class of action may include one or more actions.
For example, the operation setting type actions may include an action of adjusting the security level, an action of adjusting resource quotas, an action of adjusting energy consumption, and an action of setting the cost. The action of adjusting the security level adjusts the security level of data processing system 100 at the time of S_t to another level; accordingly, when adjusting the security level, the computation workload device 101 may adjust the operating parameters associated with the security level in data processing system 100. The action of adjusting resource quotas may specifically be an action of adjusting the sizes of the computing, storage, and network resources in data processing system 100. The action of adjusting energy consumption, namely the action of increasing or decreasing the energy consumption of data processing system 100, may, for example, reduce energy consumption by decreasing the number of running processors. The action of setting the cost, specifically the action of adjusting the cost consumed by data processing system 100, may, for example, reduce the operating power of the IT devices in data processing system 100 to reduce their energy consumption, thereby reducing the cost of data processing system 100.
The load scheduling type actions may include submitting a job, suspending a job, canceling a job, and the like. For example, when the load of data processing system 100 exceeds the upper load limit indicated by the load constraint, the computation workload device 101 may cancel all jobs of a portion of users, suspend a portion of the jobs, or the like.
The operation and maintenance management actions may include expansion or contraction of computing resources, expansion or contraction of storage resources, and expansion or contraction of network resources. The expansion or the reduction of the computing resources refers to increasing the computing resources or decreasing the computing resources, for example, the reduction of the computing resources may be implemented by closing part of the computing devices. The expansion or reduction of the storage resource refers to increasing the storage resource or decreasing the storage resource, for example, the reduction of the storage resource can be realized by closing part of the storage device. The expansion or contraction of the network resource refers to increasing or decreasing the network resource, for example, the expansion of the network resource may be achieved by increasing the network bandwidth, or the contraction of the network resource may be achieved by closing part of the network card.
It should be noted that the above actions are only used as some exemplary illustrations, and in an actual application scenario, the action included in the action space may be any action in the above multiple actions, or may include other actions, which is not limited by the embodiment.
The automated operation and maintenance policy π_t output by the AI model indicates a specific action for adjusting an operating parameter of data processing system 100, so the computation workload device 101 can find, among the actions included in the action space, the action A_t that matches the policy π_t.
Illustratively, the computation power measurement apparatus 101 may vectorize the above-mentioned actions. As shown in fig. 4, the operation setting type actions may be represented by a vector b = (b_1, b_2, b_3, b_4), where b_1 indicates the security level, b_2 indicates the resource quota, b_3 indicates the energy consumption quota, and b_4 indicates the cost (tariff) setting. The load scheduling type actions may be represented by a vector W = (W_1, W_2, W_3), where W_1 indicates submitting a load job, W_2 indicates suspending a load job, and W_3 indicates canceling a load job. The operation and maintenance management type actions may be represented by a vector o = (o_1, o_2, o_3), where o_1 indicates computing resource expansion or contraction, o_2 indicates storage resource expansion or contraction, and o_3 indicates network resource expansion or contraction. After obtaining the vectorized representations of the operation setting type actions, the load scheduling type actions, and the operation and maintenance management type actions, the computation power measurement apparatus 101 may further compute the product of a preset weight vector λ and (b, W, o) to obtain a tensor result A corresponding to the action space, as shown in fig. 4. The computation power measurement apparatus 101 may then perform a vector calculation between the automated operation and maintenance policy π_t output by the AI model (in its vectorized representation) and the tensor result A, so as to determine the action A_t in the action space that matches the policy π_t.
S206: force computation metric device 101 computes force based on action A t The operating parameters of at least one of the plurality of devices included in data processing system 100 are adjusted.
S207: the calculation capacity device 101 acquires a new operation state S of a plurality of devices in the data processing system 100 at a time t +1 based on the adjusted operation parameters t+1
S208: calculation of force quantity device 101 is based on running state S t Operating state S t+1 And at least one constraint condition, updating parameters in the AI model.
In a specific implementation, the critic network in the AI model may use the at least one constraint condition as a regularization term of the AI model, and update the parameters (and hyperparameters) of the actor network according to the operating state S_t, the operating state S_{t+1}, and the at least one constraint condition.
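One way to read "constraint condition as a regularization term" is to add a penalty for constraint violation to the critic's temporal-difference loss when the model is updated. The sketch below assumes a single power-consumption constraint, a squared TD loss, and a fixed penalty coefficient; none of these specifics come from the patent.

def constrained_critic_loss(q_pred, q_target, state, constraints, penalty=10.0):
    """Temporal-difference loss plus a regularization term that penalizes
    constraint violations in the observed operating state. The penalty form
    and coefficient are assumptions for illustration only."""
    td_loss = (q_pred - q_target) ** 2
    # e.g. constraints = {"power_upper": 500.0}, state = {"power": 520.0}
    violation = max(0.0, state.get("power", 0.0) - constraints.get("power_upper", float("inf")))
    return td_loss + penalty * violation

loss = constrained_critic_loss(
    q_pred=0.8, q_target=1.0,
    state={"power": 520.0},
    constraints={"power_upper": 500.0},
)
print(loss)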
In this way, the computation power measurement apparatus 101 completes one round of iterative training of the AI model. After the automated operation and maintenance policy π_t is executed, the operating state of the data processing system 100 may still fail to satisfy the constraint conditions; for example, the power consumption generated by the data processing system 100 in its current operating state may still exceed the upper limit indicated by the power consumption constraint. At this point, the XUE output by the AI model based on the state space, the action space, and the instantaneous reward value output by the critic network (as shown in fig. 4) does not reflect the actual computing power that the data processing system 100 ultimately has under the at least one constraint condition. Therefore, the computation power measurement apparatus 101 may continue to iteratively train the AI model according to the above process, and continue to adjust the operating parameters of the devices in the data processing system 100 according to the automated operation and maintenance policy newly output by the AI model, until the operating state of the data processing system 100 satisfies the constraint conditions. For example, in the next iteration, the computation power measurement apparatus 101 may collect the operating state S_{t+1} of the data processing system 100 at time t+1, input the digitized operating state S_{t+1} into the AI model, adjust the operating parameters of the data processing system 100 according to the action A_{t+1} inferred by the AI model, and update the parameters of the actor network in the AI model. In addition, the computation power measurement apparatus 101 may record, in each iteration, the operating state S, the action A corresponding to the operating state S, and the instantaneous reward value R corresponding to the action A, so as to obtain a sequence of S, A, and R over multiple rounds of iterative training, such as the sequence {S_t, A_t, R_t, S_{t+1}, A_{t+1}, R_{t+1}, S_{t+2}, A_{t+2}, R_{t+2}, ...}.
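The iteration described above can be summarized in the following hedged training-loop sketch. The objects system and model and their methods (observe, infer, apply, reward, update, satisfies) are hypothetical names standing in for the data processing system 100 and the AI model; only the overall flow (observe S_t, execute A_t, record R_t, repeat until the constraints are met) follows the text.

def train_until_constraints_met(model, system, constraints, max_iters=100):
    """Iteratively train the AI model until the system's operating state
    satisfies the constraint conditions, recording {S_t, A_t, R_t, ...}."""
    trajectory = []                            # recorded sequence of (S, A, R)
    s_t = system.observe()                     # digitized operating state S_t
    for _ in range(max_iters):
        a_t = model.infer(s_t, constraints)    # automated O&M action A_t
        system.apply(a_t)                      # adjust the operating parameters
        s_next = system.observe()              # new operating state S_{t+1}
        r_t = model.reward(s_t, a_t, s_next)   # instantaneous reward from the critic
        trajectory.append((s_t, a_t, r_t))
        model.update(s_t, a_t, r_t, s_next, constraints)
        if system.satisfies(constraints):      # measurement condition met
            break
        s_t = s_next
    return trajectory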
In this manner, through one or more iterations of the AI model, the data processing system 100 can ultimately satisfy the at least one user-configured constraint condition at runtime. In general, when the data processing system 100 satisfies the constraint conditions during operation, the AI model may be in a converged state; at this time, the computation power measurement apparatus 101 adjusts the data processing system 100 based on the automated operation and maintenance policy output by the AI model at each iteration, so that the operating state of the data processing system 100 remains substantially stable. Therefore, in this embodiment, during iterative training of the AI model, the computation power measurement apparatus 101 may determine whether the operating state S_t and the operating state S_{t+1} satisfy a metric condition. When they satisfy the metric condition, this characterizes that the data processing system 100 can satisfy the constraint conditions at runtime, and the real computing power of the data processing system 100 can be measured at this point. When the operating state S_t and the operating state S_{t+1} do not satisfy the metric condition, this characterizes that the data processing system 100 does not satisfy the constraint conditions at runtime, and the operating parameters of the devices in the data processing system 100 may continue to be adjusted. In this embodiment, the metric condition may specifically be whether the AI model satisfies a convergence condition, which in turn determines whether the operation of the data processing system 100 satisfies the constraint conditions, as described in the following steps. In other embodiments, the metric condition may be another applicable condition, which is not limited in this embodiment.
S209: in the process of iteratively training the AI model, the computation workload device 101 determines whether the AI model satisfies the convergence condition based on a reinforcement learning algorithm.
As an implementation example, the computation power measurement apparatus 101 may specifically adopt a Q-learning algorithm to determine whether the AI model satisfies the convergence condition. Specifically, the computation power measurement apparatus 101 may randomly sample the recorded sequence of S, A, and R. Assuming that a randomly sampled quadruple is (S_t, A_t, R_t, S_{t+1}), the computation power measurement apparatus 101 may input the operating state S_t into the AI model, calculate, through the actor network, the value function Q value corresponding to each executable action in the action space under the operating state S_t, and then have the critic network optimize these Q values through the Bellman equation shown in the following formula (1) and calculate the maximum value function Q value.
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_t + γ·max Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]    Formula (1)
Here, the left-hand Q(S_t, A_t) is the updated value function Q value, from which the maximum value function Q value is determined; the right-hand Q(S_t, A_t) is the value function Q value calculated based on the operating state S_t and the action A_t; α is the learning rate; R_t is the instantaneous reward value corresponding to the action A_t; and γ is the discount factor, which indicates how strongly future time steps affect the value.
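Formula (1) is the standard tabular Q-learning update and can be implemented directly. The sketch below uses a Python dict as the Q table and example values for α and γ; the state and action labels are purely illustrative.

def q_update(Q, s_t, a_t, r_t, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step following formula (1):
    Q(S_t, A_t) <- Q(S_t, A_t) + alpha * [R_t + gamma * max_a Q(S_{t+1}, a) - Q(S_t, A_t)].
    Q is a dict keyed by (state, action); alpha and gamma are example values."""
    best_next = max(Q.get((s_next, a), 0.0) for a in actions)
    old = Q.get((s_t, a_t), 0.0)
    Q[(s_t, a_t)] = old + alpha * (r_t + gamma * best_next - old)
    return Q[(s_t, a_t)]

Q = {}
print(q_update(Q, s_t="S0", a_t="scale_out", r_t=1.0, s_next="S1",
               actions=["scale_out", "scale_in", "suspend_job"]))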
After the maximum value function Q value is determined, the operating state and the action corresponding to the maximum value function Q value may be further determined. Assuming that the operating state is S_t and the action is A_t, the computation power measurement apparatus 101 may determine, based on the operating state S_t and the action A_t, whether the AI model converges.
For example, the computation power measurement apparatus 101 may calculate θ_t from the objective function shown in the following formula (2):
θ_t = argmin_θ L(Q(S_t, A_t; θ), R_t + γ·Q(S_{t+1}, A_{t+1}; θ))    Formula (2)
When θ_t is smaller than a preset threshold, the computation power measurement apparatus 101 may determine that the AI model satisfies the convergence condition; when θ_t is greater than or equal to the preset threshold, the computation power measurement apparatus 101 may determine that the AI model does not satisfy the convergence condition. In the latter case, the computation power measurement apparatus 101 may return to the foregoing steps (acquiring the operating state, inferring and executing an action, and updating the model parameters) to continue the iterative training of the AI model.
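A minimal sketch of this convergence test, assuming θ_t is taken as the value of a squared loss between Q(S_t, A_t) and the target R_t + γ·Q(S_{t+1}, A_{t+1}) (one plausible reading of formula (2), which only names a generic loss L) and is compared against a preset threshold:

def td_error(Q, s_t, a_t, r_t, s_next, a_next, gamma=0.9):
    """Loss value in the spirit of formula (2): squared difference between
    Q(S_t, A_t) and the target R_t + gamma * Q(S_{t+1}, A_{t+1}).
    The squared form is an assumption; the patent only names a generic loss L."""
    target = r_t + gamma * Q.get((s_next, a_next), 0.0)
    return (Q.get((s_t, a_t), 0.0) - target) ** 2

def converged(theta_t, threshold=1e-3):
    """Convergence test described in the text: theta_t below a preset threshold."""
    return theta_t < threshold

Q = {("S0", "scale_out"): 0.95, ("S1", "scale_out"): 1.0}
theta_t = td_error(Q, "S0", "scale_out", 0.1, "S1", "scale_out")
print(theta_t, converged(theta_t))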
In actual application, the computation power calculating apparatus 101 may also use other types of reinforcement learning algorithms to determine whether the AI model satisfies the convergence condition, which is not limited in this embodiment.
S210: when the AI model satisfies the convergence condition, the computation power amount means 101 takes the XUE output by the AI model as an index for evaluating the computation power of the data processing system 100, and feeds it back to the client 200.
As shown in fig. 4, the AI model may include a softmax layer, and the output of the softmax layer is a value in [0, 1], which may be used to evaluate the converged resource utilization of the data processing system 100 when resources such as computing, storage, and network are utilized to the maximum extent.
In this embodiment, the value range of the XUE may be [0, 100%], which can reflect the converged resource utilization of the data processing system 100 under various constraints such as energy consumption, load, cost, and security.
Further, the computation power measurement apparatus 101 may feed back the XUE output by the AI model to the client 200, so that the client 200 presents the XUE to the user. In this way, scheduling and allocation of resources for the data processing system 100 can be carried out based on the XUE measurement result. For example, when the XUE value is less than 50%, the computing power that the data processing system 100 can provide is low, and the user may reduce the amount or rate of services scheduled to the data processing system 100, so as to prevent part of the services from being blocked by insufficient computing power and the processing efficiency of the services from being affected. Conversely, when the XUE value is greater than 75%, the computing power that the data processing system 100 can provide is high, and the user may increase the amount or rate of services scheduled to the data processing system 100, so as to leverage the high computing power of the data processing system 100 to process more services or to speed up service processing.
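For example, a client-side dispatch policy built on the 50% and 75% thresholds mentioned above could look like the following sketch; the scaling factors applied to the dispatch rate are illustrative assumptions.

def adjust_dispatch_rate(xue, current_rate):
    """Example client-side policy based on the XUE thresholds in the text
    (50% and 75%); the scaling factors are illustrative assumptions."""
    if xue < 0.50:          # low available computing power: slow down dispatch
        return current_rate * 0.5
    if xue > 0.75:          # high available computing power: speed up dispatch
        return current_rate * 1.5
    return current_rate     # otherwise keep the current dispatch rate

print(adjust_dispatch_rate(xue=0.42, current_rate=100))   # -> 50.0
print(adjust_dispatch_rate(xue=0.80, current_rate=100))   # -> 150.0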
Therefore, the computation power measurement apparatus 101 can not only measure the computing power of the data processing system 100 based on the AI model and the constraint conditions, but also, when the data processing system 100 is subject to multiple constraint conditions, generate the XUE based on those constraint conditions, so that the comprehensive computing power of the data center is reflected more completely, objectively, and dynamically, improving the practicality of computing power evaluation. Moreover, the operating state and constraint conditions on which the XUE is based may be customized according to the operating parameters of the data processing system 100 by extending information components (for example, extending the constraint conditions or the state space), thereby flexibly adapting to the changing requirements of the data processing system 100 as it evolves.
During operation, the computing power of the data processing system 100 may change due to dynamic changes in its load, computing resources, storage resources, network resources, and the like, for example an increase in load or an increase in computing resources. Therefore, in practical applications, the computation power measurement apparatus 101 may continuously monitor the operating state of the data processing system 100, output the XUE of the data processing system 100 in real time using the AI model according to that operating state, and feed it back to the user through the client 200, so that the user can view the real computing power of the data processing system 100 in real time.
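A hedged sketch of such continuous measurement is shown below: the operating state is collected periodically, the (already converged) AI model outputs the current XUE, and the value is pushed to the client. The polling interval and all object interfaces (observe, infer_xue, report) are hypothetical.

import time

def monitor_xue(system, model, client, interval_s=60):
    """Continuously measure XUE: collect the operating state, let the converged
    AI model output the current XUE, and push it to the client."""
    while True:
        state = system.observe()       # current operating state of the devices
        xue = model.infer_xue(state)   # XUE value in [0, 100%]
        client.report(xue)             # presented to the user in real time
        time.sleep(interval_s)         # polling interval (assumed)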
In addition, in this embodiment, the XUE for measuring the computing power of the data processing system 100 is mainly output by an AI model, but in actual application, the computing power measuring apparatus 101 may measure the computing power of the data processing system 100 based on other measurement methods, such as measuring by a manually defined formula, and the like, which is not limited in this embodiment.
It should be noted that this embodiment mainly describes how the computation power measurement apparatus 101 determines the XUE of the data processing system 100 under the condition that the constraint conditions are satisfied, by iterating the AI model multiple times, so as to complete the computing power measurement of the data processing system 100. In other possible embodiments, the AI model may be trained for the data processing system 100 in advance, so that the computation power measurement apparatus can directly use the AI model to perform inference according to the operating state of the data processing system 100 and output an XUE value measuring the computing power of the data processing system 100. For example, in an actual application scenario, there may be multiple data centers having the same constraint conditions and similar amounts of computing, storage, and network resources; after iterative training of the AI model has been completed on one of these data centers, the AI model may be distributed to the remaining data centers. In this way, each of the remaining data centers can directly measure its computing power based on the AI model.
In further embodiments, after obtaining the XUE value for measuring the computing power of the data processing system 100, the computation power measurement apparatus 101 may also notify the data processing system 100 of the XUE. In this way, the data processing system 100 can perform operations such as resource scheduling and resource adjustment based on the XUE value. For example, when the XUE value is large (e.g., greater than 85%), characterizing that the computing power of the data processing system 100 is high, the data processing system 100 may schedule more resources for the tasks currently being processed, so as to improve task processing efficiency. When the XUE value is small (e.g., less than 40%), characterizing that the computing power of the data processing system 100 is low, the data processing system 100 may reduce the resources allocated to the currently processed tasks by rescheduling resources, so that the data processing system 100 can process a larger number of tasks with its limited computing power. In practical applications, the data processing system 100 may also execute other resource scheduling policies according to the XUE value, which is not limited in this embodiment.
It should be noted that other reasonable combinations of steps that can be conceived by those skilled in the art based on the above description also fall within the protection scope of the present application. Furthermore, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions involved are not necessarily all required by the present application.
The computing power measurement method provided by the embodiments of the present application has been described above with reference to fig. 1 to 5. Next, the computation power measurement apparatus provided by the embodiments of the present application, and the computing device implementing the computation power measurement apparatus, are described with reference to the accompanying drawings.
Referring to fig. 6, fig. 6 is a schematic diagram of a computation power measurement apparatus. The computation power measurement apparatus 600 shown in fig. 6 is applicable to a data processing system. As shown in fig. 6, the computation power measurement apparatus 600 includes:
an obtaining module 601, configured to obtain constraint conditions that affect operation of the data processing system; acquiring a first operating state of the data processing system, wherein the first operating state is used for indicating operating conditions of a plurality of devices included in the data processing system, and the plurality of devices include at least one of computing devices, storage devices or network devices;
a metric module 602, configured to determine a scalable converged resource utilization XUE of the multiple devices according to a dynamic metric manner and the first operating state, where the dynamic metric manner is used to indicate a manner in which the XUE of the multiple devices is determined by using the constraint condition.
In a possible implementation manner, the first operating state is an operating state after an automated operation and maintenance policy is executed on the plurality of devices;
the obtaining module 601 is further configured to obtain a second operating status of the data processing system before obtaining the first operating status of the data processing system;
the computation metric device 600 further includes:
an automation operation and maintenance module 603, configured to execute an automation operation and maintenance policy on the multiple devices according to the second operating state and the constraint condition, where the automation operation and maintenance policy is used to indicate an operation performed on at least one device in the multiple devices;
then, the metric module 602 is configured to determine the XUE of the multiple devices according to the dynamic metric manner and the first operating state when the first operating state and the second operating state satisfy a metric condition.
In a possible implementation, the automated operation and maintenance module 603 is configured to:
reasoning according to the second running state and the constraint condition by using an Artificial Intelligence (AI) model to obtain the automatic operation and maintenance strategy output by the AI model;
executing the automated operation and maintenance policy on the plurality of devices.
In a possible implementation, the metric module 602 is configured to:
calculating whether the AI model converges according to the first operation state and the second operation state by using a reinforcement learning algorithm;
determining the XUE of the plurality of devices that the AI model outputs according to the first operating state when the AI model converges.
In one possible implementation, the reinforcement learning algorithm includes a Q learning algorithm.
In one possible embodiment, the AI model is constructed by a deep reinforcement learning algorithm, or an automatic reinforcement learning algorithm.
In one possible embodiment, the constraints include at least one of load constraints, cost constraints, energy consumption constraints, and security level constraints of the data processing system.
In a possible implementation, the automated operation and maintenance module 603 is configured to:
selecting a target action from an action space according to the automatic operation and maintenance strategy, wherein the action space comprises at least one of an operation setting action, a load scheduling action or an operation and maintenance management action;
adjusting the operating parameters of the plurality of devices according to the target action.
In a possible implementation, the obtaining module 601 is configured to:
outputting a configuration interface;
at least one constraint condition is obtained in response to a configuration operation of a user on the configuration interface aiming at the constraint condition.
In one possible embodiment, the data processing system includes a data center, an availability zone, or a region.
In one possible embodiment, the data processing system is configured to perform at least one type of task among big data, artificial intelligence (AI), and high-performance computing (HPC) tasks.
Since the computation power device 600 shown in fig. 6 corresponds to the method executed by the computation power device 101 in the embodiment shown in fig. 2, the specific implementation manner of the computation power device 600 shown in fig. 6 and the technical effects thereof can be referred to the description of the relevant parts in the foregoing embodiments, and are not described herein again.
Fig. 7 is a schematic diagram of a computing device 700 provided herein.
As shown in fig. 7, the computing device 700 includes a processor 701, a memory 702, and a communication interface 703. The processor 701, the memory 702, and the communication interface 703 communicate with each other via a bus 704, and may also communicate with each other by other means such as wireless transmission. The memory 702 is used for storing instructions and the processor 701 is used for executing the instructions stored by the memory 702. Further, computing device 700 may also include a memory unit 705, where memory unit 705 may be coupled to processor 701, storage 702, and communication interface 703 via a bus 704. The memory 702 stores program codes, and the processor 701 can call the program codes stored in the memory 702 to perform the following operations:
obtaining constraint conditions influencing the operation of the data processing system;
acquiring a first operating state of the data processing system, wherein the first operating state is used for indicating operating conditions of a plurality of devices included in the data processing system, and the plurality of devices include at least one of computing devices, storage devices or network devices;
and determining a scalable converged resource utilization XUE of the plurality of devices according to a dynamic metric manner and the first operating state, wherein the dynamic metric manner is used to indicate a manner of determining the XUE of the plurality of devices by using the constraint conditions.
It should be understood that in the embodiments of the present application, the processor 701 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 702 may include both read-only memory and random access memory and provides instructions and data to the processor 701. The memory 702 may also include non-volatile random access memory. For example, the memory 702 may also store device type information.
The memory 702 may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The communication interface 703 is used for communicating with other devices connected to the computing device 700. The bus 704 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. But for clarity of illustration the various busses are labeled in the figures as bus 704.
It should be understood that the computing device 700 in this embodiment of the present application may correspond to the computation power measurement apparatus 600 described above, and may correspond to the method performed by the computation power measurement apparatus 101 in the embodiment shown in fig. 2. The foregoing and other operations and/or functions implemented by the computing device 700 respectively implement the corresponding flows of the method shown in fig. 2, and for brevity are not described here again.
The present embodiment also provides a chip, which includes a processor for executing the method steps performed by the computation power device 101 in fig. 2.
The embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium may be any available medium that a computing device can access, or a data storage device, such as a data center, containing one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive). The computer-readable storage medium includes instructions that instruct a computing device to perform the computing power measurement method described above.
The embodiment of the present application further provides a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions described in the embodiments of the present application are produced in whole or in part.
The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website site, computer, or data center to another website site, computer, or data center by wire (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.).
The computer program product may be a software installation package, which may be downloaded and executed on a computing device when any of the aforementioned computing power measurement methods needs to be used.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, the foregoing embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A computing power measurement method, adapted for use in a data processing system, the method comprising:
obtaining constraint conditions affecting the operation of the data processing system;
acquiring a first operating state of the data processing system, wherein the first operating state is used for indicating operating conditions of a plurality of devices included in the data processing system, and the plurality of devices include at least one of computing devices, storage devices or network devices;
and determining a scalable converged resource utilization XUE of the plurality of devices according to a dynamic metric manner and the first operating state, wherein the dynamic metric manner is used to indicate a manner of determining the XUE of the plurality of devices by using the constraint conditions.
2. The method of claim 1, wherein the first operating state is an operating state after an automated operation and maintenance policy is executed on the plurality of devices, the method further comprising:
acquiring a second operating state of the data processing system before acquiring the first operating state of the data processing system;
executing an automated operation and maintenance strategy on the plurality of devices according to the second operating state and the constraint condition, wherein the automated operation and maintenance strategy is used for indicating an operation executed on at least one device in the plurality of devices;
then, the determining the scalable converged resource utilization XUE of the plurality of devices according to the dynamic metric manner and the first operating state comprises:
when the first operating state and the second operating state satisfy a metric condition, determining the XUE of the plurality of devices according to the dynamic metric manner and the first operating state.
3. The method of claim 2, wherein executing an automated operation and maintenance strategy on the plurality of devices according to the second operating state and the constraint condition comprises:
reasoning is carried out according to the second running state and the constraint condition by utilizing an Artificial Intelligence (AI) model to obtain the automatic operation and maintenance strategy output by the AI model;
executing the automated operation and maintenance policy on the plurality of devices.
4. The method according to claim 2 or 3, wherein the determining the XUE of the plurality of devices according to the dynamic metric manner and the first operating state when the first operating state and the second operating state satisfy a metric condition comprises:
calculating whether the AI model converges according to the first operation state and the second operation state by using a reinforcement learning algorithm;
determining the XUE of the plurality of devices that the AI model outputs according to the first operating state when the AI model converges.
5. The method of claim 4, wherein the reinforcement learning algorithm comprises a Q learning algorithm.
6. The method according to any one of claims 3 to 5, wherein the AI model is constructed by a deep reinforcement learning algorithm, or an automatic reinforcement learning algorithm.
7. The method of any of claims 1 to 5, wherein the constraints comprise at least one of load constraints, cost constraints, energy consumption constraints, and security level constraints of the data processing system.
8. The method of any of claims 2 to 7, wherein the executing the automated operation and maintenance policy on the plurality of devices comprises:
selecting a target action from an action space according to the automatic operation and maintenance strategy, wherein the action space comprises at least one of operation setting actions, load scheduling actions or operation and maintenance management actions;
adjusting the operating parameters of the plurality of devices according to the target action.
9. The method of any of claims 1 to 8, wherein obtaining constraints that affect the operation of the data processing system comprises:
outputting a configuration interface;
at least one constraint condition is obtained in response to a configuration operation of a user on the configuration interface aiming at the constraint condition.
10. A computational metric device, wherein the computational metric device is adapted for use in a data processing system, the computational metric device comprising:
the acquisition module is used for acquiring constraint conditions influencing the operation of the data processing system; acquiring a first operating state of the data processing system, wherein the first operating state is used for indicating operating conditions of a plurality of devices included in the data processing system, and the plurality of devices include at least one of computing devices, storage devices or network devices;
a metric module, configured to determine a scalable converged resource utilization XUE of the multiple devices according to a dynamic metric manner and the first operating state, where the dynamic metric manner is used to indicate a manner in which the XUE of the multiple devices is determined using the constraint condition.
11. The apparatus of claim 10, wherein the first operating state is an operating state after an automated operation and maintenance policy is executed on the plurality of devices;
the acquisition module is further configured to acquire a second operating state of the data processing system before acquiring the first operating state of the data processing system;
the device further comprises:
the automatic operation and maintenance module is used for executing an automatic operation and maintenance strategy on the plurality of devices according to the second operation state and the constraint condition, wherein the automatic operation and maintenance strategy is used for indicating the operation executed on at least one device in the plurality of devices;
then, the metric module is configured to determine the XUE of the multiple devices according to the dynamic metric manner and the first operating state when the first operating state and the second operating state satisfy a metric condition.
12. The apparatus of claim 11, wherein the automated operation and maintenance module is configured to:
reasoning is carried out according to the second running state and the constraint condition by utilizing an Artificial Intelligence (AI) model to obtain the automatic operation and maintenance strategy output by the AI model;
executing the automated operation and maintenance policy on the plurality of devices.
13. A computing device comprising a processor, a memory;
the processor is configured to execute instructions stored in the memory to cause the computing device to perform the steps of the method of any of claims 1 to 9.
14. A chip, characterized in that it comprises a processor for performing the steps of the method according to any one of claims 1 to 9.
15. A data processing system comprising a plurality of devices, the data processing system being adapted to perform the steps of the method according to any one of claims 1 to 9.
CN202210915475.8A 2022-06-15 2022-07-30 Force calculation measurement method and device and related equipment Active CN115408150B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210682182X 2022-06-15
CN202210682182 2022-06-15

Publications (2)

Publication Number Publication Date
CN115408150A true CN115408150A (en) 2022-11-29
CN115408150B CN115408150B (en) 2023-08-22

Family

ID=84160339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210915475.8A Active CN115408150B (en) 2022-06-15 2022-07-30 Force calculation measurement method and device and related equipment

Country Status (1)

Country Link
CN (1) CN115408150B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180204116A1 (en) * 2017-01-19 2018-07-19 Google Inc. Optimizing data center controls using neural networks
CN110752598A (en) * 2019-10-25 2020-02-04 国网河南省电力公司电力科学研究院 Method and device for evaluating flexibility of multipoint distributed energy storage system
US20200050178A1 (en) * 2017-04-26 2020-02-13 Google Llc Integrating machine learning into control systems for industrial facilities
CN111901862A (en) * 2020-07-07 2020-11-06 西安交通大学 User clustering and power distribution method, device and medium based on deep Q network
CN114253941A (en) * 2020-09-25 2022-03-29 罗克韦尔自动化技术公司 Data modeling and asset management using an industrial information center

Also Published As

Publication number Publication date
CN115408150B (en) 2023-08-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information
Inventor after: Song Binghua; Wang Fei; Cui Jin
Inventor before: Wang Fei; Song Binghua; Cui Jin