WO2024036941A1 - Parameter management system and related method - Google Patents

Parameter management system and related method

Info

Publication number
WO2024036941A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
historical
parameter
application
load characteristics
Prior art date
Application number
PCT/CN2023/081469
Other languages
English (en)
French (fr)
Inventor
任宏帅
孙涛
刘俊洋
苗永辉
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202211288871.9A (CN117632673A)
Application filed by 华为云计算技术有限公司
Publication of WO2024036941A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/445: Program loading or initiating

Definitions

  • the present application relates to the field of cloud computing technology, and in particular, to a parameter management system, a parameter management method, a computing device cluster, a computer-readable storage medium, and a computer program product.
  • the industry's mainstream parameter optimization algorithms usually use user authorization to directly use the user's environment (such as the production environment) for interactive verification.
  • users cannot use the service during the initial training or interactive verification process, which actually delays the user's time to use the service and affects the user's normal use.
  • this application provides a parameter management method.
  • This method recommends parameters based on historical data such as historical interaction records or historical running records, and reduces the number of online verifications to 0.
  • the target parameters corresponding to the current load characteristics can be directly output without incremental training.
  • prioritizing the search of historical interaction records to obtain the target parameters corresponding to the current load characteristics can greatly reduce the optimization time, solve the time constraint problem of online optimization, and meet the real-time optimization needs of the online environment.
  • This application also provides a parameter management system, a computing device cluster, a computer-readable storage medium and a computer program product corresponding to the above method.
  • this application provides a parameter management method.
  • This method can be executed by the parameter management system.
  • the parameter management system may be a software system, and the software system may be deployed in a computing device cluster.
  • the computing device cluster executes the program code of the software system, thereby executing the parameter management method of the present application.
  • the parameter management system may also be a hardware system. When the hardware system is running, it executes the parameter management method of the present application.
  • The parameter management system obtains the current load characteristics of the application in the live network environment and, based on those current load characteristics and the historical data of the application in the live network environment, determines the target parameters corresponding to the current load characteristics.
  • The historical data includes historical interaction records or historical running records.
  • The historical interaction records include first historical load characteristics and parameters recommended based on the first historical load characteristics.
  • The historical running records include second historical load characteristics and historical running parameters. The parameter management system then recommends the target parameters to the user.
  • This method recommends parameters based on historical data such as historical interaction records or historical running records, reducing the number of online verifications. Whenever a new load characteristic arrives (the "current load characteristics" input), the method can directly output the corresponding target parameters without incremental training. In particular, searching the historical interaction records first to obtain the target parameters corresponding to the current load characteristics can greatly reduce the optimization time, solve the time constraint problem of online optimization, and meet the real-time optimization needs of the online environment.
  • the target parameter includes a first target parameter under current hardware specifications.
  • The parameter management system can also obtain the current hardware specifications of the application in the live network environment.
  • The parameter management system can search the historical interaction records based on the current load characteristics and current hardware specifications of the application in the live network environment to obtain the first target parameter; or, based on the historical running records, use a machine learning algorithm to infer the first target parameter corresponding to the current load characteristics and the current hardware specifications.
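  • The search-first, infer-second flow described above could be sketched roughly as follows. This is a minimal illustration only: the record layout, the function names, the Euclidean nearest-match rule and the `max_distance` threshold are all assumptions, and `infer` stands in for whatever machine learning inference is fitted on the historical running records.

```python
import math
from typing import Callable, Optional, Sequence

# Hypothetical record layout: (load_characteristics, hardware_spec, recommended_params)
InteractionRecord = tuple[tuple[float, ...], str, dict[str, float]]


def search_interaction_records(
    records: Sequence[InteractionRecord],
    load: Sequence[float],
    hardware_spec: str,
    max_distance: float,
) -> Optional[dict[str, float]]:
    """Return the parameters of the closest record with the same hardware
    specification, or None if no record is within max_distance."""
    best_params, best_dist = None, math.inf
    for rec_load, rec_spec, rec_params in records:
        if rec_spec != hardware_spec:
            continue  # only records with matching hardware specifications count
        dist = math.dist(rec_load, load)
        if dist < best_dist:
            best_params, best_dist = rec_params, dist
    return best_params if best_dist <= max_distance else None


def recommend_first_target_parameter(
    records: Sequence[InteractionRecord],
    load: Sequence[float],
    hardware_spec: str,
    infer: Callable[[Sequence[float], str], dict[str, float]],
    max_distance: float = 0.1,
) -> dict[str, float]:
    """Search the historical interaction records first; fall back to inference."""
    hit = search_interaction_records(records, load, hardware_spec, max_distance)
    if hit is not None:
        return hit
    return infer(load, hardware_spec)
```

  • Because a hit in the interaction records returns immediately, the slower inference path only runs for load characteristics that have not been seen before.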
  • The parameter management system can infer the first target parameter corresponding to the current load characteristics and the current hardware specifications through a regression model based on the historical operating records.
  • The regression model can include a Gaussian model, a Bayesian model, or a random forest model.
  • The parameter management system can fit the load characteristics and parameters in the historical operating records with a Gaussian model, a Bayesian model, or a random forest model, and then use the fitted model to infer the first target parameter corresponding to the current load characteristics and current hardware specifications.
  • This method fits a regression model to the load characteristics and parameters in the historical operating records and then uses the fitted model for inference. It can therefore infer the first target parameter corresponding to the current load characteristics more accurately, providing a reference for parameter configuration.
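  • As a rough illustration of the fit-then-infer pattern, the sketch below uses an inverse-distance-weighted k-nearest-neighbour regressor. This is a deliberately simple stand-in, not one of the Gaussian, Bayesian, or random forest models named above, and it assumes the historical operating records have already been filtered to the current hardware specifications. All names are hypothetical.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def fit_knn_regressor(
    loads: Sequence[Vector], params: Sequence[Vector], k: int = 2
) -> Callable[[Vector], list[float]]:
    """'Fit' by memorising (load characteristics, parameters) pairs and
    return a predict(load) function."""
    data = list(zip(loads, params))
    if not data:
        raise ValueError("need at least one historical operating record")

    def predict(load: Vector) -> list[float]:
        # Take the k records whose load characteristics are closest.
        nearest = sorted(data, key=lambda rec: math.dist(rec[0], load))[:k]
        # Closer records get larger weights; the epsilon avoids division by zero.
        weights = [1.0 / (math.dist(x, load) + 1e-9) for x, _ in nearest]
        total = sum(weights)
        dims = len(nearest[0][1])
        return [
            sum(w * y[d] for w, (_, y) in zip(weights, nearest)) / total
            for d in range(dims)
        ]

    return predict
```

  • Swapping in a different regressor only changes the body of `fit_knn_regressor`; the calling code still passes load characteristics in and reads parameters out.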
  • the parameter management system can also determine a performance simulator corresponding to the current hardware specifications.
  • The performance simulator is trained on the historical running records. The parameter management system can then drive the machine learning algorithm through the performance simulator to infer the first target parameter corresponding to the current load characteristics and the current hardware specifications.
  • The parameter management system can also use mixed Latin hypercube sampling (mixLHS) to sample the sub-dataset in the historical running records that matches the current hardware specifications to obtain data samples; verify the data samples in an offline environment to obtain their real performance; and train the performance simulator corresponding to the current hardware specifications based on the data samples and the real performance.
  • hybrid Latin hypercube sampling means that some data samples are uniformly sampled using Latin hypercubes, and the other data samples are non-uniformly sampled using weighted adjustment windows.
  • Hybrid Latin hypercube sampling avoids clustering of sample points in the parameter space, which would otherwise impair the training of the performance simulator.
  • Hybrid Latin hypercube sampling allows the performance simulator to give correct feedback over the whole parameter space.
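  • A minimal sketch of the mixing idea on the unit hypercube follows. The uniform part is standard Latin hypercube sampling. The text does not spell out how the "weighted adjustment windows" work, so the non-uniform part here (Latin hypercube points squeezed into a window of a given width around a chosen center) is purely an assumption for illustration, as are all function and parameter names.

```python
import random


def latin_hypercube(n: int, dims: int, rng: random.Random) -> list[tuple[float, ...]]:
    """Uniform LHS on [0, 1)^dims: every dimension gets exactly one point per stratum."""
    columns = []
    for _ in range(dims):
        # One random point inside each of the n equal-width strata...
        strata = [(i + rng.random()) / n for i in range(n)]
        # ...then shuffle so strata are paired randomly across dimensions.
        rng.shuffle(strata)
        columns.append(strata)
    return [tuple(col[i] for col in columns) for i in range(n)]


def windowed_samples(
    n: int, center: tuple[float, ...], width: float, rng: random.Random
) -> list[tuple[float, ...]]:
    """Assumed non-uniform part: LHS points mapped into a window around `center`,
    clipped to the unit interval."""
    points = latin_hypercube(n, len(center), rng)
    return [
        tuple(min(1.0, max(0.0, c + (u - 0.5) * width)) for u, c in zip(p, center))
        for p in points
    ]


def mix_lhs(
    n_uniform: int, n_window: int, center: tuple[float, ...], width: float, seed: int = 0
) -> list[tuple[float, ...]]:
    """Concatenate the uniform samples and the windowed (denser) samples."""
    rng = random.Random(seed)
    uniform = latin_hypercube(n_uniform, len(center), rng)
    dense = windowed_samples(n_window, center, width, rng)
    return uniform + dense
```

  • The uniform part keeps coverage of the whole space, while the windowed part adds density around a region of interest such as previously stable parameters.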
  • the parameter management system can characterize the user's business-related load characteristics as environmental variables during sampling, and combine them with the filtered parameters in the sub-data set to perform data sampling.
  • The performance simulator trained on the above data samples can give accurate feedback on the client's changing usage scenarios while avoiding the additional overhead of training a model for each client scenario, realizing an online optimization method oriented to dynamic environments.
  • the target parameters include second target parameters under target hardware specifications.
  • the parameter management system may use a machine learning algorithm to infer the target hardware specifications and the second target parameters corresponding to the current load characteristics and the target hardware specifications based on the historical operating records.
  • the parameter management system can also recommend the target hardware specifications to the user.
  • this method also supports inferring the target hardware specifications corresponding to the current load characteristics and the second target parameters corresponding to the current load characteristics and target hardware specifications.
  • the parameter management system does not need to spend a lot of time retraining the AI model, thus solving the problem of changes in hardware specifications (changes in the underlying resources of the cluster).
  • The parameter management system can infer the target hardware specifications corresponding to the current load characteristics through a machine learning algorithm based on the historical operating records, and then determine the second target parameters based on the current load characteristics, the target hardware specifications, and the historical data.
  • The parameter management system may determine the second target parameters in a manner similar to that used to determine the first target parameter from the current load characteristics and current hardware specifications. Specifically, the parameter management system can search the historical interaction records according to the current load characteristics and target hardware specifications of the application in the live network environment to obtain the second target parameters; or, based on the historical operation records, use a machine learning algorithm to infer the second target parameters corresponding to the current load characteristics and the target hardware specifications.
  • the parameter management system can also infer the target hardware specifications and second target parameters at one time.
  • The parameter management system can fit an AI model that takes load characteristics as input and hardware specifications and parameters as output. The AI model then infers the target hardware specifications and second target parameters corresponding to the current load characteristics in a single pass.
  • the parameter management system can also infer the second target parameters first, and then infer the target hardware specifications. This application does not limit the method of inferring the target hardware specifications and the second target parameters.
  • This method uses historical operating records and uses machine learning algorithm reasoning to ensure the accuracy of the recommended target hardware specifications and second target parameters, and provides a reference for subsequent hardware specification adjustments and parameter configurations.
  • The parameter management system can also monitor the real performance of the application in the live network environment. When the real performance meets the trigger condition, it executes the step of determining the target parameters corresponding to the current load characteristics based on the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment.
  • This method actively monitors various performance indicators of the application, uses AI algorithms to automatically determine the timing of parameter optimization, and proactively triggers the parameter optimization service.
  • the parameter tuning service can be automatically triggered without manual intervention or task triggering. Since the parameters of the application can be tuned in a timely manner, the performance of the application throughout its life cycle is guaranteed.
  • the parameter management system can also verify the target parameters. When the verification is passed, the parameter management system configures the target parameters to the existing network environment.
  • This method performs security checks on parameters given by optimization services such as the parameter tuning service and determines whether modifying the parameters can achieve the desired effect. Parameters that meet the requirements are modified and put online; parameters that do not meet the requirements are intercepted and fed back to the optimization service for re-recommendation. This ensures production safety.
  • The parameter management system can determine the safety range constraints corresponding to the target parameters. When the target parameters satisfy the safety range constraints, and the proximity between the target parameters and the parameters in the offline verification records or historical interaction records is greater than a preset value, the target parameters are determined to have passed verification.
  • the above verification strategy is a whitelist strategy and is mainly intended for applications without a disaster recovery strategy.
  • the parameter management system analyzes the stable operation parameter range based on the offline verification interaction data, and adds safety range constraints to the searched parameters.
  • If the parameters recommended by the optimization service are similar to the parameters of stable operation records in the offline verification records or historical interaction records, and the safety range constraints are met, the verification passes. If the verification passes, the parameters can be configured to the online production environment. If the verification fails, the parameters are not configured to the production environment. This avoids production accidents caused by improper parameter configuration.
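  • The whitelist check described above might look something like the sketch below. The text does not define how "proximity" is measured, so the measure used here, 1 / (1 + range-normalised Euclidean distance), is an assumption, as are the function name, the dictionary layouts, and the example threshold.

```python
import math
from typing import Mapping, Sequence


def passes_whitelist_check(
    target: Mapping[str, float],
    safe_ranges: Mapping[str, tuple[float, float]],
    stable_records: Sequence[Mapping[str, float]],
    min_proximity: float,
) -> bool:
    """Pass only if every target parameter is inside its safety range AND the
    target is close enough to at least one record of stable operation."""
    # Step 1: safety range constraints. Unknown parameters are rejected.
    for name, value in target.items():
        if name not in safe_ranges:
            return False
        low, high = safe_ranges[name]
        if not low <= value <= high:
            return False

    # Step 2: proximity to stable records (assumed measure, see lead-in).
    def proximity(record: Mapping[str, float]) -> float:
        squared = 0.0
        for name, value in target.items():
            low, high = safe_ranges[name]
            span = (high - low) or 1.0  # guard against zero-width ranges
            squared += ((value - record[name]) / span) ** 2
        return 1.0 / (1.0 + math.sqrt(squared))

    return any(proximity(record) > min_proximity for record in stable_records)
```

  • Parameters that fail either step would be intercepted and sent back to the optimization service rather than configured in production.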
  • The application is deployed on multiple nodes in a cluster. The parameter management system can configure the target parameters to at least one of the multiple nodes and then monitor the real performance of the application on the at least one node.
  • When the real performance of the application on the at least one node improves, the target parameters are determined to have passed verification.
  • the above verification strategy belongs to the slave node verification strategy and is mainly oriented to applications deployed in multi-node clusters, such as middleware such as distributed message queues.
  • the cluster has multiple nodes.
  • the multi-node design is not only for capacity expansion, but also for disaster recovery. Even if one node goes down, there are corresponding copies on the other nodes and can still provide stable services.
  • This method has high reliability by performing verification on at least one node. After the verification is passed, the parameters are configured to other nodes to ensure overall security.
  • the parameter management system can first select a node and use safety steps to gradually adjust to the recommended target parameters, and constrain the safety range of the optimization parameters, which greatly avoids online service downtime. At the same time, monitor the performance changes of the node. When the node can run stably and the performance is improved, the parameters will be configured to take effect on the entire cluster. This can control the risk to a smaller range.
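  • The stepwise adjustment on a single node could be sketched as below for one numeric parameter. This is an illustration under several assumptions: `apply_to_node` and `measure` are hypothetical callbacks onto the selected node, a higher `measure()` value means better performance, the current value already lies inside the safety range, and a drop below the starting performance is treated as the signal to roll back.

```python
from typing import Callable


def step_towards(value: float, goal: float, max_step: float) -> float:
    """Move at most one safety step towards goal, landing exactly on it when close."""
    if abs(goal - value) <= max_step:
        return goal
    return value + max_step if goal > value else value - max_step


def verify_on_one_node(
    current: float,
    target: float,
    max_step: float,
    safe_range: tuple[float, float],
    apply_to_node: Callable[[float], None],
    measure: Callable[[], float],
) -> bool:
    """Gradually move one node to the recommended value; return True only if the
    node ends up performing better than before, otherwise roll back."""
    if max_step <= 0:
        raise ValueError("max_step must be positive")
    low, high = safe_range
    goal = max(low, min(high, target))  # constrain the goal to the safety range
    baseline = measure()
    value, perf = current, baseline
    while value != goal:
        value = step_towards(value, goal, max_step)
        apply_to_node(value)
        perf = measure()
        if perf < baseline:
            apply_to_node(current)  # performance degraded: restore the old value
            return False
    return perf > baseline
```

  • Only when this returns True would the parameter be rolled out to the remaining nodes, which keeps any regression confined to the single node under test.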
  • The application is deployed on a primary node and a backup node. The parameter management system can configure the target parameters to the backup node and monitor the real performance of the application on the backup node. When the real performance of the application on the backup node improves, the target parameters are determined to have passed verification.
  • the above verification strategy belongs to the active and backup verification strategy, which is mainly aimed at applications with disaster recovery strategies of active and backup switching mechanisms.
  • the recommended target parameters are first modified on the backup node, and the performance status of the backup node is monitored at the same time.
  • the optimized target parameters are configured to the primary node. This way the risk can be controlled to a smaller range.
  • this application provides a parameter management system.
  • the system includes:
  • the communication module is used to obtain the current load characteristics of applications in the live network environment
  • a parameter tuning module configured to determine target parameters corresponding to the current load characteristics based on the current load characteristics of the application in the existing network environment and historical data of the application in the existing network environment, where the historical data includes historical interactions Records or historical operating records, the historical interaction records include first historical load characteristics and parameters recommended based on the first historical load characteristics, and the historical operating records include second historical load characteristics and historical operating parameters;
  • a recommendation module is used to recommend the target parameters to users.
  • the target parameter includes a first target parameter under current hardware specifications
  • the communication module is also used to:
  • the parameter tuning module is specifically used for:
  • the first target parameter corresponding to the current load characteristics and the current hardware specifications is inferred through a machine learning algorithm.
  • the parameter tuning module is specifically used to:
  • the machine learning algorithm is driven through the performance simulator to reason about the first target parameter corresponding to the current load characteristics and the current hardware specifications.
  • the system also includes:
  • the training module is used to use mixed Latin hypercube sampling mixLHS to sample the sub-dataset in the historical operation record that matches the current hardware specifications to obtain data samples; to verify the data samples in an offline environment to obtain the data The real performance of the sample; according to the data sample and the real performance, train the performance simulator corresponding to the current hardware specifications.
  • the target parameters include second target parameters under target hardware specifications
  • the parameter tuning module is specifically used for:
  • the target hardware specifications and the second target parameters corresponding to the current load characteristics and the target hardware specifications through a machine learning algorithm
  • the recommendation module is also used to:
  • the parameter tuning module is specifically used to:
  • the second target parameter is determined based on the current load characteristics, the target hardware specifications, and the historical data.
  • the system also includes:
  • a monitoring module used to monitor the actual performance of the application in the live network environment
  • the parameter tuning module is specifically configured to, when the real performance meets the trigger condition, perform the step of determining the target parameters corresponding to the current load characteristics based on the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment.
  • the system also includes:
  • a verification module used to verify the target parameters
  • a configuration module configured to configure the target parameters to the existing network environment when the verification is passed.
  • the verification module is specifically used to:
  • When the target parameter satisfies the safety range constraint, and the proximity between the target parameter and the parameters in the offline verification record or historical interaction record is greater than the preset value, it is determined that the target parameter has passed the verification.
  • the application is deployed on multiple nodes in the cluster, and the configuration module is also used to:
  • the system also includes:
  • a monitoring module configured to monitor the actual performance of the application on the at least one node
  • the verification module is specifically used for:
  • the application is deployed on the primary node and the backup node, and the verification module is specifically used to:
  • this application provides a computing device cluster.
  • The computing device cluster includes at least one computing device, the at least one computing device includes at least one processor and at least one memory, the at least one memory stores computer-readable instructions, and the at least one processor executes the computer-readable instructions to cause the computing device cluster to execute the method described in the first aspect.
  • the present application provides a non-transitory readable storage medium.
  • When run by a computing device, the program stored on the medium causes the computing device to execute the method provided in the aforementioned first aspect or any possible implementation of the first aspect.
  • the storage medium stores the program.
  • the storage medium includes but is not limited to volatile memory, such as random access memory, and non-volatile memory, such as flash memory, hard disk drive (HDD), and solid state drive (SSD).
  • the present application provides a computer program product.
  • The computer program product includes computer instructions. When the instructions are executed by a computing device, the computing device executes the method provided in the aforementioned first aspect or any possible implementation of the first aspect.
  • Figure 1 is a schematic architectural diagram of a parameter management system provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of a parameter optimization triggering mechanism provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of a parameter security checking mechanism provided by an embodiment of the present application.
  • Figure 4 is a flow chart of a parameter management method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of data storage provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of parameter recommendation provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of a dynamic load simulation provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of data sampling provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of a hybrid Latin hypercube sampling provided by an embodiment of the present application.
  • Figure 10 is a schematic diagram of an application scenario of a parameter management method provided by an embodiment of the present application.
  • Figure 11 is a schematic structural diagram of a parameter management system provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Figure 13 is a schematic structural diagram of a computing device cluster provided by an embodiment of the present application.
  • Figure 14 is a schematic structural diagram of another computing device cluster provided by an embodiment of the present application.
  • Figure 15 is a schematic structural diagram of another computing device cluster provided by an embodiment of the present application.
  • The terms "first" and "second" in the embodiments of this application are used only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Therefore, features defined as "first" and "second" may explicitly or implicitly include one or more of these features.
  • Parameter optimization refers to optimizing the software stack configuration parameters (also called software parameters and application parameters) of an application so that the performance of the application meets user expectations.
  • the application can be a database-based application, a big data computing engine-based application, or a middleware-based application.
  • the software stack configuration parameters can be different.
  • the application's software stack configuration parameters can include client parameters and server parameters.
  • Client parameters include but are not limited to the batch size batch_size, the timeout limit linger_time, and the number of partitions num.partitions.
  • Server parameters include but are not limited to the number of network threads num.network.threads, the number of input/output threads num.io.threads, and the number of replica fetchers num.replica.fetchers.
  • Performance can be characterized by one or more indicators including throughput, latency, computing resource occupancy, input and output (IO) resource occupancy, and network bandwidth.
  • the computing resources may include a central processing unit (CPU) and a graphics processing unit (GPU), and the IO resources may be disk IO.
  • Parameter optimization can be divided into offline optimization and online optimization.
  • the running load of cloud applications usually changes in real time.
  • The running load can be parameterized: a large amount of interaction data (including load parameters) is collected in the early stage, an AI model is trained using machine learning methods, and the AI model then performs a small amount of incremental interactive verification in the user's running environment (i.e., the user's environment) to achieve near-real-time recommendation of optimized parameters.
  • the parameter management system may be a software system, and the software system may be deployed in a computing device cluster.
  • the computing device cluster executes the program code of the software system, thereby executing the parameter management method of the present application.
  • the parameter management system may also be a hardware system. When the hardware system is running, it executes the parameter management method of the present application.
  • the parameter management system may be a cluster of computing devices with parameter management capabilities. For ease of description, the following uses the parameter management system as an example of a software system.
  • The parameter management system can obtain the current load characteristics of the application in the live network environment, and then determine the target parameters corresponding to the current load characteristics based on those characteristics and the historical data of the application in the live network environment, where the historical data includes historical interaction records or historical running records.
  • Historical interaction records are the records of past parameter optimization. They include first historical load characteristics and parameters recommended based on the first historical load characteristics. Historical operating records include second historical load characteristics and historical operating parameters. The parameter management system then recommends the target parameters to the user.
  • This method recommends parameters based on historical data such as historical interaction records or historical running records, and reduces the number of online verifications to 0. Whenever a new load characteristic arrives (the "current load characteristics" input), the method can directly output the target parameters corresponding to the current load characteristics without incremental training.
  • Searching the historical interaction records first to obtain the target parameters corresponding to the current load characteristics can greatly reduce the optimization time, solve the time constraint problem of online optimization, and meet the real-time optimization needs of the online environment.
  • this method can also simulate dynamic loads, characterize the user's business-related load characteristics as environment variables, and combine them with filtered parameters for data sampling to construct a performance simulator.
  • the performance simulator can be used to predict the client.
  • this method also supports inferring the target hardware specifications and the target parameters under the target hardware specifications without spending a lot of time retraining the AI model, thereby solving the problem of hardware specification changes (changes in the underlying resources of the cluster).
  • the parameter management system 10 includes a parameter optimization device 100.
  • The parameter optimization device 100 includes a parameter tuning module 102 (also called a parameter optimizer or parameter tuner) and a data storage module 104.
  • The data storage module 104 is used to store historical data of the application in the live network environment 20.
  • the historical data includes historical interaction records or historical operating records.
  • the historical interaction records include first historical load characteristics and parameters recommended based on the first historical load characteristics.
  • the historical operating records include second historical load characteristics and historical operating parameters. Among them, the load characteristics, operating parameters, and hardware specifications can be collected by the client agent in the existing network environment 20 .
  • The parameter tuning module 102 is used to obtain the current load characteristics of the application in the live network environment 20, determine the target parameters corresponding to the current load characteristics according to the current load characteristics of the application in the live network environment 20 and the historical data of the application in the live network environment 20, and recommend the target parameters to the user.
  • users can configure parameters according to target parameters.
  • the user can configure the application parameters as target parameters through the client agent in the live network environment 20 .
  • the parameter management system 10 also includes a service monitoring device 200 .
  • the service monitoring device 200 is used to monitor the actual performance of the application in the live network environment 20 .
  • the parameter optimization device 100 determines the target parameters corresponding to the current load characteristics of the application based on the current load characteristics of the application in the live network environment 20 and the historical data of the application in the live network environment 20.
  • the business monitoring device 200 may include a performance evaluation model.
  • the business monitoring device 200 may automatically determine the timing of parameter optimization through the performance evaluation model and actively trigger the parameter optimization service.
  • the parameter optimization device 100 also includes a performance simulator 106 trained based on historical operating data.
  • the performance evaluation model can determine the performance simulator 106 corresponding to the current specifications, and input the current load characteristics and current specifications into the performance simulator 106 to obtain the predicted performance (also called simulated performance).
  • the business monitoring device 200 in the embodiment of the present application can actively monitor real performance and obtain predicted performance from the performance simulator 106, so that tuning is triggered automatically based on the real performance and the predicted performance.
  • the service monitoring device 200 can determine the difference between the predicted performance and the actual performance. When the difference is greater than a preset value, the optimization is triggered.
  • the parameter management system 10 may also include a parameter safety checking device 300 .
  • the parameter safety checking device 300 is used to perform safety checking on the target parameters recommended by the parameter optimization device 100 .
  • the parameter safety inspection device 300 can combine methods such as grayscale verification and simulation performance evaluation to determine whether configuring parameters as target parameters can achieve the expected effect.
  • the parameter security checking device 300 can perform virtual detection based on the performance evaluation model, and perform live network detection (live network grayscale detection) in the live network environment 20 .
  • live network detection supports multiple methods, such as whitelist verification, active-standby verification, or slave node verification.
  • this application proposes a method based on a performance evaluator and several parameter security verification strategies to achieve non-interruptive security verification.
  • Each set of parameters that need to go online first enters the performance evaluator before going online. After passing the verification, grayscale verification is carried out in conjunction with the above strategy.
  • the method includes:
  • S402 The parameter management system 10 obtains the current load characteristics of the application in the live network environment.
  • the live network environment also called the production environment, refers to the environment used to formally provide external services to customers. This environment usually turns off error reporting and turns on error logs.
  • Applications deployed in the live network environment can receive tasks. For example, a database-based application can receive query tasks. This task can also be called the application load.
  • the parameter management system 10 can obtain the current load characteristics of the application in the existing network environment based on the attributes of the tasks received in the current time period.
  • the parameter management system 10 can collect attributes of tasks received in the current time period through an agent (such as a client agent) deployed in the existing network environment, thereby obtaining the current load characteristics of the application in the existing network environment.
  • the load characteristics may include one or more of the number of tasks received per unit time, the average data volume of task data, and the distribution of task data.
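The load characteristics listed above can be illustrated with a small sketch. The `Task` record, the `load_characteristics` function, and the size buckets are assumptions made for illustration; the application does not specify how the agent represents tasks or their distribution.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Task:
    """Hypothetical record of one task collected by the client agent."""
    timestamp: float  # arrival time in seconds
    size_bytes: int   # size of the task data


def load_characteristics(tasks, window_seconds, size_buckets=(1_024, 65_536)):
    """Summarize the tasks received in one time window.

    Returns the number of tasks per unit time, the average data volume,
    and a coarse histogram standing in for the "distribution of task data".
    """
    bucket_count = len(size_buckets) + 1
    if not tasks:
        return {
            "tasks_per_second": 0.0,
            "avg_size_bytes": 0.0,
            "size_distribution": [0.0] * bucket_count,
        }
    counts = [0] * bucket_count
    for task in tasks:
        # Index of the first bucket whose upper edge is not exceeded.
        index = sum(task.size_bytes > edge for edge in size_buckets)
        counts[index] += 1
    total = len(tasks)
    return {
        "tasks_per_second": total / window_seconds,
        "avg_size_bytes": mean(task.size_bytes for task in tasks),
        "size_distribution": [count / total for count in counts],
    }
```

A dictionary of this shape could then serve as the "current load characteristics" consumed by the later steps.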
  • S404 The parameter management system 10 obtains the predicted performance of the application in the live network environment through the performance simulator.
  • Performance simulators are used to simulate the performance of applications under specified load characteristics and specified hardware specifications.
  • the performance simulator takes load characteristics and hardware specifications as input and predicts performance as output.
  • the parameter management system 10 can not only collect current load characteristics through the agent deployed in the live network environment, but also collect current hardware specifications through the agent.
  • the parameter management system 10 can input the current load characteristics and current hardware specifications into the performance simulator, and perform performance simulation through the performance simulator, thereby obtaining the predicted performance of the application in the current network environment.
  • the predicted performance applied in the live network environment may include one or more of throughput, delay, computing resource occupancy, IO resource occupancy, and network bandwidth predicted by the performance simulator.
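As a rough illustration of S404, the sketch below keeps one simulator per hardware specification and queries the one that matches the current specification. The `KnnSimulator` class is a deliberately simple stand-in (nearest-neighbour averaging over historical runs), not the simulator described in the application, and all names and the specification label format are hypothetical.

```python
import math


class KnnSimulator:
    """Stand-in performance simulator for one hardware specification.

    Predicts performance as the mean of the k historical runs whose
    load features are closest to the queried load.
    """

    def __init__(self, records, k=2):
        # records: list of (load_features, observed_performance) pairs
        self.records = list(records)
        self.k = k

    def predict(self, load):
        ranked = sorted(self.records, key=lambda r: math.dist(r[0], load))
        nearest = ranked[: self.k]
        return sum(perf for _, perf in nearest) / len(nearest)


def predict_performance(simulators, hardware_spec, load):
    """Pick the simulator for the current specification and run it.

    Returns None when no simulator has been trained for that specification.
    """
    simulator = simulators.get(hardware_spec)
    if simulator is None:
        return None
    return simulator.predict(load)
```

The predicted value here is a single number such as throughput; a fuller sketch would return several indicators (delay, resource occupancy, and so on).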
  • S406 The parameter management system 10 monitors the actual performance of the application in the live network environment. When the real performance meets the trigger condition, S408 is executed.
  • the parameter management system 10 can deploy a performance monitoring agent in the live network environment, for example, deploy the performance monitoring agent on the client and the server, and then monitor the real performance of the application in the live network environment through the performance monitoring agent.
  • the parameter management system 10 can determine whether to trigger parameter optimization based on real performance.
  • the parameter management system 10 may determine the difference between the predicted performance and the actual performance. When the difference is greater than the preset value, the trigger condition is met and parameter optimization can be triggered. In other embodiments, the parameter management system 10 can determine the ratio between the predicted performance and the actual performance. When the ratio is greater than the preset value, the trigger condition is met and parameter optimization can be triggered. It should be noted that the preset value compared with the difference and the preset value compared with the ratio can be set to different values, and the embodiment of the present application does not limit this.
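The two trigger conditions can be sketched as follows, assuming a higher-is-better metric such as throughput. The function name, the `mode` switch, and the handling of a non-positive actual value are illustrative assumptions.

```python
def should_trigger(predicted, actual, mode="difference", threshold=100.0):
    """Decide whether parameter optimization should be triggered.

    mode="difference": trigger when predicted - actual exceeds the threshold.
    mode="ratio":      trigger when predicted / actual exceeds the threshold.
    The two modes normally use different preset values.
    """
    if mode == "difference":
        return (predicted - actual) > threshold
    if mode == "ratio":
        if actual <= 0:
            # Assumption: a stalled service always warrants re-optimization.
            return True
        return (predicted / actual) > threshold
    raise ValueError(f"unknown mode: {mode}")
```

For example, with a predicted throughput of 1300 and an actual throughput of 1000, a difference threshold of 200 or a ratio threshold of 1.2 would both trigger optimization.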
  • the above-mentioned S404 to S406 are optional steps in the embodiment of the present application, and the above-mentioned S404 and S406 may not be executed when performing the method of the embodiment of the present application.
  • the parameter management system 10 can also trigger parameter optimization through other triggering methods.
  • S408 The parameter management system 10 determines the target parameters corresponding to the current load characteristics based on the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment.
  • the target parameters may include first target parameters under current hardware specifications.
  • the target parameter may also include the second target parameter under the target hardware specification.
  • the target hardware specification may be a hardware specification that corresponds to the current load characteristics and enables performance to be fully utilized, for example, a hardware specification that maximizes performance, which is also called an optimal hardware specification.
  • Historical data applied in the live network environment can include historical interaction records or historical running records.
  • the historical interaction record includes the first historical load characteristics and parameters recommended based on the first historical load characteristics
  • the historical operation record includes the second historical load characteristics and historical operation parameters. Based on this, the parameter management system 10 can determine the first target parameter in various ways. Each is explained in detail below.
  • the first implementation method may be that the parameter management system 10 obtains the current hardware specifications applied in the existing network environment.
  • the parameter management system 10 may search the historical interaction records based on the current load characteristics and current hardware specifications of the application in the live network environment to obtain the first target parameters.
  • the second implementation method may be that the parameter management system 10 may use a machine learning algorithm to infer the first target parameter corresponding to the current load characteristics and current hardware specifications based on historical operating records.
  • machine learning algorithms include regression algorithms, which include but are not limited to Gaussian fitting, random forest, and Bayesian fitting.
  • the parameter management system 10 may construct a regression model based on a machine learning algorithm such as a regression algorithm, and use the regression model to infer the first target parameter.
  • the parameter management system 10 can also determine a performance simulator corresponding to the current hardware specifications. The performance simulator is trained through historical running records, and then the parameter management system 10 can drive the machine learning algorithm through the performance simulator.
  • the parameter management system 10 can simulate performance through a performance simulator, use the predicted performance output by the performance simulator as feedback, and update the regression model based on the feedback, without updating the regression model based on feedback from the existing network environment.
  • the parameter management system 10 designs a data storage mechanism.
  • the parameter management system 10 can perform offline solution optimization on common specifications in an offline environment, obtain corresponding optimization parameters, and store them.
  • the parameter management system 10 can store corresponding load characteristics, hardware specifications, and parameters into historical interaction records.
  • the parameter management system 10 can search for parameters corresponding to load characteristics and hardware specifications in historical interaction records.
  • if the search is successful, the first target parameters can be obtained directly.
  • if the search is unsuccessful, the parameter management system 10 can then infer the first target parameters from historical operating data through a machine learning algorithm, which greatly reduces the optimization time.
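The lookup-first, infer-second flow can be sketched as follows. The record layouts, the load bucketing used to build lookup keys, and the nearest-neighbour averaging that stands in for the regression model are all assumptions for illustration only.

```python
import math


def bucket(load, step=50.0):
    """Coarsen load features so that similar loads share one lookup key."""
    return tuple(int(value // step) for value in load)


def recommend_first_target(load, spec, interaction_records, operating_records, k=2):
    """Return (parameters, source) for the current load and hardware spec.

    1. Search the historical interaction records for a stored recommendation.
    2. On a miss, infer parameters from historical operating records with the
       same specification (here: mean over the k nearest historical loads).
    """
    key = (spec, bucket(load))
    if key in interaction_records:
        return interaction_records[key], "lookup"

    candidates = [r for r in operating_records if r["spec"] == spec]
    if not candidates:
        return None, "none"
    candidates.sort(key=lambda r: math.dist(r["load"], load))
    nearest = candidates[:k]
    inferred = {
        name: sum(r["params"][name] for r in nearest) / len(nearest)
        for name in nearest[0]["params"]
    }
    return inferred, "inferred"
```

Storing each newly inferred result back into `interaction_records` would make later requests with a similar load a direct lookup, which matches the stated goal of faster optimization as data accumulates.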
  • the parameter management system 10 can train a performance simulator corresponding to the hardware specifications in an offline environment. After completing a new round of search optimization, the first target parameter and the performance simulator for the corresponding hardware specifications can also be stored.
  • as users grow and data accumulates, parameter optimization becomes faster and the quality of the optimized parameters becomes higher.
  • the parameter management system 10 can also obtain the second target parameter through inference using a machine learning algorithm.
  • the parameter management system 10 can directly infer the target hardware specifications and the second target parameters under the target hardware specifications (for example, parameters and specifications are optimized simultaneously to obtain optimal specifications and optimal parameters).
  • the parameter management system 10 can also first deduce the target hardware specifications (the optimal specifications in FIG. 6), and then determine the second target parameters (the optimal parameters in FIG. 6) in a manner similar to determining the first target parameters.
  • the parameter management system 10 can also infer the second target parameters first, and then infer the target hardware specifications.
  • the following is an example in which the parameter management system 10 uses a machine learning algorithm to infer the target hardware specifications and the second target parameters under the target hardware specifications at one time.
  • the parameter management system 10 can infer the target hardware specifications and the second target parameters corresponding to the current load characteristics and the target hardware specifications through a machine learning algorithm based on historical operating records.
  • the historical operating records can include second historical load characteristics, historical hardware specifications, and historical operating parameters.
  • the parameter management system 10 can construct an AI model through a machine learning algorithm based on the historical operating records.
  • the AI model takes load characteristics as input and outputs hardware specifications and parameters. In this way, the parameter management system 10 can input the current load characteristics into the trained AI model, and obtain the hardware specifications and parameters output by the AI model as the target hardware specifications and second target parameters.
  • S410 The parameter management system 10 recommends target parameters to the user.
  • the parameter management system 10 may also recommend the target hardware specification to the user.
  • S412 The parameter management system 10 verifies the target parameters. When the verification passes, S414 is executed; when the verification fails, S408 is returned.
  • S414 The parameter management system 10 configures the target parameters to the existing network environment.
  • after the parameter management system 10 determines the target parameters, it cannot guarantee how the target parameters will perform in the real live network environment, because they have not yet been run there. Putting the target parameters online directly may bring risks to the business. Considering that verification that interrupts the service brings a poor user experience, the parameter management system 10 proposes a performance simulator-based method combined with several parameter security verification strategies to implement non-interruptive security verification.
  • when the verification passes, S414 is executed to configure the target parameters to the live network environment.
  • when the verification fails, the method returns to S408 to perform parameter optimization again.
  • the target parameters that need to go online can first enter the performance simulator for virtual environment verification. After passing that verification, they can undergo grayscale verification on the live network using any one or more of the whitelist strategy, the slave node verification strategy, and the active-standby verification strategy.
  • the whitelist strategy is usually aimed at applications without disaster recovery strategies. Specifically, the parameter management system 10 determines the security range constraints corresponding to the target parameters.
  • when the target parameters meet the security range constraints, and the closeness between the target parameters and a parameter in the offline verification records (records of interactive verification in an offline environment) or the historical interaction records is greater than the preset value, it is determined that the target parameter verification is passed.
  • the safety range constraint can be a parameter range obtained based on interactive data analysis of offline verification that enables the application to run stably.
  • when the offline verification records, or the stable operation records among the historical interaction records, contain parameters similar to the target parameters, and the target parameters also meet the security range constraints, the grayscale verification on the live network passes and the target parameters can be configured to the live network environment.
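A minimal sketch of the whitelist strategy follows. The closeness measure (one minus the mean range-normalized distance) and all names are assumptions; the application states only that closeness must exceed a preset value.

```python
def whitelist_check(target, safe_ranges, known_good, min_closeness=0.9):
    """Return True when the target parameters pass the whitelist strategy.

    target:      dict of parameter name -> value (assumed non-empty)
    safe_ranges: dict of parameter name -> (low, high), with high > low
    known_good:  parameter dicts from offline verification records or from
                 stable runs in the historical interaction records
    """
    # 1. Every target parameter must fall inside its safety range.
    for name, value in target.items():
        if name not in safe_ranges:
            return False
        low, high = safe_ranges[name]
        if not low <= value <= high:
            return False

    # 2. At least one known-good record must be close enough to the target.
    def closeness(record):
        total = 0.0
        for name, value in target.items():
            low, high = safe_ranges[name]
            total += abs(value - record[name]) / (high - low)
        return 1.0 - total / len(target)

    return any(closeness(record) > min_closeness for record in known_good)
```

A target that is inside the safe range but far from every verified record is still rejected, which reflects the requirement that both conditions hold.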
  • the slave node verification strategy is usually oriented to applications deployed on multiple nodes in the cluster, such as those based on distributed message queues, etc.
  • the cluster deploying the above application includes multiple nodes.
  • the multi-node design is not only for capacity expansion, but also for disaster recovery. Even if one node goes down, there are corresponding copies on other nodes and can still provide stable services.
  • the parameter management system 10 can configure the target parameter to at least one node among the plurality of nodes, and then monitor the actual performance of the application on the at least one node. When the actual performance of the application on the at least one node is improved, the parameter management system 10 determines that the target parameter verification is passed.
  • when the parameter management system 10 configures the target parameter to a node among the multiple nodes, it can select a node and gradually adjust it toward the target parameter using a safe step size, while constraining the target parameter to a safe range, thereby avoiding online service downtime as far as possible.
  • the parameter management system 10 monitors the performance changes of the node. When the node can operate stably and the performance is improved, the target parameters can be configured to take effect on the entire cluster.
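The slave node verification strategy can be sketched as a stepwise adjustment of one node with rollback. The `measure` callback, the node representation, and the pass criterion (strictly better than the starting performance) are illustrative assumptions.

```python
def step_towards(current, target, max_step):
    """Move current toward target by at most max_step."""
    delta = target - current
    if abs(delta) <= max_step:
        return target
    return current + max_step if delta > 0 else current - max_step


def canary_rollout(nodes, param, target, max_step, measure, canary=0):
    """Verify a target parameter on one node before applying it cluster-wide.

    nodes:   list of dicts holding each node's parameters
    measure: callable(node) -> performance, or None if the node is down
    Returns True if verification passed and all nodes were updated.
    """
    if max_step <= 0:
        raise ValueError("max_step must be positive")
    node = nodes[canary]
    original = node[param]
    baseline = measure(node)  # assumes the node is healthy at the start
    perf = baseline
    while node[param] != target:
        node[param] = step_towards(node[param], target, max_step)
        perf = measure(node)
        if perf is None or perf < baseline:
            node[param] = original  # roll back on downtime or regression
            return False
    if perf <= baseline:
        node[param] = original  # stable but no improvement: reject
        return False
    for other in nodes:
        other[param] = target  # take effect on the entire cluster
    return True
```

On failure the caller would return to parameter optimization with the failed candidate recorded, so that it is not recommended again.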
  • the active-standby verification strategy is usually aimed at applications with disaster recovery strategies of active-standby switching mechanisms.
  • the parameter management system 10 can configure the target parameters to the backup node and monitor the actual performance of the application on the backup node. When the actual performance of the application on the standby node is improved, the parameter management system 10 determines that the target parameter verification passes, and accordingly, the parameter management system 10 can configure the target parameters to the primary node.
  • the parameter management system 10 can use a variety of strategies to perform grayscale verification on the existing network.
  • the parameter management system 10 can use the active-standby verification strategy to first modify the recommended target parameters on the backup node, and can also apply the whitelist strategy.
  • when the application on the backup node achieves a performance improvement, the parameter management system 10 then configures the target parameters to the primary node.
  • S412 to S414 are optional steps in the embodiment of the present application, and the above steps may not be performed when performing the method of the embodiment of the present application. For example, when the confidence level of the target parameters is high, you can also directly configure the target parameters to the live network environment.
  • embodiments of this application provide a parameter management method.
  • This method recommends parameters based on historical data such as historical interaction records or historical running records, and reduces the number of online verifications to 0.
  • given the current load characteristics as input, the method can directly output the target parameters corresponding to the current load characteristics without incremental training.
  • prioritizing the search of historical interaction records to obtain the target parameters corresponding to the current load characteristics can greatly reduce the optimization time, solve the time constraint problem of online optimization, and meet the real-time optimization needs of the online environment.
  • this method can analyze business characteristics, and while recommending the target parameters of the application, it also recommends target hardware specifications suitable for the current business characteristics.
  • This method also supports security checks on target parameters, and combines grayscale verification and simulation performance evaluation to determine whether modified parameters can achieve the expected results. Parameters that meet the requirements are put online through gradual replacement. Parameters that do not meet the requirements are intercepted and re-recommended, thus ensuring the security of bringing parameters online.
  • embodiments of the present application may construct a performance simulator offline for search optimization.
  • the parameter management system 10 can characterize the business-related load characteristics as environmental variables, and combine them with the filtered parameters to perform data sampling to construct a performance simulator.
  • the performance simulator obtained using this method can provide accurate feedback on the client's changing usage scenarios, while avoiding the additional overhead of training a model for each client scenario, thereby realizing an online optimization method that can handle dynamic environments.
  • the user can choose which configurable parameters to open, including one or more of client configuration parameters (denoted as config client ) and server configuration parameters (denoted as config server ).
  • the user chooses to characterize the load as an environment variable (denoted as envs client ).
  • the parameter management system 10 can combine the client configuration parameters (i.e., client parameters) and the server configuration parameters (i.e., server parameters), perform sensitivity analysis on the parameters using feature screening methods such as Pearson correlation coefficients, and screen out the top n key parameters, specifically as follows:
  • config imp represents the key parameters
  • List config represents the parameter list sorted by sensitivity
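The sensitivity screening can be sketched in plain Python. Ranking by the absolute value of the Pearson coefficient and the function names are assumptions; the application names Pearson correlation only as one possible screening method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant column carries no sensitivity information
    return cov / (std_x * std_y)


def top_n_parameters(samples, performance, n):
    """Rank parameters by |Pearson r| against performance and keep the top n.

    samples:     dict of parameter name -> list of sampled values
    performance: list of measured performance values, one per sample
    """
    ranked = sorted(
        samples,
        key=lambda name: abs(pearson(samples[name], performance)),
        reverse=True,
    )
    return ranked[:n]
```

Here the sorted list plays the role of the sensitivity-ordered parameter list, and its first n entries play the role of the key parameters.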
  • LHS Latin hypercube sampling
  • the hierarchical window size of LHS sampling can be weighted and adjusted according to the data distribution of historical users' environmental variables and parameters, as shown on the right of Figure 9.
  • the parameter management system 10 may use mixed Latin hypercube sampling mixLHS for sampling. Specifically, referring to Figure 9, the total amount of data is set to D, where D/N can use uniform sampling, and D(N-1)/N can use weighted non-uniform sampling.
  • the parameter management system 10 can combine the two into a training data set.
  • the performance simulator trained using this training data set can respond more accurately to common client sample distributions.
  • the parameter management system 10 uses the mixLHS method to sample key parameters to obtain data sample X, as shown below:
  • bounds client env represents the boundaries (value range) of environment variables
  • bounds imp config represents the boundaries of key parameters
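One possible reading of mixLHS is sketched below: D/N samples come from a plain Latin hypercube over the given bounds, and the remaining D(N-1)/N samples come from a Latin hypercube whose windows are equal-probability intervals of the historical data, so that dense regions get narrower windows. The quantile-based weighting and all function names are assumptions; the application does not give the exact weighting rule.

```python
import random


def lhs(m, bounds, rng):
    """Plain LHS: one point in each of m equal-width windows per dimension."""
    if m == 0:
        return []
    columns = []
    for low, high in bounds:
        width = (high - low) / m
        column = [low + (i + rng.random()) * width for i in range(m)]
        rng.shuffle(column)
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(m)]


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    span = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + (position - lower) * span


def weighted_lhs(m, history, rng):
    """LHS with windows sized by the historical distribution per dimension."""
    if m == 0:
        return []
    columns = []
    for values in history:  # one list of historical values per dimension
        ordered = sorted(values)
        edges = [quantile(ordered, i / m) for i in range(m + 1)]
        column = [rng.uniform(edges[i], edges[i + 1]) for i in range(m)]
        rng.shuffle(column)
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(m)]


def mix_lhs(total, n, bounds, history, seed=0):
    """Combine D/N uniform samples with D(N-1)/N weighted samples."""
    rng = random.Random(seed)
    uniform_count = total // n
    uniform_part = lhs(uniform_count, bounds, rng)
    weighted_part = weighted_lhs(total - uniform_count, history, rng)
    return uniform_part + weighted_part
```

Each returned tuple holds one value per dimension, where the dimensions would be the environment variables followed by the key parameters.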
  • the parameter management system 10 can verify the data sample X in an offline environment to obtain its real performance as the real feedback Y corresponding to the data sample, and then train the performance simulator corresponding to the current hardware specifications. The details are as follows:
  • DMS Distributed Message Queuing
  • the DMS cluster is divided into client (including producers and consumers) and server multi-node (broker) deployment.
  • the parameters to be optimized are client parameters batch_size, linger_time, partitions, etc. and server parameters num.network.threads, num.io.threads, num.replica.fetchers, etc.
  • Client environment variables business settings related to user usage scenarios, used to describe the workload of the business scenario.
  • Service initialization collect statistics on common hardware specifications of the existing network from the data storage center, sample data in an offline environment, then construct a training data set based on the sampled data, and train a performance simulator with corresponding specifications based on the training data set.
  • the business monitoring device 200 deploys performance monitoring agents on the client and server respectively to monitor performance indicators (throughput, delay, CPU usage, disk IO, network bandwidth) of the business scenario.
  • the business monitoring device 200 feeds back the business scenario information to the parameter optimization device 100.
  • the fed back business scenario information includes but is not limited to client environment variables, parameters, and hardware specifications.
  • the parameter optimization device 100 reads the client environment variables and inputs them into the performance simulator of corresponding specifications to obtain predicted performance.
  • the business monitoring device 200 can determine whether the current scenario needs to be optimized based on the actual performance and the predicted performance.
  • Parameter optimization When it is determined that optimization is needed after evaluation by the performance simulator, the parameter optimization device 100 is triggered to perform parameter optimization.
  • the parameter optimization device 100 can preferentially use the client environment variables and server hardware specifications as conditions to index the data storage module 104 and check whether a reusable historical optimization parameter record with the same specifications exists. If so, the first target parameter found is sent directly to the parameter safety checking device 300. If not, the performance simulator 106 of the corresponding specification is searched for. If no simulator of the corresponding specification exists, this is recorded and the simulator is supplemented in the offline environment. At the same time, a Gaussian fitting algorithm is used to predict a set of first target parameters.
  • the parameter optimization device 100 can use the Gaussian fitting algorithm to predict the target hardware specifications suitable for the current environmental variables based on historical data such as historical operating data, then search for the second target parameters based on the target hardware specifications, and then recommend the new target hardware specifications together with the second target parameters to the user.
  • the target parameters Before the target parameters are put online, they need to be verified by the performance evaluation model and then tested by the whitelist parameter security range.
  • the whitelist parameter range here can be manually configured by the user.
  • the target parameters are first configured on a single node of the distributed message queue cluster. The node is then observed to check whether its performance reaches the expected effect and whether it runs normally. If so, the parameters of the other nodes are gradually replaced. If a node goes down or the expected effect is not achieved, the current node parameters are rolled back and the result is fed back to the parameter optimization device 100 to recalculate the target parameters.
  • the parameter management method in the embodiment of this application is generally applicable to cloud application software, and is also suitable for parameter optimization of databases, middleware, and big data computing engines.
  • the specific optimization process is as follows:
  • the application form can be a cluster deployment application or a single node deployment application, and a client agent is deployed on the application side.
  • the client agent is not limited to a specific installation package. If the parameter management system 10 itself has an API interface for collecting required data and modifying parameters, it can be regarded as a logical client agent.
  • the AI models that the parameter management system 10 uses for inference based on historical data include, but are not limited to, regression models (such as Gaussian fitting, random forest, etc.).
  • the sampling methods used by the parameter management system 10 include, but are not limited to, the expected improvement (EI) acquisition function, the upper confidence bound (UCB) acquisition function, or LHS.
  • This method introduces a business monitoring device 200, which can replace the traditional manual and task-based triggering methods, actively monitor various performance indicators of application services, and use AI algorithms to automatically determine the timing of parameter optimization and actively trigger parameter optimization services, without manual intervention or task-based triggering.
  • This method also introduces a parameter optimization device 100.
  • through the parameter tuner, the online interactive verification cost required by the online optimization algorithm can be further compressed, online parameter tuning with zero incremental interaction can be achieved, and the user's business characteristics can be analyzed, so that the method recommends not only the optimal parameters for the application but also the optimal specifications suitable for the current business characteristics.
  • this method introduces a parameter security check module to conduct security checks on the parameters given by the optimization service.
  • the embodiment of the present application also provides a parameter management system 10 as mentioned above.
  • the structure of the parameter management system 10 is introduced in detail below.
  • the parameter management system 10 includes:
  • the communication module 101 is used to obtain the current load characteristics of applications in the existing network environment
  • the parameter tuning module 102 is configured to determine the target parameters corresponding to the current load characteristics according to the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment, where the historical data includes historical interaction records or historical operating records, the historical interaction records include first historical load characteristics and parameters recommended according to the first historical load characteristics, and the historical operating records include second historical load characteristics and historical operating parameters;
  • Recommendation module 103 is used to recommend the target parameters to users.
  • the communication module 101 and the recommendation module 103 may be modules in the parameter optimization device 100 shown in FIG. 1 , or may be modules in other devices.
  • the recommendation module 103 may also be a module in the parameter security checking device 300 shown in FIG. 1 .
  • the parameter management system 10 may be divided in different ways as needed, and the embodiments of the present application do not limit this.
  • the parameter management system 10 may not include the business monitoring device 200 and the parameter security checking device 300 , and the functions of the business monitoring device 200 and the parameter security checking device 300 may be implemented by the parameter optimization device 100 .
  • the communication module 101, the parameter tuning module 102, and the recommendation module 103 can be implemented by hardware modules or software modules.
  • the communication module 101 can be implemented through a transceiver or software on the transceiver.
  • the parameter tuning module 102 and the recommendation module 103 can be implemented by a computing device or a computing engine on the computing device.
  • the parameter tuning module 102 is taken as an example for description.
  • the parameter tuning module 102 may be an application program or application program module, such as a computing engine, running on a computing device or a cluster of computing devices.
  • the parameter tuning module 102 may include at least one computing device, such as a server.
  • the parameter tuning module 102 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the above-mentioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the target parameter includes a first target parameter under current hardware specifications
  • the communication module 101 is also used to:
  • the parameter tuning module 102 is specifically used to:
  • the first target parameter corresponding to the current load characteristics and the current hardware specifications is inferred through a machine learning algorithm.
  • the parameter management system 10 may also include a data storage module 104.
  • the data storage module 104 is used to store historical data, such as historical interaction records or historical operation records.
  • the parameter tuning module 102 can search the historical interaction records stored in the data storage module 104 according to the current load characteristics and current hardware specifications of the application in the live network environment, thereby obtaining the first target parameter.
  • the parameter tuning module 102 can also obtain historical operating records from the data storage module 104, and use a machine learning algorithm to reason about the first target parameter corresponding to the current load characteristics and current hardware specifications.
  • the above-mentioned data storage module 104 can be implemented by software or hardware.
  • the data storage module 104 may include a storage engine.
  • the data storage module 104 may include at least one storage device with data storage capabilities.
  • the parameter tuning module 102 is specifically used to:
  • the machine learning algorithm is driven through the performance simulator 106 to reason about the first target parameters corresponding to the current load characteristics and the current hardware specifications.
  • system 10 further includes:
  • the training module 108 is used to sample, using mixed Latin hypercube sampling (mixLHS), the sub-dataset in the historical operating records that matches the current hardware specifications to obtain data samples; verify the data samples in an offline environment to obtain the real performance of the data samples; and train, according to the data samples and the real performance, the performance simulator corresponding to the current hardware specifications.
  • the training module 108 can be implemented by a hardware module or by a software module.
  • the training module 108 may be an application program or application program module, such as a computing engine, etc., running on a computing device or a cluster of computing devices.
  • the training module 108 may include at least one computing device, such as a server.
  • the training module 108 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the above-mentioned PLD can be implemented by a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the training module 108 may also be a module in other devices.
  • the training module 108 may also be a module in a separate training device.
  • the target parameters include second target parameters under target hardware specifications
  • the parameter tuning module 102 is specifically used to:
  • infer, according to the historical operating records and through a machine learning algorithm, the target hardware specifications and the second target parameters corresponding to the current load characteristics and the target hardware specifications;
  • the recommendation module 103 is also used to:
  • the parameter tuning module 102 is specifically used to:
  • the second target parameter is determined based on the current load characteristics, the target hardware specifications, and the historical data.
  • system 10 further includes:
  • Monitoring module 202 is used to monitor the actual performance of the application in the live network environment
  • the parameter tuning module 102 is specifically configured to, when the real performance meets the trigger condition, perform the step of determining the target parameters corresponding to the current load characteristics according to the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment.
  • the monitoring module 202 may be a module in the business monitoring device 200 shown in FIG. 1. Further, the business monitoring device 200 may also include a communication module 201. When the real performance meets the trigger condition, the communication module 201 instructs the parameter tuning module 102 to perform the step of determining the target parameters corresponding to the current load characteristics according to the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment.
  • the communication module 201 is also used to obtain the current load characteristics of the application in the live network environment, and input them into the performance evaluation model in the service monitoring device 200 .
  • the performance evaluation model can call the performance simulator 106 to perform performance evaluation and obtain predicted performance.
  • the communication module 201 can instruct the parameter tuning module 102 to perform parameter tuning.
  • system 10 further includes:
  • Verification module 302 used to verify the target parameters
  • the configuration module 304 is configured to configure the target parameters to the existing network environment when the verification is passed.
  • the verification module 302 and the configuration module 304 may be modules in the parameter security checking device 300 shown in FIG. 1 .
  • the verification module 302 and the configuration module 304 can be implemented through hardware modules or through software modules.
  • the verification module 302 and the configuration module 304 may be an application program or application program module, such as a computing engine, running on a computing device or a cluster of computing devices.
  • the verification module 302 and the configuration module 304 may include at least one computing device, such as a server.
  • the verification module 302 and the configuration module 304 may also be devices implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the above-mentioned PLD can be implemented by a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the verification module 302 is specifically used to:
  • when the target parameter satisfies the safety range constraint, and the proximity of the parameters in the offline verification record or historical interaction record to the target parameter is greater than the preset value, it is determined that the target parameter has passed the verification.
  • the application is deployed on multiple nodes in the cluster, and the configuration module 304 is also used to:
  • the system 10 also includes:
  • Monitoring module 202 is used to monitor the actual performance of the application on the at least one node
  • the verification module 302 is specifically used to:
  • the application is deployed on the primary node and the backup node, and the verification module 302 is specifically used for:
  • computing device 1200 includes: bus 1202, processor 1204, memory 1206, and communication interface 1208.
  • the processor 1204, the memory 1206 and the communication interface 1208 communicate through the bus 1202.
  • Computing device 1200 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 1200.
  • the bus 1202 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one line is used in Figure 12, but it does not mean that there is only one bus or one type of bus.
  • Bus 1202 may include a path that carries information between various components of computing device 1200 (eg, memory 1206, processor 1204, communications interface 1208).
  • the processor 1204 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
  • Memory 1206 may include volatile memory, such as random access memory (RAM).
  • the memory 1206 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • the memory 1206 stores executable program code, and the processor 1204 executes the executable program code to implement the aforementioned parameter management method. Specifically, the memory 1206 stores instructions for the parameter management system 10 to execute the parameter management method.
  • the communication interface 1208 uses transceiver modules such as, but not limited to, network interface cards and transceivers to implement communication between the computing device 1200 and other devices or communication networks.
  • the computing device cluster includes at least one computing device 1200.
  • the computing device 1200 may be a server, such as a central server, an edge server, or a local server in a local data center.
  • the computing device 1200 may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
  • the memory 1206 in one or more computing devices 1200 in the computing device cluster may store instructions for the same parameter management system 10 to perform the parameter management method.
  • one or more computing devices 1200 in the computing device cluster may also be used to execute part of the instructions of the parameter management system 10 for executing the parameter management method.
  • a combination of one or more computing devices 1200 may collectively execute instructions of the parameter management system 10 for performing the parameter management method.
  • the memory 1206 in different computing devices 1200 in the computing device cluster can store different instructions for executing part of the functions of the parameter management system 10 .
  • FIG 14 shows a possible implementation. As shown in Figure 14, two computing devices 1200A and 1200B are connected through a communication interface 1208.
  • the memory in the computing device 1200A stores instructions for executing the functions of the parameter optimization device 100.
  • the memory in the computing device 1200A stores instructions for executing the functions of the communication module 101, the parameter tuning module 102, and the recommendation module 103.
  • the memory in the computing device 1200A also stores instructions for executing the functions of the data storage module 104, the performance simulator 106, and the training module 108.
  • the memory in the computing device 1200A also stores instructions for executing the functions of the business monitoring device 200 .
  • the memory in the computing device 1200A stores instructions for executing the functions of the communication module 201 and the monitoring module 202 .
  • Instructions for performing the functions of the parameter safety checking device 300 are stored on the memory in the computing device 1200B.
  • the memory in the computing device 1200B stores instructions for performing the functions of the verification module 302 and the configuration module 304.
  • the memories 1206 of the computing devices 1200A and 1200B collectively store instructions for the parameter management system 10 to perform the parameter management method.
  • the connection between the computing devices shown in Figure 14 may reflect the fact that the parameter management method provided by this application requires the business monitoring device 200 to monitor real performance in order to trigger the parameter optimization device 100 to perform parameter tuning. Therefore, the functions of the parameter optimization device 100 and the business monitoring device 200 are executed by the computing device 1200A, and the functions of the parameter security checking device 300 are executed by the computing device 1200B.
  • the functions of computing device 1200A shown in FIG. 14 may also be performed by multiple computing devices 1200.
  • the functions of computing device 1200B may also be performed by multiple computing devices 1200.
  • one or more computing devices in a cluster of computing devices may be connected through a network.
  • the network may be a wide area network or a local area network, etc.
  • Figure 15 shows a possible implementation.
  • two computing devices 1200C and 1200D are connected through a network.
  • the connection to the network is made through a communication interface in each computing device.
  • instructions for performing the functions of the parameter optimization device 100 are stored in the memory 1206 of the computing device 1200C.
  • the memory 1206 in the computing device 1200C also stores instructions for executing the functions of the business monitoring device 200 .
  • instructions for executing the functions of the parameter safety checking device 300 are stored in the memory 1206 of the computing device 1200D.
  • the connection between the computing devices shown in Figure 15 may reflect the fact that the parameter management method provided by this application requires the business monitoring device 200 to monitor real performance in order to trigger the parameter optimization device 100 to perform parameter tuning. Therefore, the functions of the parameter optimization device 100 and the business monitoring device 200 are executed by the computing device 1200C, and the functions of the parameter security checking device 300 are executed by the computing device 1200D.
  • the functions of computing device 1200C shown in FIG. 15 may also be performed by multiple computing devices 1200.
  • the functions of computing device 1200D may also be performed by multiple computing devices 1200.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be any available medium on which a computing device can store data, or a data storage device, such as a data center, that contains one or more available media.
  • the usable media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, solid state drive), etc.
  • the computer-readable storage medium includes instructions that instruct the computing device to execute the above parameter management method applied to the parameter management system 10.
  • An embodiment of the present application also provides a computer program product containing instructions.
  • the computer program product may be a software or program product containing instructions capable of running on the computing device 1200 or stored in any available medium.
  • when the computer program product is run on at least one computing device 1200, the at least one computing device 1200 is caused to execute the above parameter management method.

Abstract

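This application provides a parameter management method, including: obtaining the current load characteristics of an application in the live network environment; determining target parameters corresponding to the current load characteristics according to the current load characteristics of the application in the live network environment and historical data of the application in the live network environment, such as historical interaction records or historical operating records, where the historical interaction records include first historical load characteristics and parameters recommended according to the first historical load characteristics, and the historical operating records include second historical load characteristics and historical operating parameters; and then recommending the target parameters to the user. The method reduces the number of online interactive verifications. Whenever new load characteristics are input, the target parameters corresponding to those load characteristics can be output directly without incremental training, which ensures that users can use the service promptly and improves the service experience.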

Description

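A parameter management system and related methods
This application claims priority to Chinese Patent Application No. 202210987574.7, entitled "A parameter management system and related methods", filed with the China National Intellectual Property Administration on August 17, 2022, and to Chinese Patent Application No. 202211288871.9, entitled "A parameter management system and related methods", filed with the China National Intellectual Property Administration on October 20, 2022, both of which are incorporated herein by reference in their entirety.
Technical Field
This application relates to the field of cloud computing technology, and in particular to a parameter management system, a parameter management method, a computing device cluster, a computer-readable storage medium, and a computer program product.
Background
With the continuous development of cloud computing technology, various cloud platforms that provide cloud computing services have emerged. As the business of cloud platforms grows, a large number of applications have appeared. To let these applications perform well across application scenarios, they usually expose a large number of configurable parameters.
Optimizing configurable parameters (also called parameter optimization) is a non-deterministic polynomial-time hard (NP-hard) problem. Manual optimization relies on experts who need extensive experience, which is very costly. With the development of artificial intelligence (AI), cloud service providers have begun to use AI capabilities to automate parameter configuration for cloud applications. For example, academia has proposed parameter optimization algorithms for application scenarios such as databases, big data, and middleware, and mature deployments have also appeared in industry.
Mainstream parameter optimization algorithms in the industry usually use the user's environment (such as the production environment) directly for interactive verification, subject to user authorization. However, the user cannot use the service during initial training or interactive verification, which delays the user's access to the service and affects normal use.
Summary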
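To address the above problems, this application provides a parameter management method. The method recommends parameters based on historical data such as historical interaction records or historical operating records, reducing the number of online verifications to zero. Whenever new load characteristics, that is, the "current load characteristics", are input, the target parameters corresponding to the current load characteristics can be output directly without incremental training. Searching the historical interaction records first to obtain the target parameters corresponding to the current load characteristics greatly shortens the optimization time, addresses the time constraints of online optimization, and meets the need for immediate optimization in an online environment. This application also provides a parameter management system, a computing device cluster, a computer-readable storage medium, and a computer program product corresponding to the above method.
In a first aspect, this application provides a parameter management method. The method may be executed by a parameter management system. The parameter management system may be a software system deployed in a computing device cluster; the computing device cluster executes the program code of the software system, thereby performing the parameter management method of this application. In some possible implementations, the parameter management system may also be a hardware system that performs the parameter management method of this application when it runs.
Specifically, the parameter management system obtains the current load characteristics of an application in the live network environment, and determines target parameters corresponding to the current load characteristics according to those current load characteristics and the historical data of the application in the live network environment. The historical data includes historical interaction records or historical operating records; the historical interaction records include first historical load characteristics and parameters recommended according to the first historical load characteristics, and the historical operating records include second historical load characteristics and historical operating parameters. The parameter management system then recommends the target parameters to the user.
The method recommends parameters based on historical data such as historical interaction records or historical operating records, reducing the number of online verifications. Whenever new load characteristics, that is, the "current load characteristics", are input, the target parameters corresponding to the current load characteristics can be output directly without incremental training. Searching the historical interaction records first to obtain the target parameters corresponding to the current load characteristics greatly shortens the optimization time, addresses the time constraints of online optimization, and meets the need for immediate optimization in an online environment.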
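In some possible implementations, the target parameters include a first target parameter under the current hardware specifications. The parameter management system may also obtain the current hardware specifications of the application in the live network environment. Accordingly, when determining the first target parameter, the parameter management system may search the historical interaction records according to the current load characteristics and current hardware specifications of the application in the live network environment to obtain the first target parameter; or, according to the historical operating records, infer through a machine learning algorithm the first target parameter corresponding to the current load characteristics and the current hardware specifications.
In this way, in business scenarios where the specifications cannot be changed or are costly to change, parameters can be recommended for the specified hardware specifications, improving application performance without increasing cost.
In some possible implementations, the parameter management system may infer, according to the historical operating records and through a regression model, the first target parameter corresponding to the current load characteristics and the current hardware specifications. The regression model may include a Gaussian model, a Bayesian model, or a random forest model. The parameter management system may fit a Gaussian model, a Bayesian model, or a random forest model to the load characteristics and parameters in the historical operating records, and then use the fitted model to infer the first target parameter corresponding to the current load characteristics and current hardware specifications.
By fitting a regression model to the load characteristics and parameters in the historical operating records and then using the fitted regression model for inference, the method can infer the first target parameter corresponding to the current load characteristics fairly accurately, providing a reference for parameter configuration.
In some possible implementations, the parameter management system may also determine a performance simulator corresponding to the current hardware specifications, where the performance simulator is trained on the historical operating records. The parameter management system may then drive the machine learning algorithm with the performance simulator to infer the first target parameter corresponding to the current load characteristics and the current hardware specifications.
Because the predicted performance output by the performance simulator is used as feedback, there is no need to wait for real performance from the production environment as feedback. This reduces the number of interactions with the production environment, which shortens the optimization time, addresses the time constraints of online optimization, and meets the need for immediate optimization in an online environment.
In some possible implementations, the parameter management system may also use mixed Latin hypercube sampling (mixLHS) to sample the sub-dataset in the historical operating records that matches the current hardware specifications to obtain data samples; verify the data samples in an offline environment to obtain the real performance of the data samples; and train the performance simulator corresponding to the current hardware specifications according to the data samples and the real performance.
Here, mixed Latin hypercube sampling means that one part of the data samples is obtained by uniform Latin hypercube sampling and the other part by non-uniform sampling with weighted, adjusted windows. On the one hand, mixed Latin hypercube sampling avoids the clustering of sample points in the parameter space that would impair the training of the performance simulator; on the other hand, it allows the performance simulator to give correct feedback over the whole parameter space.
Further, when sampling, the parameter management system may characterize the load characteristics related to the user's business as environment variables and sample them together with the filtered parameters in the sub-dataset. A performance simulator trained on such data samples can give accurate feedback for the varied usage scenarios of clients, while avoiding the extra overhead of training a model for each client scenario, thereby providing an online optimization method for dynamic environments.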
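In some possible implementations, the target parameters include a second target parameter under target hardware specifications. The parameter management system may infer, according to the historical operating records and through a machine learning algorithm, the target hardware specifications and the second target parameter corresponding to the current load characteristics and the target hardware specifications. Accordingly, the parameter management system may also recommend the target hardware specifications to the user.
For business scenarios that support specification changes, the method also supports inferring the target hardware specifications corresponding to the current load characteristics and the second target parameter corresponding to the current load characteristics and the target hardware specifications. The parameter management system does not need to spend a large amount of time retraining the AI model, which solves the problem of hardware specification changes (changes to the underlying cluster resources).
In some possible implementations, the parameter management system may infer, according to the historical operating records and through a machine learning algorithm, the target hardware specifications corresponding to the current load characteristics, and then determine the second target parameter according to the current load characteristics, the target hardware specifications, and the historical data.
When determining the second target parameter according to the current load characteristics and the target hardware specifications, the parameter management system may proceed in a way similar to determining the first target parameter according to the current load characteristics and the current hardware specifications. Specifically, the parameter management system may search the historical interaction records according to the current load characteristics of the application in the live network environment and the target hardware specifications to obtain the second target parameter; or, according to the historical operating records, infer through a machine learning algorithm the second target parameter corresponding to the current load characteristics and the target hardware specifications.
It should be noted that the parameter management system may also infer the target hardware specifications and the second target parameter in a single pass. For example, the parameter management system may fit an AI model that takes load characteristics as input and outputs hardware specifications and parameters, and use that AI model to infer, in one pass, the target hardware specifications and second target parameter corresponding to the current load characteristics. The parameter management system may also first infer the second target parameter and then infer the target hardware specifications. This application does not limit the way in which the target hardware specifications and the second target parameter are inferred.
By inferring from the historical operating records with a machine learning algorithm, the method ensures the accuracy of the recommended target hardware specifications and second target parameter, providing a reference for subsequent hardware specification adjustment and parameter configuration.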
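In some possible implementations, the parameter management system may also monitor the real performance of the application in the live network environment. When the real performance meets a trigger condition, the step of determining the target parameters corresponding to the current load characteristics according to the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment is performed.
By actively monitoring the performance indicators of the application, the method uses an AI algorithm to automatically determine when to optimize parameters and proactively triggers the parameter optimization service. The parameter tuning service can thus be triggered automatically, without manual intervention or task-based triggering. Because the application's parameters can be tuned in time, the performance of the application is maintained throughout its life cycle.
In some possible implementations, the parameter management system may also verify the target parameters. When the verification passes, the parameter management system configures the target parameters to the live network environment.
The method performs a security check on the parameters given by an optimization service, such as the parameter tuning service, to judge whether modifying the parameters will achieve the expected effect. Parameters that meet the requirements are applied and brought online; parameters that do not are intercepted and fed back to the optimization service for a new recommendation. This safeguards production.
In some possible implementations, the parameter management system may determine a safety range constraint corresponding to the target parameter. When the target parameter satisfies the safety range constraint, and the proximity of the parameters in the offline verification records or historical interaction records to the target parameter is greater than a preset value, the target parameter is determined to have passed verification.
The above verification strategy is a whitelist strategy, mainly intended for applications without a disaster recovery strategy. The parameter management system analyzes the interaction data from offline verification to obtain the parameter range for stable operation and adds a safety range constraint to the searched parameters. When a parameter recommended by the optimization service is close to a parameter in a stably running record in the offline verification records or historical interaction records, and satisfies the safety range constraint, the verification passes. If the verification passes, the parameter can be configured to the online production environment; if it fails, the parameter is not configured to the production environment, which avoids production incidents caused by improper parameter configuration.
In some possible implementations, the application is deployed on multiple nodes in a cluster. The parameter management system may configure the target parameters to at least one of the multiple nodes and then monitor the real performance of the application on the at least one node. When the real performance of the application on the at least one node improves, the target parameters are determined to have passed verification.
The above verification strategy is a secondary-node verification strategy, mainly intended for applications deployed on multiple nodes of a cluster, for example middleware such as a distributed message queue. The cluster has multiple nodes. Besides capacity expansion, the multi-node design also provides disaster recovery: even if one node goes down, the other nodes hold corresponding replicas and can still provide stable service. Verifying on at least one node is therefore highly reliable, and configuring the parameters to the other nodes only after the verification passes protects the cluster as a whole.
Further, when configuring the parameters, the parameter management system may first select one node and adjust it step by step toward the recommended target parameters using a safe step size, while constraining the safe range of the optimized parameters, which avoids online service outages as far as possible. The performance of the node is monitored at the same time; when the node runs stably and its performance improves, the parameters are then applied to the whole cluster. This keeps the risk within a small scope.
In some possible implementations, the application is deployed on a primary node and a backup node. The parameter management system may configure the target parameters to the backup node and monitor the real performance of the application on the backup node. When the real performance of the application on the backup node improves, the target parameters are determined to have passed verification.
The above verification strategy is a primary/backup verification strategy, mainly intended for applications whose disaster recovery strategy includes a primary/backup switchover mechanism. The method first applies the recommended target parameters on the backup node while monitoring the performance of the backup node; when the backup node achieves a performance improvement, the optimized target parameters are then configured to the primary node. This keeps the risk within a small scope.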
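In a second aspect, this application provides a parameter management system. The system includes:
a communication module, configured to obtain the current load characteristics of an application in the live network environment;
a parameter tuning module, configured to determine target parameters corresponding to the current load characteristics according to the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment, where the historical data includes historical interaction records or historical operating records, the historical interaction records include first historical load characteristics and parameters recommended according to the first historical load characteristics, and the historical operating records include second historical load characteristics and historical operating parameters;
a recommendation module, configured to recommend the target parameters to the user.
In some possible implementations, the target parameters include a first target parameter under the current hardware specifications;
the communication module is further configured to:
obtain the current hardware specifications of the application in the live network environment;
the parameter tuning module is specifically configured to:
search the historical interaction records according to the current load characteristics and current hardware specifications of the application in the live network environment to obtain the first target parameter; or,
infer, according to the historical operating records and through a machine learning algorithm, the first target parameter corresponding to the current load characteristics and the current hardware specifications.
In some possible implementations, the parameter tuning module is specifically configured to:
determine a performance simulator corresponding to the current hardware specifications, where the performance simulator is trained on the historical operating records;
drive the machine learning algorithm with the performance simulator to infer the first target parameter corresponding to the current load characteristics and the current hardware specifications.
In some possible implementations, the system further includes:
a training module, configured to use mixed Latin hypercube sampling (mixLHS) to sample the sub-dataset in the historical operating records that matches the current hardware specifications to obtain data samples; verify the data samples in an offline environment to obtain the real performance of the data samples; and train the performance simulator corresponding to the current hardware specifications according to the data samples and the real performance.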
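In some possible implementations, the target parameters include a second target parameter under target hardware specifications;
the parameter tuning module is specifically configured to:
infer, according to the historical operating records and through a machine learning algorithm, the target hardware specifications and the second target parameter corresponding to the current load characteristics and the target hardware specifications;
the recommendation module is further configured to:
recommend the target hardware specifications to the user.
In some possible implementations, the parameter tuning module is specifically configured to:
infer, according to the historical operating records and through a machine learning algorithm, the target hardware specifications corresponding to the current load characteristics;
determine the second target parameter according to the current load characteristics, the target hardware specifications, and the historical data.
In some possible implementations, the system further includes:
a monitoring module, configured to monitor the real performance of the application in the live network environment;
the parameter tuning module is specifically configured to, when the real performance meets a trigger condition, perform the step of determining the target parameters corresponding to the current load characteristics according to the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment.
In some possible implementations, the system further includes:
a verification module, configured to verify the target parameters;
a configuration module, configured to configure the target parameters to the live network environment when the verification passes.
In some possible implementations, the verification module is specifically configured to:
determine a safety range constraint corresponding to the target parameter;
when the target parameter satisfies the safety range constraint, and the proximity of the parameters in the offline verification records or historical interaction records to the target parameter is greater than a preset value, determine that the target parameter has passed verification.
In some possible implementations, the application is deployed on multiple nodes in a cluster, and the configuration module is further configured to:
configure the target parameters to at least one of the multiple nodes;
the system further includes:
a monitoring module, configured to monitor the real performance of the application on the at least one node;
the verification module is specifically configured to:
when the real performance of the application on the at least one node improves, determine that the target parameters have passed verification.
In some possible implementations, the application is deployed on a primary node and a backup node, and the verification module is specifically configured to:
configure the target parameters to the backup node;
monitor the real performance of the application on the backup node;
when the real performance of the application on the backup node improves, determine that the target parameters have passed verification.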
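In a third aspect, this application provides a computing device cluster. The computing device cluster includes at least one computing device, the at least one computing device includes at least one processor and at least one memory, the at least one memory stores computer-readable instructions, and the at least one processor executes the computer-readable instructions so that the computing device cluster performs the method according to the first aspect.
In a fourth aspect, this application provides a non-transitory readable storage medium. When the non-transitory readable storage medium is executed by a computing device, the computing device performs the method provided in the first aspect or any possible implementation of the first aspect. The storage medium stores a program. The storage medium includes, but is not limited to, volatile memory such as random access memory, and non-volatile memory such as flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
In a fifth aspect, this application provides a computer program product. The computer program product includes computer instructions which, when executed by a computing device, cause the computing device to perform the method provided in the first aspect or any possible implementation of the first aspect.
On the basis of the implementations provided in the above aspects, this application may be further combined to provide more implementations.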
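Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application more clearly, the drawings needed for the embodiments are briefly introduced below.
FIG. 1 is a schematic architecture diagram of a parameter management system provided by an embodiment of this application;
FIG. 2 is a schematic diagram of a parameter optimization trigger mechanism provided by an embodiment of this application;
FIG. 3 is a schematic diagram of a parameter security check mechanism provided by an embodiment of this application;
FIG. 4 is a flowchart of a parameter management method provided by an embodiment of this application;
FIG. 5 is a schematic diagram of data storage provided by an embodiment of this application;
FIG. 6 is a schematic diagram of the principle of parameter recommendation provided by an embodiment of this application;
FIG. 7 is a schematic diagram of dynamic load simulation provided by an embodiment of this application;
FIG. 8 is a schematic diagram of data sampling provided by an embodiment of this application;
FIG. 9 is a schematic diagram of mixed Latin hypercube sampling provided by an embodiment of this application;
FIG. 10 is a schematic diagram of an application scenario of a parameter management method provided by an embodiment of this application;
FIG. 11 is a schematic structural diagram of a parameter management system provided by an embodiment of this application;
FIG. 12 is a schematic structural diagram of a computing device provided by an embodiment of this application;
FIG. 13 is a schematic structural diagram of a computing device cluster provided by an embodiment of this application;
FIG. 14 is a schematic structural diagram of another computing device cluster provided by an embodiment of this application;
FIG. 15 is a schematic structural diagram of another computing device cluster provided by an embodiment of this application.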
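Detailed Description
The terms "first" and "second" in the embodiments of this application are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly specifying the number of the technical features indicated. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more of that feature.
Some technical terms involved in the embodiments of this application are introduced first.
Parameter optimization refers to optimizing the software stack configuration parameters of an application (also called software parameters or application parameters) so that the performance of the application meets the user's expectations. The application may be a database-based application, an application based on a big data computing engine, or a middleware-based application. The software stack configuration parameters may differ between applications.
Taking an application that communicates through middleware such as a distributed message queue (distributed message service, DMS) as an example, the software stack configuration parameters of the application may include client parameters and server parameters. Client parameters include, but are not limited to, the batch size batch_size, the timeout limit linger_time, and the number of partitions num.partitions. Server parameters include, but are not limited to, the number of network threads num.network.threads, the number of input/output threads num.io.threads, and the number of replica fetchers num.replica.fetchers.
Performance may be characterized by one or more of the following indicators: throughput, latency, computing resource utilization, input/output (IO) resource utilization, and network bandwidth. Computing resources may include a central processing unit (CPU) and a graphics processing unit (GPU), and the IO resource may be disk IO.
Parameter optimization can be divided into offline optimization and online optimization. Because cloud applications serve real-time business, their running load usually changes in real time. To handle dynamically changing running load, the load can be parameterized: a large amount of interaction data (including load parameters) is collected in advance, an AI model is trained with machine learning, and a small amount of incremental interactive verification is then performed in the user's runtime environment (that is, the user's environment) to recommend optimized parameters in near real time.
Mainstream parameter optimization algorithms use the user's environment directly for interactive verification, subject to user authorization, so the user cannot use the service during initial training or interactive verification, which delays the user's access to the service and degrades the service experience. To address this problem, this application provides a parameter management method. The method may be executed by a parameter management system.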
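The parameter management system may be a software system deployed in a computing device cluster; the computing device cluster executes the program code of the software system, thereby performing the parameter management method of this application. In some possible implementations, the parameter management system may also be a hardware system that performs the parameter management method of this application when it runs. In some examples, the parameter management system may be a computing device cluster with a parameter management function. For ease of description, the following uses a software system as the example.
Specifically, the parameter management system may obtain the current load characteristics of an application in the live network environment, and then determine target parameters corresponding to the current load characteristics according to those current load characteristics and the historical data of the application in the live network environment. The historical data includes historical interaction records or historical operating records. The historical interaction records are the history of parameter optimization and include first historical load characteristics and parameters recommended according to the first historical load characteristics; the historical operating records include second historical load characteristics and historical operating parameters. The parameter management system then recommends the target parameters to the user.
The method recommends parameters based on historical data such as historical interaction records or historical operating records, reducing the number of online verifications to zero. Whenever new load characteristics, that is, the "current load characteristics", are input, the target parameters corresponding to the current load characteristics can be output directly without incremental training. Searching the historical interaction records first to obtain the target parameters corresponding to the current load characteristics greatly shortens the optimization time, addresses the time constraints of online optimization, and meets the need for immediate optimization in an online environment.
Moreover, the method can simulate dynamic load: it characterizes the load characteristics related to the user's business as environment variables, samples them together with the filtered parameters, and constructs a performance simulator. A performance simulator obtained in this way can give accurate feedback for the varied usage scenarios of clients, while avoiding the extra overhead of training a model for each client scenario, thereby providing an online optimization method for dynamic environments. Further, the method also supports inferring the target hardware specifications and the target parameters under the target hardware specifications without spending a large amount of time retraining the AI model, which solves the problem of hardware specification changes (changes to the underlying cluster resources).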
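To make the technical solutions of this application clearer and easier to understand, the architecture of the parameter management system of this application is introduced below with reference to the drawings.
Referring to the architecture diagram of the parameter management system shown in FIG. 1, the parameter management system 10 includes a parameter optimization device 100, and the parameter optimization device 100 includes a parameter tuning module 102 (also called a parameter optimizer or parameter tuner) and a data storage module 104.
The data storage module 104 is used to store the historical data of the application in the live network environment 20. The historical data includes historical interaction records or historical operating records; the historical interaction records include first historical load characteristics and parameters recommended according to the first historical load characteristics, and the historical operating records include second historical load characteristics and historical operating parameters. The load characteristics, operating parameters, and hardware specifications may be collected by a client agent in the live network environment 20.
The parameter tuning module 102 is used to obtain the current load characteristics of the application in the live network environment 20, determine the target parameters corresponding to the current load characteristics according to the current load characteristics of the application in the live network environment 20 and the historical data of the application in the live network environment 20, and recommend the target parameters to the user. The user can then configure parameters according to the target parameters. For example, the user may configure the parameters of the application to the target parameters through the client agent in the live network environment 20.
In some possible implementations, the parameter management system 10 further includes a business monitoring device 200. The business monitoring device 200 is used to monitor the real performance of the application in the live network environment 20. When the real performance meets a trigger condition, the parameter optimization device 100 performs the step of determining the target parameters corresponding to the current load characteristics according to the current load characteristics of the application in the live network environment 20 and the historical data of the application in the live network environment 20.
Specifically, the business monitoring device 200 may include a performance evaluation model, through which it can automatically determine when to optimize parameters and proactively trigger the parameter optimization service. For example, the parameter optimization device 100 further includes a performance simulator 106 trained on historical operating data. The performance evaluation model may determine the performance simulator 106 corresponding to the current specifications and input the current load characteristics and current specifications into the performance simulator 106 to obtain the predicted performance (also called the simulated performance). As shown in FIG. 2, unlike traditional user-triggered or task-triggered tuning, the business monitoring device 200 of this embodiment can actively monitor the real performance, obtain the predicted performance from the performance simulator 106, and trigger tuning automatically based on the real performance and the predicted performance. For example, the business monitoring device 200 may determine the difference between the predicted performance and the real performance and trigger tuning when the difference is greater than a preset value.
In some possible implementations, the parameter management system 10 may further include a parameter security checking device 300. The parameter security checking device 300 is used to perform a security check on the target parameters recommended by the parameter optimization device 100. Specifically, the parameter security checking device 300 may combine methods such as grayscale verification and simulated performance evaluation to judge whether configuring the parameters to the target parameters will achieve the expected effect. As shown in FIG. 3, the parameter security checking device 300 may perform virtual detection based on the performance evaluation model and live network detection (live network grayscale detection) in the live network environment 20. Live network detection supports several modes, such as whitelist verification, primary/backup verification, or secondary-node verification. Target parameters that meet the requirements are brought online by gradual replacement; target parameters that do not are intercepted and fed back to the parameter optimization device 100 for a new recommendation. Considering that verification which interrupts the business results in a poor user experience, this application proposes a method based on a performance evaluator, combined with several parameter security verification strategies, to achieve non-disruptive security verification. Each set of parameters to be brought online first passes through the performance evaluator; after that verification passes, grayscale verification is performed with the above strategies.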
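Next, the parameter management method of the embodiments of this application is introduced from the perspective of the parameter management system 10.
Referring to the flowchart of the parameter management method shown in FIG. 4, the method includes:
S402: The parameter management system 10 obtains the current load characteristics of the application in the live network environment.
The live network environment, also called the production environment, is the environment used to formally provide external services to customers. Such an environment usually turns off error reporting and turns on error logging. An application deployed in the live network environment can receive tasks; for example, a database-based application can receive query tasks. These tasks may also be called the load of the application. The parameter management system 10 may obtain the current load characteristics of the application in the live network environment from the attributes of the tasks received in the current time period.
Specifically, the parameter management system 10 may collect the attributes of the tasks received in the current time period through an agent deployed in the live network environment (such as a client agent), thereby obtaining the current load characteristics of the application in the live network environment. The load characteristics may include one or more of the number of tasks received per unit time, the average data volume of the task data, and the distribution of the task data.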
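As an illustration of how load characteristics might be derived from task attributes, the following minimal Python sketch summarizes the tasks received in one time window. The `Task` fields, the feature names, and the use of the maximum size as a crude stand-in for the data distribution are assumptions made for illustration; none of them is specified by this application.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Task:
    """Hypothetical task record collected by an agent."""
    timestamp: float   # arrival time in seconds
    size_bytes: int    # payload size of the task data


def load_features(tasks, window_seconds):
    """Summarise the tasks received in one time window as a feature dict."""
    if not tasks or window_seconds <= 0:
        return {"tasks_per_second": 0.0, "mean_size_bytes": 0.0, "max_size_bytes": 0}
    sizes = [t.size_bytes for t in tasks]
    return {
        "tasks_per_second": len(tasks) / window_seconds,  # tasks per unit time
        "mean_size_bytes": mean(sizes),                   # average data volume
        "max_size_bytes": max(sizes),                     # crude distribution hint
    }
```

A real agent would likely report richer distribution statistics, such as quantiles or histograms, in place of the single maximum used here.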
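S404: The parameter management system 10 obtains the predicted performance of the application in the live network environment through the performance simulator.
The performance simulator is used to simulate the performance of the application under given load characteristics and given hardware specifications. It takes load characteristics and hardware specifications as input and outputs predicted performance. Through the agent deployed in the live network environment, the parameter management system 10 can collect not only the current load characteristics but also the current hardware specifications. The parameter management system 10 may input the current load characteristics and current hardware specifications into the performance simulator and run a performance simulation to obtain the predicted performance of the application in the live network environment.
The predicted performance of the application in the live network environment may include one or more of the throughput, latency, computing resource utilization, IO resource utilization, and network bandwidth predicted by the performance simulator.
S406: The parameter management system 10 monitors the real performance of the application in the live network environment. When the real performance meets the trigger condition, S408 is performed.
Specifically, the parameter management system 10 may deploy performance monitoring agents in the live network environment, for example on the client and on the server, and monitor the real performance of the application in the live network environment through these agents. The parameter management system 10 may determine whether to trigger parameter optimization according to the real performance.
In some embodiments, the parameter management system 10 may determine the difference between the predicted performance and the real performance; when the difference is greater than a preset value, the trigger condition is met and parameter optimization can be triggered. In other embodiments, the parameter management system 10 may determine the ratio of the predicted performance to the real performance; when the ratio is greater than a preset value, the trigger condition is met and parameter optimization can be triggered. It should be noted that the preset value compared with the difference and the preset value compared with the ratio may be set to different values, which is not limited in the embodiments of this application.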
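The difference-based and ratio-based trigger conditions can be sketched as below. This is a minimal sketch that assumes a single higher-is-better metric such as throughput; the function name, the `mode` argument, and the handling of a non-positive real performance are illustrative assumptions rather than part of this application.

```python
def should_trigger_tuning(predicted, actual, mode="difference", threshold=0.0):
    """Decide whether to trigger tuning for a higher-is-better metric."""
    if mode == "difference":
        # Trigger when the simulator predicts noticeably more than is observed.
        return (predicted - actual) > threshold
    if mode == "ratio":
        if actual <= 0:
            # Assumption: no measurable real performance counts as a trigger.
            return True
        return (predicted / actual) > threshold
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, with a predicted throughput of 1000 and a real throughput of 700, a difference threshold of 200 or a ratio threshold of 1.2 would both trigger tuning. For a lower-is-better metric such as latency, the comparison would need to be reversed.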
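S404 to S406 above are optional steps of the embodiments of this application; the method of the embodiments may also be performed without S404 and S406. For example, the parameter management system 10 may trigger parameter optimization in other ways.
S408: The parameter management system 10 determines the target parameters corresponding to the current load characteristics according to the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment.
The target parameters may include a first target parameter under the current hardware specifications. When the live network environment is an elastic environment, that is, one that supports hardware specification changes, the target parameters may also include a second target parameter under target hardware specifications. The target hardware specifications may be the hardware specifications that correspond to the current load characteristics and allow performance to be fully realized, for example the specifications that maximize performance, also called the optimal hardware specifications.
The historical data of the application in the live network environment may include historical interaction records or historical operating records. The historical interaction records include first historical load characteristics and parameters recommended according to the first historical load characteristics, and the historical operating records include second historical load characteristics and historical operating parameters. On this basis, the parameter management system 10 may determine the first target parameter in several ways, which are described in detail below.
In a first implementation, the parameter management system 10 obtains the current hardware specifications of the application in the live network environment and, accordingly, may search the historical interaction records according to the current load characteristics and current hardware specifications of the application in the live network environment to obtain the first target parameter.
In a second implementation, the parameter management system 10 may infer, according to the historical operating records and through a machine learning algorithm, the first target parameter corresponding to the current load characteristics and the current hardware specifications. The machine learning algorithm includes regression algorithms, including but not limited to Gaussian fitting, random forest, and Bayesian fitting. When inferring the first target parameter through a machine learning algorithm, the parameter management system 10 may build a regression model based on a machine learning algorithm such as a regression algorithm and use the regression model to infer the first target parameter. In some embodiments, the parameter management system 10 may also determine a performance simulator corresponding to the current hardware specifications, where the performance simulator is trained on the historical operating records, and then drive the machine learning algorithm with the performance simulator to infer the first target parameter corresponding to the current load characteristics and the current hardware specifications. That is, the parameter management system 10 may simulate performance with the performance simulator, use the predicted performance output by the simulator as feedback, and update the regression model according to that feedback, without needing feedback from the live network environment to update the regression model.
To address the time constraints of online optimization and meet the need for immediate optimization in an online environment, the parameter management system 10 uses a data storage mechanism. Referring to FIG. 5, in the initialization phase, the parameter management system 10 may solve the optimization offline for common specifications in an offline environment, obtain the corresponding optimized parameters, and store them. The parameter management system 10 may store the corresponding load characteristics, hardware specifications, and parameters in the historical interaction records. When performing parameter optimization, the parameter management system 10 can then search the historical interaction records for parameters matching the load characteristics and hardware specifications. If the search succeeds, the first target parameter is obtained directly; if it fails, the parameter management system 10 infers the first target parameter from the historical operating data through a machine learning algorithm. This greatly shortens the optimization time.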
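The search-first, infer-second flow described above can be sketched as follows. This is a minimal sketch under stated assumptions: the record layout (dicts with `spec`, `load`, and `params` keys), the relative-tolerance matching rule, and the `infer_fn` callback standing in for the machine learning inference are all hypothetical and chosen only for illustration.

```python
def recommend_first_target(load, spec, interaction_records, infer_fn, tolerance=0.1):
    """Return (params, source): reuse a close historical record, else infer."""
    best, best_dist = None, None
    for rec in interaction_records:
        if rec["spec"] != spec:
            continue  # only records for the same hardware specification qualify
        # Largest relative deviation over all load features (assumed metric).
        dist = max(
            (abs(rec["load"][k] - load[k]) / max(abs(load[k]), 1e-9) for k in load),
            default=0.0,
        )
        if dist <= tolerance and (best_dist is None or dist < best_dist):
            best, best_dist = rec, dist
    if best is not None:
        return best["params"], "history"
    # No reusable record: fall back to inference from historical operating data.
    return infer_fn(load, spec), "inferred"
```

Because a successful lookup skips inference entirely, the lookup cost, not the model cost, bounds the response time in the common case, which matches the motivation given for the data storage mechanism.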
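After each period of data accumulation, the parameter management system 10 may train the performance simulator for the corresponding hardware specifications in the offline environment. When a new round of search optimization is completed, the first target parameter and the performance simulator for the corresponding hardware specifications may also be stored. As users grow and data accumulates, parameter optimization becomes faster and the quality of the optimized parameters becomes higher.
Similar to determining the first target parameter, the parameter management system 10 may also infer the second target parameter through a machine learning algorithm. In some embodiments, referring to FIG. 6, the parameter management system 10 may directly infer the target hardware specifications and the second target parameter under the target hardware specifications (for example, optimizing parameters and specifications at the same time to obtain the optimal specifications and optimal parameters). In other embodiments, the parameter management system 10 may first infer the target hardware specifications (the optimal specifications in FIG. 6) and then determine the second target parameter under the target hardware specifications (the optimal parameters in FIG. 6) in a way similar to determining the first target parameter. Alternatively, the parameter management system 10 may first infer the second target parameter and then infer the target hardware specifications.
The following takes as an example the case where the parameter management system 10 infers, in a single pass through a machine learning algorithm, the target hardware specifications and the second target parameter under the target hardware specifications.
The parameter management system 10 may infer, according to the historical operating records and through a machine learning algorithm, the target hardware specifications and the second target parameter corresponding to the current load characteristics and the target hardware specifications. Specifically, the historical operating records may include second historical load characteristics, historical hardware specifications, and historical operating parameters. The parameter management system 10 may build an AI model from the historical operating records through a machine learning algorithm, where the AI model takes load characteristics as input and outputs hardware specifications and parameters. The parameter management system 10 can then input the current load characteristics into the trained AI model and take the hardware specifications and parameters output by the AI model as the target hardware specifications and the second target parameter.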
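As a rough illustration of a model that maps load characteristics to a hardware specification and parameters in one pass, the sketch below uses a nearest-neighbour rule as a deliberately simple stand-in for the AI model. The record layout and, in particular, the `throughput` field used to pick the best neighbour are assumptions; the application itself only states that the records contain load characteristics, hardware specifications, and operating parameters.

```python
import math


def infer_spec_and_params(current_load, run_records, k=3):
    """Pick (spec, params) from the best-performing of the k nearest records."""
    if not run_records:
        raise ValueError("no historical operating records available")

    def distance(rec):
        # Euclidean distance in load-feature space (assumed similarity metric).
        return math.sqrt(
            sum((rec["load"][key] - current_load[key]) ** 2 for key in current_load)
        )

    nearest = sorted(run_records, key=distance)[:k]
    # Assumption: each record carries an observed "throughput" to rank by.
    best = max(nearest, key=lambda rec: rec["throughput"])
    return best["spec"], best["params"]
```

A regression model such as a Gaussian process or random forest, as named in the text, would generalize between records instead of copying one; the nearest-neighbour rule is used here only to keep the example self-contained.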
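S410: The parameter management system 10 recommends the target parameters to the user.
When the target parameters include the second target parameter under the target hardware specifications, the parameter management system 10 may also recommend the target hardware specifications to the user.
S412: The parameter management system 10 verifies the target parameters. When the verification passes, S414 is performed; when the verification fails, the process returns to S408.
S414: The parameter management system 10 configures the target parameters to the live network environment.
After the parameter management system 10 determines the target parameters, their actual behavior in the live network environment cannot be guaranteed because they have not been run in the real live network environment, so bringing the target parameters online directly may put the business at risk. Considering that verification which interrupts the business results in a poor user experience, the parameter management system 10 uses a method based on the performance simulator, combined with several parameter security verification strategies, to achieve non-disruptive security verification. When the verification passes, S414 is performed to configure the target parameters to the live network environment; when the verification fails, the process returns to S408 to perform parameter optimization again.
Specifically, before going online, the target parameters may first be passed to the performance simulator for verification in a virtual environment. After that verification passes, live network grayscale verification is performed using any one or more of the whitelist strategy, the secondary-node verification strategy, and the primary/backup verification strategy.
The whitelist strategy is usually intended for applications without a disaster recovery strategy. Specifically, a safety range constraint corresponding to the target parameter is determined; when the target parameter satisfies the safety range constraint, and the proximity of the parameters in the offline verification records (records of interactive verification in the offline environment) or historical interaction records to the target parameter is greater than a preset value, the target parameter is determined to have passed verification. The safety range constraint may be the parameter range, obtained by analyzing the interaction data from offline verification, within which the application runs stably. When the target parameter is close to a parameter in a stably running record in the offline verification records or historical interaction records, and also satisfies the safety range constraint, the live network grayscale verification passes and the target parameter can be configured to the live network environment.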
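The whitelist check, a safety range constraint plus a proximity test against stable records, might look like the following sketch. The proximity measure (one minus the mean range-normalized distance) and the default threshold are assumptions; this application does not define how proximity is computed.

```python
def passes_whitelist(target, safe_ranges, stable_records, min_similarity=0.9):
    """Return True if target is in range and close to a known-stable record."""
    if not target:
        return False
    # 1. Safety range constraint: every parameter must lie inside its range.
    for name, value in target.items():
        low, high = safe_ranges[name]
        if not (low <= value <= high):
            return False

    # 2. Proximity: 1 minus the mean distance, normalised by each safe range.
    def similarity(record):
        total = 0.0
        for name, value in target.items():
            low, high = safe_ranges[name]
            span = (high - low) or 1.0
            total += abs(record[name] - value) / span
        return 1.0 - total / len(target)

    return any(similarity(rec) > min_similarity for rec in stable_records)
```

The sketch assumes every stable record contains all parameters named in `target`; a production check would also need to handle categorical parameters, which have no natural distance.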
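The secondary-node verification strategy is usually intended for applications deployed on multiple nodes of a cluster, for example applications based on middleware such as a distributed message queue (DMS). The cluster on which such an application is deployed includes multiple nodes. Besides capacity expansion, the multi-node design also provides disaster recovery: even if one node goes down, the other nodes hold corresponding replicas and can still provide stable service. On this basis, when performing live network grayscale verification, the parameter management system 10 may configure the target parameters to at least one of the multiple nodes and then monitor the real performance of the application on the at least one node. When the real performance of the application on the at least one node improves, the parameter management system 10 determines that the target parameters have passed verification.
It should be noted that, when configuring the target parameters to one of the multiple nodes, the parameter management system 10 may select one node and adjust it step by step toward the target parameters using a safe step size, while constraining the safe range of the target parameters, so as to avoid online service outages as far as possible. The parameter management system 10 monitors the performance of that node; when the node runs stably and its performance improves, the target parameters can then be applied to the whole cluster.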
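The gradual, safe-step adjustment on a single node can be sketched as below for one numeric parameter. The `measure` callback, which stands for applying a value on the chosen node and returning its observed performance, and the stop-on-first-regression rule are hypothetical simplifications.

```python
def step_towards(current, goal, safe_step):
    """Move at most safe_step from current towards goal."""
    if abs(goal - current) <= safe_step:
        return goal
    return current + safe_step if goal > current else current - safe_step


def canary_rollout(current, target, safe_step, safe_range, measure, baseline):
    """Step one node towards target; return (final_value, verified)."""
    if safe_step <= 0:
        raise ValueError("safe_step must be positive")
    low, high = safe_range
    goal = min(max(target, low), high)  # never leave the safe range
    value, best = current, baseline
    while value != goal:
        candidate = step_towards(value, goal, safe_step)
        performance = measure(candidate)  # apply on the node and observe
        if performance < best:
            return value, False  # regression: keep the last good value
        value, best = candidate, performance
    return value, best > baseline
```

Only when `verified` is true would the final value be applied to the remaining nodes of the cluster; otherwise the recommendation would be rejected and returned for re-optimization.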
主备验证策略,通常面向存在主备切换机制容灾策略的应用。具体地,参数管理系统10可以将所述目标参数配置至所述备用节点,监控所述应用在所述备用节点的真实性能。当所述应用在所述备用节点的真实性能提升,则参数管理系统10确定所述目标参数验证通过,相应地,参数管理系统10可以再将目标参数配置到主节点。
需要说明的是,参数管理系统10可以使用多种策略进行现网灰度验证。例如,参数管理系统10可以使用主备验证策略,先将推荐的目标参数在备用节点上进行修改,在修改时,还可以使用白名单策略,同时参数管理系统10备用节点上应用的性能,当备份节点上的应用能够取得性能提升,再将目标参数配置到主节点。
上述S412至S414为本申请实施例的可选步骤,执行本申请实施例的方法也可以不执行上述步骤。例如,目标参数的置信度较高时,也可以直接配置目标参数至现网环境。
基于上述内容描述,本申请实施例提供了一种参数管理方法,该方法基于历史数据如历史交互记录或历史运行记录进行参数推荐,将在线验证次数降为0,每当有新的负载特征即“当前负载特征”输入,可以无需增量训练直接输出与当前负载特征对应的目标参数。其中,优先搜索历史交互记录,以获取与当前负载特征对应的目标参数,可以极大地缩减寻优时间,解决在线优化的时间约束问题,满足在线环境的即时性优化需求。并且该方法能够分析业务特点,在推荐应用的目标参数的同时,也推荐适合当前业务特点的目标硬件规格。该方法还支持对目标参数进行安全性检查,结合灰度验证与模拟性能评估等方法,判断修改参数能否获得预期效果,对符合要求的参数采取逐步更换的方法修改上线,对于不满足要求的参数进行拦截并重新推荐,如此保障了参数上线的安全性。
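The key to the embodiment shown in FIG. 4 is the performance simulator. The process of training the performance simulator is described in detail below.
To reduce interactive changes to the user's runtime environment, and to avoid occupying the user's resources for search and optimization, this embodiment of the present application may perform search and optimization with a performance simulator constructed offline. Considering that the service load keeps changing, the parameter management system 10 may describe service-related load characteristics as environment variables and combine them with the screened parameters for data sampling to construct the performance simulator. A performance simulator obtained in this way can give accurate feedback for the varied usage scenarios of clients while avoiding the extra overhead of training a separate model for each client scenario, which yields an online optimization method suited to dynamic environments.
Referring to the schematic diagram of dynamic load simulation shown in FIG. 7, the user may choose which configurable parameters to expose, including one or more of client configuration parameters (denoted $config_{client}$) and server configuration parameters (denoted $config_{server}$). The user also chooses to describe load characteristics as environment variables (denoted $envs_{client}$). The parameter management system 10 may combine the client configuration parameters (that is, client parameters) and the server configuration parameters (that is, server parameters), perform sensitivity analysis on the parameters using a feature selection method such as the Pearson correlation coefficient, and select the top n key parameters, specifically as follows:
Here, $config_{imp}$ denotes the key parameters, and $List_{config}$ denotes the list of parameters sorted by sensitivity.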
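A minimal sketch of the sensitivity analysis described above: each parameter is scored by the absolute Pearson correlation between its sampled values and the measured performance, and the n highest-scoring parameters are kept. The function names and the convention of scoring a constant parameter as 0 are assumptions for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def top_n_key_params(samples, performance, n):
    """Rank parameters by |correlation with performance| and keep the top n.

    `samples` maps each parameter name to its values across the runs;
    `performance` holds the measured performance of the same runs.
    """
    scores = {
        name: abs(pearson(values, performance))
        for name, values in samples.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)  # sensitivity order
    return ranked[:n]
```

Here `ranked` plays the role of the sorted parameter list and the returned slice plays the role of the key parameters.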
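The purpose of training the performance simulator is to enable it to give accurate feedback on all kinds of inputs. As shown in FIG. 8, the conventional Monte Carlo sampling method may cause sample points to cluster in the parameter space, which is unfavorable for training the performance simulator. Latin hypercube sampling (LHS) may therefore be chosen to sample the sample space uniformly.
To sample more efficiently, and to make the collected data samples closer to the real load that users run online, the stratification window size of LHS may be adjusted by weighting according to the data distribution of historical users' environment variables and parameters, as shown in the right part of FIG. 9. Meanwhile, so that the performance simulator also responds correctly over the whole parameter space, the parameter management system 10 may sample with mixed Latin hypercube sampling (mixLHS). Specifically, referring to FIG. 9, let the total amount of data be D; D/N of it may be sampled uniformly, and the remaining D(N-1)/N is sampled with weighted non-uniform sampling. The parameter management system 10 may combine the two into a training data set. A performance simulator trained on this training data set responds more accurately to common client sample distributions.
The parameter management system 10 samples the key parameters with the mixLHS method to obtain data samples X, as follows:
$$X = \mathrm{mixLHS}\left(bounds_{client}^{env},\ bounds_{imp}^{config}\right) \tag{2}$$
where $bounds_{client}^{env}$ denotes the bounds (value range) of the environment variables, and $bounds_{imp}^{config}$ denotes the bounds of the key parameters.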
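One way to read the mixLHS description is sketched below: a share D/N of the samples comes from classic LHS with equal-width strata, and the rest comes from LHS whose stratum edges sit at the empirical quantiles of historical values, so regions that users visit often receive narrower strata and therefore more samples. The quantile-based weighting, the function names, and the requirement that each history list be non-empty are assumptions; the text does not specify how the window sizes are weighted.

```python
import random


def lhs_uniform(bounds, m, rng):
    """Classic LHS: one point per equal-width stratum in every dimension."""
    if m == 0:
        return []
    columns = []
    for low, high in bounds:
        width = (high - low) / m
        column = [low + (i + rng.random()) * width for i in range(m)]
        rng.shuffle(column)  # decouple the strata across dimensions
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(m)]


def quantile_edges(history, low, high, m):
    """Stratum edges at empirical quantiles of `history`, clamped to bounds."""
    data = sorted(min(max(v, low), high) for v in history)
    inner = [data[(i * len(data)) // m] for i in range(1, m)]
    return [low] + inner + [high]


def lhs_weighted(bounds, histories, m, rng):
    """LHS whose strata are narrower where historical values are dense."""
    columns = []
    for (low, high), history in zip(bounds, histories):
        edges = quantile_edges(history, low, high, m)
        column = [rng.uniform(edges[i], edges[i + 1]) for i in range(m)]
        rng.shuffle(column)
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(m)]


def mix_lhs(bounds, histories, total, n, seed=0):
    """Draw total // n uniform LHS samples, then the rest weighted by history."""
    rng = random.Random(seed)
    m_uniform = total // n
    uniform_part = lhs_uniform(bounds, m_uniform, rng)
    weighted_part = lhs_weighted(bounds, histories, total - m_uniform, rng)
    return uniform_part + weighted_part
```

With very small histories many quantile edges coincide and the weighted part collapses onto a few observed values; a real implementation would likely smooth the historical distribution first.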
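The parameter management system 10 may run verification on the data samples X in an offline environment to obtain their real performance, which serves as the real feedback Y corresponding to X. X and Y can then be combined into the training data set for the simulator, which is used to train the performance simulator corresponding to the current hardware specification. Specifically as follows: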
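To make the solution of the present application easier to understand, the following uses parameter optimization of a distributed message queue (DMS) application on the cloud as an example.
As shown in FIG. 10, a DMS cluster is divided into clients (including producers and consumers) and a server side deployed on multiple nodes (brokers). The parameters to be optimized are client parameters such as batch_size, linger_time, and partitions, and server parameters such as num.network.threads, num.io.threads, and num.replica.fetchers. The client environment variables are service settings related to the user's usage scenario and are used to describe the workload of the service scenario.
Parameter optimization may proceed through the following steps:
1. Service initialization: collect statistics on the hardware specifications commonly used on the live network from the data storage center, sample data in an offline environment, build a training data set from the sampled data, and train a performance simulator for each corresponding specification on that training data set.
2. Service performance monitoring: the service monitoring apparatus 200 deploys performance monitoring agents on the client and the server to monitor the performance indicators of the service scenario (throughput, latency, CPU usage, disk I/O, and network bandwidth). The service monitoring apparatus 200 feeds service scenario information back to the parameter optimization apparatus 100; this information includes but is not limited to the client environment variables, the parameters, and the hardware specification. The parameter optimization apparatus 100 inputs the client environment variables it reads into the performance simulator of the corresponding specification to obtain the predicted performance. The service monitoring apparatus 200 may determine, based on the real performance and the predicted performance, whether the current scenario needs optimization.
3. Parameter optimization: when evaluation by the performance simulator indicates that optimization is needed, the parameter optimization apparatus 100 is triggered to optimize the parameters. The parameter optimization apparatus 100 may first query the data storage module 104, using the client environment variables and the server hardware specification as conditions, to check whether a historical optimized-parameter record for the same specification can be reused. If one exists, the first target parameter found is fed back directly to the parameter safety check apparatus 300. If none exists, the apparatus looks for the performance simulator 106 of the corresponding specification. If no simulator exists for that specification, the specification is recorded for supplementary training in the offline environment, and a Gaussian fitting algorithm is used in the meantime to predict a set of first target parameters. If a simulator exists for that specification, Bayesian optimization is used to search for the first target parameter, which is then passed to the parameter safety check apparatus 300. In addition, the parameter optimization apparatus 100 can use historical data, such as historical run data, with a Gaussian fitting algorithm to predict a target hardware specification suited to the current environment variables, search for the second target parameter based on that target hardware specification, and recommend the new target hardware specification together with the second target parameter to the user.
4. Parameter safety check: before a target parameter goes online, it must first be verified by the performance evaluation model and then checked against the whitelist safety range for parameters, which the user may configure manually. After these checks pass, the target parameter is first configured on a single node of the distributed message queue cluster. That node is observed to determine whether its performance reaches the expected effect and whether it runs normally. If so, the parameters of the other nodes are replaced step by step; if the node goes down or the expected effect is not reached, the parameters of the current node are rolled back and the parameter optimization apparatus 100 is notified to recompute the target parameter.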
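Steps 3 and 4 above can be sketched as two small functions. This is a simplified illustration under several assumptions: the history store is a dictionary keyed by (environment, specification), an exhaustive search over a candidate list stands in for Bayesian optimization, the Gaussian fitting fallback is omitted, and `apply` and `healthy` are hypothetical callbacks supplied by the caller.

```python
def optimize(env, spec, history, simulators, candidates):
    """Step 3: reuse a historical record if present, else search a simulator.

    Returns (parameters, source). Exhaustive search over `candidates` is a
    stand-in for the Bayesian optimization named in the text.
    """
    key = (env, spec)
    if key in history:
        return history[key], "history"
    simulator = simulators.get(spec)
    if simulator is None:
        # The spec would be recorded for supplementary offline training here.
        return None, "no-simulator"
    best = max(candidates, key=lambda params: simulator(env, params))
    return best, "simulator"


def canary_rollout(nodes, target, apply, healthy):
    """Step 4: configure nodes one at a time and roll back on failure.

    `apply(node, params)` must set the parameters and return the old ones;
    `healthy(node)` reports whether the node runs normally as expected.
    """
    for node in nodes:
        previous = apply(node, target)
        if not healthy(node):
            apply(node, previous)  # roll back only the current node
            return False
    return True
```

When `canary_rollout` returns `False`, the caller would feed the failure back to the optimizer so that a new target parameter can be computed.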
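The parameter management method of this embodiment of the present application is generally applicable to application software on the cloud, and also to parameter optimization of databases, middleware, and big data computing engines. The optimization process is as follows:
1. Service initialization.
2. Service performance monitoring. The application may be deployed on a cluster or on a single node, and a client agent is deployed on the application side. The client agent is not limited to a specific installation package: if the parameter management system 10 itself has API interfaces for collecting the required data and modifying parameters, those interfaces can be regarded as a logical client agent.
3. Parameter optimization. The AI model that the parameter management system 10 uses to infer from historical data includes but is not limited to regression models (such as Gaussian fitting and random forest). The sampling method used by the parameter management system 10 is likewise not limited to the expected improvement (EI) acquisition function, the upper confidence bound (UCB) acquisition function, LHS, or the like.
4. Parameter safety check.
For the specific implementation of service initialization and the parameter safety check, refer to the related description above; details are not repeated here.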
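For reference, the two acquisition functions named in step 3 have standard closed forms when the surrogate model predicts a Gaussian with mean `mean` and standard deviation `std` at a candidate point. The sketch below is written for maximisation; the exploration parameters `xi` and `kappa` and their default values are conventional choices assumed for the example, not values given in the text.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best, xi=0.0):
    """EI over the best observed value, for a Gaussian prediction."""
    improvement = mean - best - xi
    if std <= 0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / std
    return improvement * normal_cdf(z) + std * normal_pdf(z)


def upper_confidence_bound(mean, std, kappa=2.0):
    """UCB: an optimistic estimate that rewards uncertain candidates."""
    return mean + kappa * std
```

The optimizer evaluates one of these functions on each candidate parameter set and queries the simulator at the candidate with the highest score.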
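The method introduces the service monitoring apparatus 200, which can replace the conventional manual and task-triggered approaches: it actively monitors the performance indicators of the application service and uses an AI algorithm to determine automatically when parameter optimization is needed and to trigger the parameter optimization service, without manual intervention or task triggering. The method also introduces the parameter optimization apparatus 100, whose parameter tuner further reduces the cost of online interactive verification required by online optimization algorithms, achieving online parameter tuning with zero incremental interactions; it can also analyze the user's service characteristics and, while recommending the optimal parameters for the application, recommend the optimal specification for the current service characteristics. In addition, the method introduces a parameter safety check module that checks the safety of the parameters produced by the optimization service: combining grayscale verification with simulated performance evaluation, it judges whether the parameter change can achieve the expected effect, brings parameters that meet the requirements online through gradual replacement, and intercepts parameters that do not meet the requirements and feeds them back to the optimization service for a new recommendation, so that optimized parameters do not produce unexpected effects in the live network environment and affect the service.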
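Based on the parameter management method provided in the embodiments of the present application, an embodiment of the present application further provides the parameter management system 10 described above. The structure of the parameter management system 10 is described in detail below.
Referring to the schematic structural diagram of the parameter management system 10 shown in FIG. 11, the parameter management system 10 includes:
a communication module 101, configured to obtain current load characteristics of an application in a live network environment;
a parameter tuning module 102, configured to determine, based on the current load characteristics of the application in the live network environment and historical data of the application in the live network environment, a target parameter corresponding to the current load characteristics, where the historical data includes historical interaction records or historical run records, a historical interaction record includes first historical load characteristics and a parameter recommended based on the first historical load characteristics, and a historical run record includes second historical load characteristics and historical run parameters; and
a recommendation module 103, configured to recommend the target parameter to a user.
The communication module 101 and the recommendation module 103 may be modules in the parameter optimization apparatus 100 shown in FIG. 1, or modules in another apparatus. For example, the recommendation module 103 may instead be a module in the parameter safety check apparatus 300 shown in FIG. 1.
The foregoing division into apparatuses and modules is only one possible implementation provided in this embodiment of the present application. In other possible implementations, the parameter management system 10 may be divided differently as required, which is not limited in this embodiment. For example, the parameter management system 10 may omit the service monitoring apparatus 200 and the parameter safety check apparatus 300, in which case their functions may be implemented by the parameter optimization apparatus 100.
In this embodiment, the communication module 101, the parameter tuning module 102, and the recommendation module 103 may be implemented as hardware modules or as software modules. The communication module 101 may be implemented by a transceiver or by software on a transceiver. The parameter tuning module 102 and the recommendation module 103 may be implemented by a computing device or by a computing engine on a computing device. The parameter tuning module 102 is used as an example below.
When implemented in software, the parameter tuning module 102 may be an application or an application module, such as a computing engine, that runs on a computing device or a computing device cluster.
When implemented in hardware, the parameter tuning module 102 may include at least one computing device, such as a server. Alternatively, the parameter tuning module 102 may be a device implemented with an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.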
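In some possible implementations, the target parameter includes a first target parameter under the current hardware specification;
the communication module 101 is further configured to:
obtain the current hardware specification of the application in the live network environment; and
the parameter tuning module 102 is specifically configured to:
search the historical interaction records based on the current load characteristics and the current hardware specification of the application in the live network environment to obtain the first target parameter; or
infer, based on the historical run records and through a machine learning algorithm, the first target parameter corresponding to the current load characteristics and the current hardware specification.
The parameter management system 10 may further include a data storage module 104, configured to store the historical data, for example the historical interaction records or the historical run records. Accordingly, the parameter tuning module 102 may search the historical interaction records stored in the data storage module 104 based on the current load characteristics and the current hardware specification of the application in the live network environment to obtain the first target parameter. The parameter tuning module 102 may also obtain the historical run records from the data storage module 104 and infer, through a machine learning algorithm, the first target parameter corresponding to the current load characteristics and the current hardware specification.
The data storage module 104 may be implemented in software or hardware. When implemented in software, the data storage module 104 may include a storage engine. When implemented in hardware, the data storage module 104 may include at least one storage device capable of storing data.
In some possible implementations, the parameter tuning module 102 is specifically configured to:
determine the performance simulator 106 corresponding to the current hardware specification, where the performance simulator 106 is trained on the historical run records; and
drive the machine learning algorithm through the performance simulator 106 to infer the first target parameter corresponding to the current load characteristics and the current hardware specification.
In some possible implementations, the system 10 further includes:
a training module 108, configured to: sample, using mixed Latin hypercube sampling (mixLHS), the sub-dataset of the historical run records that matches the current hardware specification, to obtain data samples; verify the data samples in an offline environment to obtain the real performance of the data samples; and train, based on the data samples and the real performance, the performance simulator corresponding to the current hardware specification.
Like the parameter tuning module 102, the training module 108 may be implemented as a hardware module or as a software module.
When implemented in software, the training module 108 may be an application or an application module, such as a computing engine, that runs on a computing device or a computing device cluster.
When implemented in hardware, the training module 108 may include at least one computing device, such as a server. Alternatively, the training module 108 may be a device implemented with an ASIC or a PLD. The PLD may be a CPLD, an FPGA, a GAL, or any combination thereof.
Further, the training module 108 may be a module in another apparatus. For example, the training module 108 may be a module in a separate training apparatus.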
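In some possible implementations, the target parameter includes a second target parameter under a target hardware specification;
the parameter tuning module 102 is specifically configured to:
infer, based on the historical run records and through a machine learning algorithm, the target hardware specification and the second target parameter corresponding to the current load characteristics and the target hardware specification; and
the recommendation module 103 is further configured to:
recommend the target hardware specification to the user.
In some possible implementations, the parameter tuning module 102 is specifically configured to:
infer, based on the historical run records and through a machine learning algorithm, the target hardware specification corresponding to the current load characteristics; and
determine the second target parameter based on the current load characteristics, the target hardware specification, and the historical data.
In some possible implementations, the system 10 further includes:
a monitoring module 202, configured to monitor the real performance of the application in the live network environment; and
the parameter tuning module 102 is specifically configured to: when the real performance satisfies a trigger condition, perform the step of determining, based on the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment, the target parameter corresponding to the current load characteristics.
The monitoring module 202 may be a module in the service monitoring apparatus 200 shown in FIG. 1. Further, the service monitoring apparatus 200 may also include a communication module 201. The communication module 201 is configured to, when the real performance satisfies the trigger condition, instruct the parameter tuning module 102 to perform the step of determining, based on the current load characteristics of the application in the live network environment and the historical data of the application in the live network environment, the target parameter corresponding to the current load characteristics.
Further, the communication module 201 is also configured to obtain the current load characteristics of the application in the live network environment and input them into the performance evaluation model in the service monitoring apparatus 200. The performance evaluation model may invoke the performance simulator 106 to evaluate performance and obtain the predicted performance. When the difference between the predicted performance and the real performance is greater than a preset value, the communication module 201 may instruct the parameter tuning module 102 to tune the parameters.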
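The trigger condition can be sketched as a comparison between predicted and real performance. The text only says that the difference must exceed a preset value; treating the metric as "higher is better" (for example throughput), using a relative rather than absolute gap, and the default of 0.1 are assumptions for the example.

```python
def should_trigger_tuning(real, predicted, preset=0.1):
    """True when real performance lags the prediction by more than `preset`.

    The gap is relative to the predicted value, so preset=0.1 means the
    application runs more than 10% below what the simulator expects.
    """
    if predicted <= 0:
        return False  # no meaningful prediction to compare against
    return (predicted - real) / predicted > preset
```

For a metric where lower is better, such as latency, the sign of the gap would be reversed.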
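In some possible implementations, the system 10 further includes:
a verification module 302, configured to verify the target parameter; and
a configuration module 304, configured to configure the target parameter in the live network environment when verification passes.
The verification module 302 and the configuration module 304 may be modules in the parameter safety check apparatus 300 shown in FIG. 1. They may be implemented as hardware modules or as software modules.
When implemented in software, the verification module 302 and the configuration module 304 may be applications or application modules, such as computing engines, that run on a computing device or a computing device cluster.
When implemented in hardware, the verification module 302 and the configuration module 304 may include at least one computing device, such as a server. Alternatively, they may be devices implemented with an ASIC or a PLD. The PLD may be a CPLD, an FPGA, a GAL, or any combination thereof.
In some possible implementations, the verification module 302 is specifically configured to:
determine the safety range constraint corresponding to the target parameter; and
when the target parameter satisfies the safety range constraint, and the closeness between the target parameter and a parameter in the offline verification records or the historical interaction records is greater than a preset value, determine that the target parameter passes verification.
In some possible implementations, the application is deployed on multiple nodes of a cluster, and the configuration module 304 is further configured to:
configure the target parameter on at least one of the multiple nodes;
the system 10 further includes:
a monitoring module 202, configured to monitor the real performance of the application on the at least one node; and
the verification module 302 is specifically configured to:
determine that the target parameter passes verification when the real performance of the application on the at least one node improves.
In some possible implementations, the application is deployed on a primary node and a standby node, and the verification module 302 is specifically configured to:
configure the target parameter on the standby node;
monitor the real performance of the application on the standby node; and
determine that the target parameter passes verification when the real performance of the application on the standby node improves.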
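The present application further provides a computing device 1200. As shown in FIG. 12, the computing device 1200 includes a bus 1202, a processor 1204, a memory 1206, and a communication interface 1208. The processor 1204, the memory 1206, and the communication interface 1208 communicate with one another through the bus 1202. The computing device 1200 may be a server or a terminal device. It should be understood that the present application does not limit the number of processors or memories in the computing device 1200.
The bus 1202 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be classified into address buses, data buses, control buses, and so on. For ease of illustration, only one line is drawn in FIG. 12, but this does not mean that there is only one bus or one type of bus. The bus 1202 may include a path for transferring information between the components of the computing device 1200 (for example, the memory 1206, the processor 1204, and the communication interface 1208).
The processor 1204 may include any one or more of processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 1206 may include a volatile memory, such as a random access memory (RAM). The memory 1206 may further include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD). The memory 1206 stores executable program code, and the processor 1204 executes the executable program code to implement the foregoing parameter management method. Specifically, the memory 1206 stores the instructions that the parameter management system 10 uses to perform the parameter management method.
The communication interface 1208 uses a transceiver module, such as but not limited to a network interface card or a transceiver, to implement communication between the computing device 1200 and other devices or a communication network.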
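An embodiment of the present application further provides a computing device cluster. The computing device cluster includes at least one computing device 1200. The computing device 1200 may be a server, for example a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device 1200 may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
As shown in FIG. 13, the computing device cluster includes at least one computing device 1200. The memories 1206 in one or more computing devices 1200 of the cluster may store the same instructions that the parameter management system 10 uses to perform the parameter management method.
In some possible implementations, one or more computing devices 1200 in the computing device cluster may each execute only part of the instructions that the parameter management system 10 uses to perform the parameter management method. In other words, a combination of one or more computing devices 1200 may jointly execute those instructions.
It should be noted that the memories 1206 in different computing devices 1200 of the cluster may store different instructions, each used to perform part of the functions of the parameter management system 10.
FIG. 14 shows one possible implementation. As shown in FIG. 14, two computing devices 1200A and 1200B are connected through the communication interface 1208.
The memory in the computing device 1200A stores instructions for performing the functions of the parameter optimization apparatus 100. For example, it stores instructions for performing the functions of the communication module 101, the parameter tuning module 102, and the recommendation module 103, and further stores instructions for performing the functions of the data storage module 104, the performance simulator 106, and the training module 108. The memory in the computing device 1200A also stores instructions for performing the functions of the service monitoring apparatus 200, for example instructions for performing the functions of the communication module 201 and the monitoring module 202.
The memory in the computing device 1200B stores instructions for performing the functions of the parameter safety check apparatus 300. For example, it stores instructions for performing the functions of the verification module 302 and the configuration module 304.
In other words, the memories 1206 of the computing devices 1200A and 1200B jointly store the instructions that the parameter management system 10 uses to perform the parameter management method.
The connection between the computing devices in the cluster shown in FIG. 14 may reflect the fact that the parameter management method provided in the present application requires the service monitoring apparatus 200 to monitor the real performance in order to trigger parameter tuning by the parameter optimization apparatus 100. The functions of the parameter optimization apparatus 100 and the service monitoring apparatus 200 are therefore assigned to the computing device 1200A, and the functions of the parameter safety check apparatus 300 are performed by the computing device 1200B.
It should be understood that the functions of the computing device 1200A shown in FIG. 14 may also be performed by multiple computing devices 1200. Likewise, the functions of the computing device 1200B may also be performed by multiple computing devices 1200.
In some possible implementations, one or more computing devices in the computing device cluster may be connected through a network. The network may be a wide area network, a local area network, or the like. FIG. 15 shows one possible implementation. As shown in FIG. 15, two computing devices 1200C and 1200D are connected through a network. Specifically, each computing device connects to the network through its communication interface. In this type of implementation, the memory 1206 in the computing device 1200C stores instructions for performing the functions of the parameter optimization apparatus 100 and further stores instructions for performing the functions of the service monitoring apparatus 200. Meanwhile, the memory 1206 in the computing device 1200D stores instructions for performing the functions of the parameter safety check apparatus 300.
The connection between the computing devices in the cluster shown in FIG. 15 may reflect the fact that the parameter management method provided in the present application requires the service monitoring apparatus 200 to monitor the real performance in order to trigger parameter tuning by the parameter optimization apparatus 100. The functions of the parameter optimization apparatus 100 and the service monitoring apparatus 200 are therefore assigned to the computing device 1200C, and the functions of the parameter safety check apparatus 300 are performed by the computing device 1200D.
It should be understood that the functions of the computing device 1200C shown in FIG. 15 may also be performed by multiple computing devices 1200. Likewise, the functions of the computing device 1200D may also be performed by multiple computing devices 1200.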
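An embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium may be any usable medium that a computing device can store data on, or a data storage device, such as a data center, that contains one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state drive), or the like. The computer-readable storage medium includes instructions that instruct a computing device to perform the foregoing parameter management method performed by the parameter management system 10.
An embodiment of the present application further provides a computer program product containing instructions. The computer program product may be software or a program product that contains instructions and that can run on the computing device 1200 or be stored in any usable medium. When the computer program product runs on at least one computing device 1200, the at least one computing device 1200 is caused to perform the foregoing parameter management method.
Finally, it should be noted that the foregoing embodiments are intended only to describe the technical solutions of the present invention, not to limit them. Although the present invention is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of protection of the technical solutions of the embodiments of the present invention.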

Claims (25)

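  1. A parameter management method, characterized in that the method comprises:
    obtaining current load characteristics of an application in a live network environment;
    determining, based on the current load characteristics of the application in the live network environment and historical data of the application in the live network environment, a target parameter corresponding to the current load characteristics, wherein the historical data comprises historical interaction records or historical run records, a historical interaction record comprises first historical load characteristics and a parameter recommended based on the first historical load characteristics, and a historical run record comprises second historical load characteristics and historical run parameters; and
    recommending the target parameter to a user.
  2. The method according to claim 1, characterized in that the target parameter comprises a first target parameter under a current hardware specification;
    the method further comprises:
    obtaining the current hardware specification of the application in the live network environment; and
    the determining, based on the current load characteristics of the application in the live network environment and historical data of the application in the live network environment, a target parameter corresponding to the current load characteristics comprises:
    searching the historical interaction records based on the current load characteristics and the current hardware specification of the application in the live network environment to obtain the first target parameter; or
    inferring, based on the historical run records and through a machine learning algorithm, the first target parameter corresponding to the current load characteristics and the current hardware specification.
  3. The method according to claim 2, characterized in that the inferring, based on the historical run records and through a machine learning algorithm, the first target parameter corresponding to the current load characteristics and the current hardware specification comprises:
    determining a performance simulator corresponding to the current hardware specification, wherein the performance simulator is trained on the historical run records; and
    driving the machine learning algorithm through the performance simulator to infer the first target parameter corresponding to the current load characteristics and the current hardware specification.
  4. The method according to claim 3, characterized in that the performance simulator is trained as follows:
    sampling, using mixed Latin hypercube sampling (mixLHS), a sub-dataset of the historical run records that matches the current hardware specification, to obtain data samples;
    verifying the data samples in an offline environment to obtain real performance of the data samples; and
    training, based on the data samples and the real performance, the performance simulator corresponding to the current hardware specification.
  5. The method according to any one of claims 1 to 4, characterized in that the target parameter comprises a second target parameter under a target hardware specification;
    the determining, based on the current load characteristics of the application in the live network environment and historical data of the application in the live network environment, a target parameter corresponding to the current load characteristics comprises:
    inferring, based on the historical run records and through a machine learning algorithm, the target hardware specification and the second target parameter corresponding to the current load characteristics and the target hardware specification; and
    the method further comprises:
    recommending the target hardware specification to the user.
  6. The method according to claim 5, characterized in that the inferring, based on the historical run records and through a machine learning algorithm, the target hardware specification and the second target parameter corresponding to the current load characteristics and the target hardware specification comprises:
    inferring, based on the historical run records and through a machine learning algorithm, the target hardware specification corresponding to the current load characteristics; and
    determining the second target parameter based on the current load characteristics, the target hardware specification, and the historical data.
  7. The method according to any one of claims 1 to 6, characterized in that the method further comprises:
    monitoring real performance of the application in the live network environment; and
    when the real performance satisfies a trigger condition, performing the step of determining, based on the current load characteristics of the application in the live network environment and historical data of the application in the live network environment, a target parameter corresponding to the current load characteristics.
  8. The method according to any one of claims 1 to 7, characterized in that the method further comprises:
    verifying the target parameter; and
    when verification passes, configuring the target parameter in the live network environment.
  9. The method according to claim 8, characterized in that the verifying the target parameter comprises:
    determining a safety range constraint corresponding to the target parameter; and
    when the target parameter satisfies the safety range constraint, and the closeness between the target parameter and a parameter in offline verification records or the historical interaction records is greater than a preset value, determining that the target parameter passes verification.
  10. The method according to claim 8, characterized in that the application is deployed on multiple nodes of a cluster, and the verifying the target parameter comprises:
    configuring the target parameter on at least one of the multiple nodes;
    monitoring real performance of the application on the at least one node; and
    when the real performance of the application on the at least one node improves, determining that the target parameter passes verification.
  11. The method according to claim 8, characterized in that the application is deployed on a primary node and a standby node, and the verifying the target parameter comprises:
    configuring the target parameter on the standby node;
    monitoring real performance of the application on the standby node; and
    when the real performance of the application on the standby node improves, determining that the target parameter passes verification.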
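  12. A parameter management system, characterized in that the system comprises:
    a communication module, configured to obtain current load characteristics of an application in a live network environment;
    a parameter tuning module, configured to determine, based on the current load characteristics of the application in the live network environment and historical data of the application in the live network environment, a target parameter corresponding to the current load characteristics, wherein the historical data comprises historical interaction records or historical run records, a historical interaction record comprises first historical load characteristics and a parameter recommended based on the first historical load characteristics, and a historical run record comprises second historical load characteristics and historical run parameters; and
    a recommendation module, configured to recommend the target parameter to a user.
  13. The system according to claim 12, characterized in that the target parameter comprises a first target parameter under a current hardware specification;
    the communication module is further configured to:
    obtain the current hardware specification of the application in the live network environment; and
    the parameter tuning module is specifically configured to:
    search the historical interaction records based on the current load characteristics and the current hardware specification of the application in the live network environment to obtain the first target parameter; or
    infer, based on the historical run records and through a machine learning algorithm, the first target parameter corresponding to the current load characteristics and the current hardware specification.
  14. The system according to claim 13, characterized in that the parameter tuning module is specifically configured to:
    determine a performance simulator corresponding to the current hardware specification, wherein the performance simulator is trained on the historical run records; and
    drive the machine learning algorithm through the performance simulator to infer the first target parameter corresponding to the current load characteristics and the current hardware specification.
  15. The system according to claim 14, characterized in that the system further comprises:
    a training module, configured to: sample, using mixed Latin hypercube sampling (mixLHS), a sub-dataset of the historical run records that matches the current hardware specification, to obtain data samples; verify the data samples in an offline environment to obtain real performance of the data samples; and train, based on the data samples and the real performance, the performance simulator corresponding to the current hardware specification.
  16. The system according to any one of claims 12 to 15, characterized in that the target parameter comprises a second target parameter under a target hardware specification;
    the parameter tuning module is specifically configured to:
    infer, based on the historical run records and through a machine learning algorithm, the target hardware specification and the second target parameter corresponding to the current load characteristics and the target hardware specification; and
    the recommendation module is further configured to:
    recommend the target hardware specification to the user.
  17. The system according to claim 16, characterized in that the parameter tuning module is specifically configured to:
    infer, based on the historical run records and through a machine learning algorithm, the target hardware specification corresponding to the current load characteristics; and
    determine the second target parameter based on the current load characteristics, the target hardware specification, and the historical data.
  18. The system according to any one of claims 12 to 17, characterized in that the system further comprises:
    a monitoring module, configured to monitor real performance of the application in the live network environment; and
    the parameter tuning module is specifically configured to: when the real performance satisfies a trigger condition, perform the step of determining, based on the current load characteristics of the application in the live network environment and historical data of the application in the live network environment, a target parameter corresponding to the current load characteristics.
  19. The system according to any one of claims 12 to 18, characterized in that the system further comprises:
    a verification module, configured to verify the target parameter; and
    a configuration module, configured to configure the target parameter in the live network environment when verification passes.
  20. The system according to claim 19, characterized in that the verification module is specifically configured to:
    determine a safety range constraint corresponding to the target parameter; and
    when the target parameter satisfies the safety range constraint, and the closeness between the target parameter and a parameter in offline verification records or the historical interaction records is greater than a preset value, determine that the target parameter passes verification.
  21. The system according to claim 19, characterized in that the application is deployed on multiple nodes of a cluster, and the configuration module is further configured to:
    configure the target parameter on at least one of the multiple nodes;
    the system further comprises:
    a monitoring module, configured to monitor real performance of the application on the at least one node; and
    the verification module is specifically configured to:
    determine that the target parameter passes verification when the real performance of the application on the at least one node improves.
  22. The system according to claim 19, characterized in that the application is deployed on a primary node and a standby node, and the verification module is specifically configured to:
    configure the target parameter on the standby node;
    monitor real performance of the application on the standby node; and
    determine that the target parameter passes verification when the real performance of the application on the standby node improves.
  23. A computing device cluster, characterized in that the computing device cluster comprises at least one computing device, the at least one computing device comprises at least one processor and at least one memory, and the at least one memory stores computer-readable instructions; and the at least one processor executes the computer-readable instructions to cause the computing device cluster to perform the method according to any one of claims 1 to 11.
  24. A computer program product containing instructions, characterized in that, when the instructions are run by a computing device cluster, the computing device cluster is caused to perform the method according to any one of claims 1 to 11.
  25. A computer-readable storage medium, characterized by comprising computer program instructions, wherein when the computer program instructions are executed by a computing device cluster, the computing device cluster performs the method according to any one of claims 1 to 11.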
PCT/CN2023/081469 2022-08-17 2023-03-14 Parameter management system and related method WO2024036941A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210987574 2022-08-17
CN202210987574.7 2022-08-17
CN202211288871.9A CN117632673A (zh) 2022-08-17 2022-10-20 Parameter management system and related method
CN202211288871.9 2022-10-20

Publications (1)

Publication Number Publication Date
WO2024036941A1 true WO2024036941A1 (zh) 2024-02-22

Family

ID=89940516

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/081469 WO2024036941A1 (zh) 2022-08-17 2023-03-14 Parameter management system and related method

Country Status (1)

Country Link
WO (1) WO2024036941A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150288573A1 (en) * 2014-04-08 2015-10-08 International Business Machines Corporation Hyperparameter and network topology selection in network demand forecasting
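CN111176758A (zh) * 2019-12-31 2020-05-19 腾讯科技(深圳)有限公司 Configuration parameter recommendation method, apparatus, terminal, and storage medium
CN111381507A (zh) * 2018-12-29 2020-07-07 珠海格力电器股份有限公司 Method for recommending electrical appliance operating parameters, medium, server, and smart appliance management system
CN112448855A (zh) * 2021-01-28 2021-03-05 支付宝(杭州)信息技术有限公司 Blockchain system parameter update method and system
CN113343577A (zh) * 2021-06-23 2021-09-03 平安国际融资租赁有限公司 Parameter optimization method, apparatus, computer device, and readable storage medium
CN113438663A (zh) * 2020-03-04 2021-09-24 诺基亚通信公司 Machine-learning-based handover parameter optimization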

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23853858

Country of ref document: EP

Kind code of ref document: A1