CN116450280A - Container mixed expansion and contraction method and system based on machine learning - Google Patents

Info

Publication number
CN116450280A
Authority
CN
China
Prior art keywords
pod
data
quota
controller
usage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211554550.9A
Other languages
Chinese (zh)
Inventor
景宇
闫海娜
刘磊
杨帆
甄富
鞠娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Cloud Technology Co Ltd
Priority to CN202211554550.9A
Publication of CN116450280A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/301 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is a virtual computing platform, e.g. logically partitioned systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5072 Grid computing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45562 Creating, deleting, cloning virtual machine instances
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/4557 Distribution of virtual machine instances; Migration and load balancing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45591 Monitoring or debugging support
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a machine-learning-based hybrid container scaling (mixed expansion and contraction) method and system. In the method, the metrics server collects system resource usage data through kubelet, including the CPU, memory, GPU, and disk IO usage of each pod. The Prometheus Adapter collects custom metric types through Prometheus, including: requests per second, concurrency, PPS, delay, and interface response success rate. Each pod controller gathers the real-time data into an n-dimensional array whose elements are the n factors that influence resource usage, with the following content: [pod name, timestamp, CPU usage of pod, memory usage of pod, GPU usage of pod, disk IO usage of pod, requests per second of pod, concurrency of pod, PPS of pod, delay of pod, interface response success rate of pod]. The data are sorted by time and stored in a database. The CPU quota and the memory quota set in the pod controller are each recorded as the quota setting at the current scale.

Description

Container mixed expansion and contraction method and system based on machine learning
Technical Field
The application relates to the field of cloud computing, and in particular to a machine-learning-based hybrid container scaling (mixed expansion and contraction) method and system.
Background
In current cloud computing products, elastic scaling generally means automatically adjusting elastic computing resources once a threshold is triggered, according to application requirements and policies, so as to narrow the gap between actual demand and provisioned resources. In essence, elastic scaling balances resource supply against service load, and it plays an important role in the functionality and performance of a product.
Kubernetes offers two main container scaling strategies. The first is horizontal scaling, which automatically adjusts the number of replicas of a pod controller according to the real-time load of the containers. The second is vertical scaling, which automatically calculates or adjusts the CPU and memory quotas of the Pod template in the pod controller according to service load metrics.
Horizontal scaling mainly adjusts the replica count of the pod controller. Because the adjustment happens only after a threshold is triggered, it inevitably lags behind the load. In addition, every replica under the same pod controller has the same quota specification, which can make scaling imprecise and waste resources. Vertical scaling mainly adjusts the resource quota of the pod; its unsolved problem is that the pod must be restarted on every adjustment, which is unacceptable to the business side.
Disclosure of Invention
The embodiments of the present application provide a machine-learning-based hybrid container scaling method and system to address the technical problems described above.
A machine-learning-based hybrid container scaling method comprises the following steps:
the metrics server collects system resource usage data through kubelet, including the CPU, memory, GPU, and disk IO usage of each pod; the Prometheus Adapter collects custom metric types through Prometheus, including: requests per second, concurrency, PPS, delay, and interface response success rate; each pod controller gathers the real-time data into an n-dimensional array whose elements are the n factors that influence resource usage, with the following content: [pod name, timestamp, CPU usage of pod, memory usage of pod, GPU usage of pod, disk IO usage of pod, requests per second of pod, concurrency of pod, PPS of pod, delay of pod, interface response success rate of pod]; the data are sorted by time and stored in a database; based on the historical data, two columns are added to the data table, namely the CPU quota set in the pod controller and the memory quota set in the pod controller, which record the quota setting at the current scale.
In some embodiments, the method further comprises: adding a manual label column to the data table, where 0 means abnormal, 1 means normal, and 2 means undetermined; erroneous data are marked and removed, so that outliers are eliminated from the data.
In some embodiments, the method further comprises: loading the historical data, plotting it as a chart, and repairing distorted data to improve data quality; the repair combines historical data from the same period with adjacent data from the current day.
In some embodiments, the method further comprises: collecting the resource usage of the pod controller to be predicted and calculating, with a machine learning prediction algorithm, the expected quota of the pod controller for the next x hours; the expected quota is bounded by a minimum quota MinDesiredValue and a maximum quota MaxDesiredValue, where MaxDesiredValue varies with the current remaining cluster resources, so that a prediction that is too small or too large does not affect the operation of other components.
In some embodiments, the method further comprises: calculating the distance between the n-dimensional array x of the pod controller to be predicted and each n-dimensional array y in the historical data according to the Euclidean distance formula:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
where i = 1 to n, x_i is an element of array x, and y_i is an element of array y.
In some embodiments, the method further comprises:
defining the m samples closest to the sample to be predicted as its neighbors; training the value of m repeatedly until the m with the lowest error rate is found; determining the resource quota required by the sample to be predicted from the resource quota required by the majority of the neighbors; and outputting the basic expected quota DesiredBasicValue.
In some embodiments, the method further comprises: on the basis of the basic expected quota DesiredBasicValue, applying a time-dependent periodic variation coefficient s(t), a holiday and special event coefficient h(t), and a coefficient for other influencing factors ε(t) to calculate the real target quota: DesiredValue = DesiredBasicValue * s(t) * h(t) * ε(t), where t is time.
In some embodiments, the method further comprises: calculating the difference between the target metric and the actual metric, and determining a policy with the hybrid scaling controller.
In some embodiments, the method further comprises:
calculating the ratio of the current quota CurrentMetricValue to the real target quota DesiredValue: ratio = CurrentMetricValue / DesiredValue;
when ratio is approximately equal to 1, the quota is kept unchanged; otherwise, scaling out is performed when CurrentMetricValue < DesiredValue, and scaling in is performed when CurrentMetricValue > DesiredValue.
In some embodiments, the method further comprises calculating the quota to be adjusted:
AdjustValue = DesiredValue - CurrentMetricValue;
when AdjustValue < MinDesiredValue, AdjustValue = MinDesiredValue;
when AdjustValue > MaxDesiredValue, AdjustValue = MaxDesiredValue;
when MinDesiredValue < AdjustValue < MaxDesiredValue, if the remaining resources of the nodes in the cluster cannot schedule AdjustValue, an attempt is made to split AdjustValue into x quotas according to the remaining resources of the schedulable nodes in the cluster, the x quotas summing to AdjustValue;
when MinDesiredValue < AdjustValue < MaxDesiredValue, if the cluster can schedule the quota AdjustValue, the pod controller is checked for a replica set of the specification given by AdjustValue; if one exists, that replica set is scaled horizontally directly; if not, a replica set of the corresponding specification is created in the pod controller through an update, or a replica set is created for the corresponding controller; the pod controller is then notified of the event to be executed, the event comprising: the expected adjustment timestamp, the adjustment action, and the target amount of resources to adjust.
A machine-learning-based hybrid container scaling system comprises: an electronic device for performing the method described above.
The invention has the following advantages:
1. The pod controller is extended to support replica sets with multiple quotas.
2. The hybrid scaling controller allows different replica sets to undergo different scaling activities, ensures that some pods are not rebuilt while vertical scaling is carried out, and reduces jitter and fluctuation.
3. Future resource demand quotas are predicted, which reduces lag.
4. When the remaining quota of the schedulable cluster nodes is insufficient, an attempt is made to split the expected quota into x quotas according to the remaining resources of those nodes, so that the scale-out can complete.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a block diagram of hybrid scaling based on machine learning;
FIG. 2 is pod controller block diagram 1;
FIG. 3 is pod controller block diagram 2;
FIG. 4 is pod controller block diagram 3;
FIG. 5 is a flow chart of the machine learning prediction algorithm;
FIG. 6 is a flow chart of the hybrid scaling controller.
Detailed Description
To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art from the embodiments herein without creative effort fall within the scope of protection of the present application.
In this application, the terms "mounted," "connected," "secured," and the like are to be construed broadly unless otherwise specifically indicated or defined. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, indirect, or internal; and it may be surface contact only or surface contact via an intermediate medium. Those of ordinary skill in the art will understand the specific meaning of these terms in this application according to the circumstances.
Furthermore, the terms "first," "second," and the like are used only to distinguish between descriptions and are not to be understood as denoting a specific or particular structure. The terms "some embodiments," "other embodiments," and the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this application, these terms do not necessarily refer to the same embodiment or example. The particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art may combine the various embodiments or examples described herein, and their features, provided they do not conflict.
The embodiments of the present application provide a machine-learning-based hybrid container scaling method and system to address the technical problems described above.
A machine-learning-based hybrid container scaling method comprises the following steps:
the metrics server collects system resource usage data through kubelet, including the CPU, memory, GPU, and disk IO usage of each pod; the Prometheus Adapter collects custom metric types through Prometheus, including: requests per second, concurrency, PPS, delay, and interface response success rate; each pod controller gathers the real-time data into an n-dimensional array whose elements are the n factors that influence resource usage, with the following content: [pod name, timestamp, CPU usage of pod, memory usage of pod, GPU usage of pod, disk IO usage of pod, requests per second of pod, concurrency of pod, PPS of pod, delay of pod, interface response success rate of pod]; the data are sorted by time and stored in a database; based on the historical data, two columns are added to the data table, namely the CPU quota set in the pod controller and the memory quota set in the pod controller, which record the quota setting at the current scale.
In some embodiments, the method further comprises: adding a manual label column to the data table, where 0 means abnormal, 1 means normal, and 2 means undetermined; erroneous data are marked and removed, so that outliers are eliminated from the data.
In some embodiments, the method further comprises: loading the historical data, plotting it as a chart, and repairing distorted data to improve data quality; the repair combines historical data from the same period with adjacent data from the current day.
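The repair step can be illustrated with a minimal Python sketch. The text only says that historical same-period data and same-day adjacent data are combined; the weighted average used here, the function name `repair_value`, and the `history_weight` parameter are assumptions made for illustration, not the method fixed by the application.

```python
from statistics import mean
from typing import Optional, Sequence


def repair_value(history_same_slot: Sequence[float],
                 same_day_neighbors: Sequence[float],
                 history_weight: float = 0.5) -> Optional[float]:
    """Estimate a replacement for one distorted sample.

    history_same_slot:  values observed at the same time slot on earlier days.
    same_day_neighbors: values observed shortly before/after on the current day.
    history_weight:     hypothetical blending weight (assumption, not from the source).
    """
    if not history_same_slot and not same_day_neighbors:
        return None  # nothing to repair from
    if not history_same_slot:
        return mean(same_day_neighbors)
    if not same_day_neighbors:
        return mean(history_same_slot)
    # Blend the two estimates; equal weights by default.
    return (history_weight * mean(history_same_slot)
            + (1.0 - history_weight) * mean(same_day_neighbors))
```

For example, `repair_value([0.4, 0.6], [0.7, 0.9])` blends a historical mean of 0.5 with a same-day mean of 0.8.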
In some embodiments, the method further comprises: collecting the resource usage of the pod controller to be predicted and calculating, with a machine learning prediction algorithm, the expected quota of the pod controller for the next x hours; the expected quota is bounded by a minimum quota MinDesiredValue and a maximum quota MaxDesiredValue, where MaxDesiredValue varies with the current remaining cluster resources, so that a prediction that is too small or too large does not affect the operation of other components.
In some embodiments, the method further comprises: calculating the distance between the n-dimensional array x of the pod controller to be predicted and each n-dimensional array y in the historical data according to the Euclidean distance formula, where n denotes the dimension of the space:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
where i = 1 to n, x_i is an element of array x, and y_i is an element of array y.
In some embodiments, the method further comprises:
defining the m samples closest to the sample to be predicted as its neighbors; the value of m is trained repeatedly until the m with the lowest error rate is found; the resource quota required by the sample to be predicted is determined from the resource quota required by the majority of the neighbors; and the basic expected quota DesiredBasicValue is output.
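The neighbor-based prediction described above corresponds to a k-nearest-neighbor majority vote. The following Python sketch shows one way it might look. It assumes that only the numeric elements of the n-dimensional array are used as features, that quotas are represented as discrete labels such as "1c2g", and that the error rate for choosing m is measured by leave-one-out validation; all three are illustrative assumptions, as are the function names.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

# (numeric feature vector, quota label); the label format is hypothetical.
Sample = Tuple[Sequence[float], str]


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def predict_quota(query: Sequence[float], history: List[Sample], m: int) -> str:
    """Return the quota required by the majority of the m nearest samples."""
    neighbors = sorted(history, key=lambda s: euclidean(query, s[0]))[:m]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def choose_m(history: List[Sample], candidates: Sequence[int]) -> int:
    """Pick the candidate m with the lowest leave-one-out error rate."""
    best_m, best_err = candidates[0], float("inf")
    for m in candidates:
        errors = 0
        for i, (features, label) in enumerate(history):
            rest = history[:i] + history[i + 1:]  # hold out sample i
            if predict_quota(features, rest, m) != label:
                errors += 1
        err = errors / len(history)
        if err < best_err:  # ties keep the earlier (smaller) candidate
            best_m, best_err = m, err
    return best_m


history = [
    ([0.20, 0.30], "1c2g"), ([0.25, 0.35], "1c2g"), ([0.30, 0.30], "1c2g"),
    ([0.80, 0.90], "4c8g"), ([0.85, 0.80], "4c8g"), ([0.90, 0.90], "4c8g"),
]
desired_basic_value = predict_quota([0.82, 0.85], history, m=3)
```

In practice the features would likely need to be scaled to comparable ranges first, since Euclidean distance is dominated by whichever dimension has the largest magnitude.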
In some embodiments, the method further comprises: on the basis of the basic expected quota DesiredBasicValue, applying a time-dependent periodic variation coefficient s(t), a holiday and special event coefficient h(t), and a coefficient for other influencing factors ε(t) to calculate the real target quota: DesiredValue = DesiredBasicValue * s(t) * h(t) * ε(t), where t is time.
In some embodiments, the method further comprises: calculating the difference between the target metric and the actual metric, and determining a policy with the hybrid scaling controller.
In some embodiments, the method further comprises:
calculating the ratio of the current quota CurrentMetricValue to the real target quota DesiredValue: ratio = CurrentMetricValue / DesiredValue. When ratio is approximately equal to 1, the quota is kept unchanged; otherwise, scaling out is performed when CurrentMetricValue < DesiredValue, and scaling in is performed when CurrentMetricValue > DesiredValue.
In some embodiments, the method further comprises calculating the quota to be adjusted: AdjustValue = DesiredValue - CurrentMetricValue.
When AdjustValue < MinDesiredValue, AdjustValue = MinDesiredValue.
When AdjustValue > MaxDesiredValue, AdjustValue = MaxDesiredValue.
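The ratio test and the bounded adjustment can be sketched in a few lines of Python. The text does not say how close to 1 counts as "approximately equal", so the `tolerance` parameter and its default of 0.1 are assumptions, as are the function names. The clamp follows the text literally, which means a negative difference is raised to MinDesiredValue.

```python
def decide_action(current: float, desired: float, tolerance: float = 0.1) -> str:
    """Classify the scaling direction from ratio = current / desired.

    Assumes desired > 0. `tolerance` is a hypothetical dead band around 1.
    """
    ratio = current / desired
    if abs(ratio - 1.0) <= tolerance:
        return "keep"
    return "scale_out" if current < desired else "scale_in"


def adjust_value(current: float, desired: float,
                 min_desired: float, max_desired: float) -> float:
    """AdjustValue = DesiredValue - CurrentMetricValue, clamped to the bounds."""
    adjust = desired - current
    return max(min_desired, min(adjust, max_desired))
```

With a current quota of 2.0 and a target of 4.0, `decide_action` reports a scale-out, and `adjust_value(2.0, 4.0, 0.5, 1.5)` caps the step at the upper bound of 1.5.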
In some embodiments, when MinDesiredValue < AdjustValue < MaxDesiredValue, if the remaining resources of the nodes in the cluster cannot schedule AdjustValue, an attempt is made to split AdjustValue into x quotas according to the remaining resources of the schedulable nodes in the cluster, the x quotas summing to AdjustValue.
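One possible reading of the split is a greedy allocation over the free capacity of the schedulable nodes. The Python sketch below assumes a single resource dimension (for example CPU cores) and fills the largest free slots first; the text does not specify the splitting rule, so both choices and the name `split_quota` are illustrative assumptions.

```python
from typing import List, Optional


def split_quota(adjust_value: float, node_free: List[float]) -> Optional[List[float]]:
    """Split adjust_value into parts that each fit on one schedulable node.

    node_free holds the remaining schedulable amount per node.
    Returns the list of parts (summing to adjust_value), or None when the
    cluster as a whole does not have enough free resources.
    """
    eps = 1e-9
    remaining = adjust_value
    parts: List[float] = []
    # Greedy assumption: use the nodes with the most free resources first.
    for free in sorted(node_free, reverse=True):
        if remaining <= eps:
            break
        take = min(free, remaining)
        if take > 0:
            parts.append(take)
            remaining -= take
    return parts if remaining <= eps else None
```

For example, a request of 4.0 against nodes with 1.0, 2.5, and 1.5 free yields the two parts 2.5 and 1.5, while a request of 6.0 cannot be satisfied and returns `None`.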
In some embodiments, the method further comprises: when MinDesiredValue < AdjustValue < MaxDesiredValue, if the cluster can schedule the quota AdjustValue, the pod controller is checked for a replica set of the specification given by AdjustValue; if one exists, that replica set is scaled horizontally directly; if not, a replica set of the corresponding specification is created in the pod controller through an update, or a replica set is created for the corresponding controller; the pod controller is then notified of the event to be executed, the event comprising: the expected adjustment timestamp, the adjustment action, and the target amount of resources to adjust.
The invention also discloses a machine-learning-based hybrid container scaling system, which comprises: an electronic device for performing the method described above.
This design builds on the Kubernetes horizontal and vertical scaling strategies and provides a machine-learning-based prediction algorithm for hybrid scaling, with the goal of timely and accurate adjustment. The pod controller supports setting multiple quotas; each quota is controlled by one replica set, and all pods controlled by one replica set have the same quota. Different adjustment policies are supported for the different replica sets under one pod controller, so that the adjusted resources match demand precisely, cluster resource utilization improves, the number of scaling operations falls, and the advantages and disadvantages of horizontal and vertical scaling are balanced against each other. The main method is as follows: the future resource quota of the pod controller is predicted by a machine learning algorithm; the hybrid scaling controller decides a scaling policy according to the future resource quota; replica sets with multiple quotas are allowed to coexist in one pod controller, and precise scaling is carried out according to the policy.
The implementation steps are as follows:
1. Build a historical database: collect and persist the feature data, which serves as the input to the machine learning prediction algorithm in step 2. System resource usage data are collected in step 1.1, and custom metric data are collected in step 1.2. The machine learning prediction algorithm needs to acquire and store data periodically; the data are exposed through the metrics API.
1.1 The metrics server collects system resource usage data through kubelet, including the CPU, memory, GPU, and disk IO usage of each pod.
1.2 The Prometheus Adapter collects custom metric types through Prometheus, including: requests per second, concurrency, PPS, delay, and interface response success rate.
1.3 Each pod controller gathers the real-time data into an n-dimensional array whose elements are the n factors that influence resource usage, with the following content: [pod name, timestamp, CPU usage of pod, memory usage of pod, GPU usage of pod, disk IO usage of pod, requests per second of pod, concurrency of pod, PPS of pod, delay of pod, interface response success rate of pod]. The data are sorted by time and stored in a database.
1.5 Based on the historical data, two columns are added to the data table, namely the CPU quota set in the pod controller and the memory quota set in the pod controller, which record the quota setting at the current scale.
1.6 A manual label column is added to the data table, where 0 means abnormal, 1 means normal, and 2 means undetermined; erroneous data are marked and removed, so that outliers are eliminated from the data.
1.7 The historical data are loaded and plotted as a chart, and distorted data are repaired to improve data quality; the repair combines historical data from the same period with adjacent data from the current day.
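The table built in step 1 can be pictured as one record per pod per timestamp. The Python sketch below models such a row with the fields listed in step 1.3, the two quota columns from step 1.5, and the manual label from step 1.6. The class name, field names, units, and the choice to train only on rows labelled 1 are assumptions for illustration; the text does not say how rows labelled 2 (undetermined) are treated.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PodSample:
    pod_name: str
    timestamp: float            # seconds since epoch (assumed unit)
    cpu_usage: float
    memory_usage: float
    gpu_usage: float
    disk_io_usage: float
    requests_per_second: float
    concurrency: float
    pps: float
    delay: float
    success_rate: float
    cpu_quota: float            # step 1.5: CPU quota set in the pod controller
    memory_quota: float         # step 1.5: memory quota set in the pod controller
    label: int = 1              # step 1.6: 0 abnormal, 1 normal, 2 undetermined

    def features(self) -> List[float]:
        """Numeric factors only; pod name and timestamp are identifiers."""
        return [self.cpu_usage, self.memory_usage, self.gpu_usage,
                self.disk_io_usage, self.requests_per_second,
                self.concurrency, self.pps, self.delay, self.success_rate]


def training_rows(samples: List[PodSample],
                  keep_labels: Tuple[int, ...] = (1,)) -> List[PodSample]:
    """Drop rows whose label is not kept, then sort the rest by time."""
    usable = [s for s in samples if s.label in keep_labels]
    return sorted(usable, key=lambda s: s.timestamp)
```

Passing `keep_labels=(1, 2)` would keep undetermined rows as well and only drop those marked abnormal.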
2. The method comprises the steps of collecting the resource usage of the pod controller to be predicted, calculating expected quota of the pod controller in the future x hours by a machine learning prediction algorithm, limiting the minimum quota MinDesiredValue and the maximum quota MaxDesiredValue, maxDesiredValue of the expected quota according to the current cluster resource remaining amount change, and preventing the predicted data from being too small or too large to influence the operation of other components.
2.1, collecting the distance between the n-dimensional array x of the predicted pod controller and the n-dimensional array y in the historical data, and calculating the distance between the n-dimensional arrays of the historical data according to the Euclidean distance calculation method, wherein n represents the space dimension.
2.2, according to the calculated distance in 2.1, defining m samples with the nearest distance to the sample to be predicted as the neighbors of the sample to be predicted; the m value needs to be trained continuously until the m value with the minimum error rate is taken. Determining the resource quota required by the sample to be classified according to the resource quota required by most samples in the neighbor; outputting a desired quota DesiredBassicValue;
2.4 Starting from the basic expected quota DesiredBasicValue, apply the time-dependent periodic variation coefficient s(t), the holiday and special-event coefficient h(t), and the coefficient for other influencing factors ε(t) to calculate the real target quota:
DesiredValue = DesiredBasicValue * s(t) * h(t) * ε(t).
s(t) is the periodic variation factor driven by time. The same pod controller requests different amounts of resources in different time periods, and resource usage is clearly periodic, so the coefficient can be adjusted by month, week, or day: it is raised at each day's peak and lowered at each day's trough; if resource demand is high over a whole month, the coefficient is raised for that month, and otherwise it is lowered. The initialization in the calculation mode is
The larger this value, the more pronounced the seasonal effect; the smaller this value, the less pronounced the seasonal effect.
h(t) is the fluctuation in the resource forecast caused by holidays or special events, where di is a period of time before and after the i-th holiday and ki is the influence range of the holiday. For example, if the business running on the pod controller expects e-commerce promotions such as Double Eleven, resource usage will fluctuate.
ε(t) is the noise term: the resource forecast is also affected by other uncontrollable factors, such as partial machine failures and policy changes;
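The multiplicative adjustment can be illustrated with a short Python sketch. It assumes the three coefficients have already been evaluated for the time t of interest, because the source does not reproduce how s(t) is initialized; the function name and the example numbers are hypothetical. The bounds follow the statement in step 2 that the expected quota is limited by MinDesiredValue and MaxDesiredValue.

```python
def target_quota(basic_quota, s_t, h_t, eps_t, min_desired, max_desired):
    """DesiredValue = DesiredBasicValue * s(t) * h(t) * eps(t), kept within bounds.

    s_t, h_t and eps_t are coefficient values already evaluated at time t
    (assumption: how they are computed is outside this sketch).
    """
    if min_desired > max_desired:
        raise ValueError("min_desired must not exceed max_desired")
    desired = basic_quota * s_t * h_t * eps_t
    # Keep the prediction from becoming too small or too large.
    return max(min_desired, min(desired, max_desired))


# Hypothetical example: daily peak (s=1.2) during a promotion (h=1.5),
# no extra noise (eps=1.0), quotas expressed in CPU millicores.
peak_quota = target_quota(1000, 1.2, 1.5, 1.0, min_desired=200, max_desired=1600)
```

With these hypothetical inputs the unbounded product is about 1800, so the result is capped at the 1600 upper bound.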
3. Using the target quota DesiredValue given in step 2, calculate the difference between the target index and the actual index, and let the hybrid expansion and contraction controller determine a strategy.
3.1 Calculate the ratio of the current quota to the target quota:
ratio = CurrentMetricValue / DesiredValue
When ratio is approximately 1, the quota is kept unchanged; otherwise, capacity is expanded when CurrentMetricValue < DesiredValue and contracted when CurrentMetricValue > DesiredValue;
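The decision in 3.1 can be sketched in a few lines of Python. The source only says the ratio is "approximately 1", so the 10% tolerance used here is an assumption, as are the function name and the returned action strings.

```python
def scaling_action(current_metric_value, desired_value, tolerance=0.1):
    """Decide between keeping, expanding, or contracting capacity.

    tolerance defines what "approximately 1" means (assumed value).
    """
    if desired_value <= 0:
        raise ValueError("desired_value must be positive")
    ratio = current_metric_value / desired_value
    if abs(ratio - 1.0) <= tolerance:
        return "keep"
    return "expand" if current_metric_value < desired_value else "contract"
```

A dead band around 1 is what keeps small prediction noise from triggering a scaling action on every evaluation.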
3.2 Calculate the quota to be adjusted:
AdjustValue = DesiredValue - CurrentMetricValue;
3.2.1 When AdjustValue < MinDesiredValue, set AdjustValue = MinDesiredValue;
3.2.2 When AdjustValue > MaxDesiredValue, set AdjustValue = MaxDesiredValue;
3.2.3 When MinDesiredValue < AdjustValue < MaxDesiredValue, if the remaining resources of the nodes in the cluster cannot schedule AdjustValue, attempt to divide AdjustValue into x quotas according to the remaining resources of the schedulable nodes in the cluster, with the x quotas summing to AdjustValue; then go to 3.2.5;
3.2.4 When MinDesiredValue < AdjustValue < MaxDesiredValue, if the cluster can successfully schedule the quota AdjustValue, go to 3.2.5;
3.2.5 Check whether a copy set of the specification given by AdjustValue exists in the pod controller. If it does, scale that copy set horizontally. If it does not, update the pod controller to create a copy set of the corresponding specification, or create a copy set for the corresponding controller. Then notify the pod controller of the event to be performed, which comprises: the expected adjustment timestamp, the adjustment action, and the target amount of resources to adjust.
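Steps 3.2 through 3.2.4 can be sketched in Python as a clamp followed by an optional split. This is an illustration under stated assumptions: quotas are integers (for example CPU millicores), the split is a greedy fill of the nodes with the most free resources first, and `None` signals that the cluster cannot hold the quota at all. The source does not specify the splitting rule, and it does not say how a negative AdjustValue (contraction) interacts with the bounds, so the sketch applies the stated rules literally.

```python
def clamp_adjust_value(desired_value, current_metric_value, min_desired, max_desired):
    """AdjustValue = DesiredValue - CurrentMetricValue, limited to the given bounds."""
    adjust = desired_value - current_metric_value
    return max(min_desired, min(adjust, max_desired))


def split_quota(adjust_value, node_free):
    """Split adjust_value across nodes when no single node can schedule it.

    node_free lists the free resources of each schedulable node.
    Returns a list of quotas summing to adjust_value, or None when the
    cluster as a whole lacks the resources (greedy rule is an assumption).
    """
    if adjust_value <= 0:
        raise ValueError("adjust_value must be positive")
    if any(free >= adjust_value for free in node_free):
        return [adjust_value]  # one node can schedule it: no split needed
    parts, remaining = [], adjust_value
    for free in sorted(node_free, reverse=True):
        if remaining <= 0:
            break
        if free <= 0:
            continue
        part = min(free, remaining)
        parts.append(part)
        remaining -= part
    return parts if remaining <= 0 else None
```

For example, a quota of 1500 with node free resources of 800, 600, and 400 is split into 800, 600, and 100 in this sketch.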
4. After the pod controller receives the change event request from the hybrid expansion and contraction controller, it performs hybrid expansion and contraction on the copy sets under it according to the timestamp of the event; this strategy improves resource utilization. The required quota is added in one step as far as possible, which avoids waste and avoids adding too many copies.
4.1 When a copy set of the specification given by AdjustValue exists in the pod controller, it is scaled horizontally, expanding or contracting as shown in FIG. 2.
FIG. 2: pod controller block diagram 1
4.2 When no copy set of the specification given by AdjustValue exists in the pod controller, a copy set of the corresponding specification is created, as shown in FIG. 3.
FIG. 3: pod controller block diagram 2
4.3 When no copy set of the specification given by AdjustValue exists in the pod controller and the remaining resources of a single node in the cluster cannot schedule AdjustValue, an attempt is made to divide AdjustValue into x quotas according to the remaining resources of the schedulable nodes in the cluster, as shown in FIG. 4.
FIG. 4: pod controller block diagram 3
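The three cases above can be modeled with a small Python sketch in which a pod controller holds several copy sets (replica sets in Kubernetes terms), one per quota specification. The class and method names are hypothetical, and the model tracks only replica counts; it does not talk to a real cluster.

```python
from dataclasses import dataclass, field


@dataclass
class PodController:
    # Maps a per-pod quota specification (e.g. CPU millicores) to its replica count.
    copy_sets: dict = field(default_factory=dict)

    def scale(self, quota, delta):
        """Change the replica count of the copy set with the given specification.

        If the specification already exists, this is a horizontal scaling of
        that copy set (case 4.1); if it does not, a new copy set is created
        (cases 4.2 and 4.3). A copy set that reaches zero replicas is removed.
        """
        count = self.copy_sets.get(quota, 0) + delta
        if count < 0:
            raise ValueError("cannot remove more replicas than exist")
        if count == 0:
            self.copy_sets.pop(quota, None)
        else:
            self.copy_sets[quota] = count


controller = PodController({500: 3})
controller.scale(500, 1)            # 4.1: specification exists, scale horizontally
for part in (800, 600, 100):        # 4.3: a split quota creates one copy set per part
    controller.scale(part, 1)
```

Keeping one copy set per specification is what lets a single scaling step add exactly the quota that is missing instead of a whole extra replica of the largest size.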
5. Evaluate the result of the prediction model: compare the predicted expected resource DesiredValue with the actual resource demand and, starting again from step 2, adjust the related parameters n, x, s(t), h(t), ε(t), and so on, so that the result becomes more accurate.
FIG. 5: machine learning prediction algorithm flow chart
The steps are as follows:
Step1: Build the historical database, then supplement, screen, clean, and map the data.
Step2: Collect the resource usage of the pod controller to be predicted.
Step3: Using the Euclidean distance formula, calculate the distance between the n-dimensional array of the pod controller to be predicted and the n-dimensional arrays in the historical data.
Step4: Define the m samples closest to the sample to be predicted as its neighbors, where the value of m is tuned through repeated training; determine the resource quota required by the sample to be predicted from the quota required by the majority of its neighbors, and output the basic expected quota DesiredBasicValue.
Step5: Apply the time-dependent periodic variation coefficient s(t), the holiday and special-event coefficient h(t), and the coefficient for other influencing factors ε(t), and calculate the real target quota.
Step6: Judge whether the target quota is reasonable; if it is not, adjust the relevant coefficients and repeat the steps above.
Hybrid expansion and contraction controller flow
FIG. 6 is a flow chart of the hybrid expansion and contraction controller.
The steps are as follows:
Step1: Calculate the ratio of the current quota to the target quota and judge whether it is approximately equal to 1. If it is, keep the quota unchanged; otherwise, expansion or contraction is needed.
Step2: Calculate the quota to be adjusted, AdjustValue.
Step3: Judge whether the remaining resources of the cluster nodes can schedule AdjustValue. If they cannot, go to Step4; if they can, go to Step5.
Step4: Attempt to divide AdjustValue into x quotas according to the remaining resources of the schedulable nodes in the cluster, then return to Step2 to recalculate the quota.
Step5: Check whether a copy set of the specification given by AdjustValue exists in the pod controller. If it does, scale that copy set horizontally. If it does not, update the pod controller to create a copy set of the corresponding specification, or create a copy set for the corresponding controller.
Step6: Notify the pod controller of the event to be performed, which comprises: the expected adjustment timestamp, the adjustment action, and the target amount of resources to adjust.
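The event passed to the pod controller in Step6 carries three fields. A minimal Python sketch of such an event, and of how a pod controller might select the events whose timestamp has arrived, is given below. The class, field, and function names are hypothetical; the source does not describe the event encoding or how the pod controller polls for due events.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScalingEvent:
    adjust_at: datetime   # expected adjustment timestamp
    action: str           # adjustment action, e.g. "expand" or "contract"
    amount: int           # target amount of resources to adjust


def due_events(events, now):
    """Return the events whose timestamp has arrived, oldest first."""
    return sorted((e for e in events if e.adjust_at <= now), key=lambda e: e.adjust_at)


now = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
pending = [
    ScalingEvent(now + timedelta(hours=1), "expand", 500),
    ScalingEvent(now - timedelta(minutes=5), "contract", 200),
]
ready = due_events(pending, now)  # only the "contract" event is due
```

Because the prediction looks x hours ahead, carrying the timestamp in the event lets the adjustment be applied before the demand change rather than after it.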
Compared with the prior art, the present application has the following advantages and effects:
1. Copy sets with multiple quotas are supported under one pod controller, so that, while pod requirements are met, resources are used efficiently through differences in pod specification.
2. Horizontal and vertical expansion and contraction are combined, so the total quota after expansion or contraction is controlled accurately and jitter and fluctuation are reduced; vertical expansion and contraction can be achieved while ensuring that some of the pods are not rebuilt.
3. The future resource demand quota is predicted with a machine learning algorithm, which reduces the lag of scaling decisions.
See the drawings in the technical scheme section.
The figure is an overall structure diagram of hybrid expansion and contraction based on a machine learning prediction algorithm. On top of the existing horizontal and vertical expansion and contraction of Kubernetes, a hybrid expansion and contraction controller is added, together with a machine learning prediction algorithm that predicts the quota the pod controller will need in the future and thereby reduces lag. Replica-count configuration for different quotas is added to the pod controller: several copy sets sit under one pod controller, and each copy set manages one quota, which increases accuracy. The components in the figure function as follows:
The monitoring components currently supported by k8s are used to query resource usage, including core metrics and custom metrics of resource usage. Core metrics are mainly obtained from components such as kubelet and are provided to the machine learning prediction algorithm by metrics-server. Custom metrics are the metrics collected by Prometheus, obtained mainly through the API provided by Prometheus Adapter; the collected data serve as input to the machine learning prediction algorithm.
The machine learning prediction algorithm periodically collects monitoring data through metrics-api and stores it in the database. It cleans, analyzes, and processes the historical data, predicts the resource target quota of the pod controller for a future period, and passes that value to the hybrid expansion and contraction controller. Its main concern is the strategy for combining the monitoring data into an accurate prediction.
The Informer in this design reuses the original k8s Informer design: it relies on the List & Watch mechanism and locally maintains a cache of the API objects of interest. It obtains state changes of these objects in time, updates the local cache, and places the data in the cache after some processing. The collected data also serve as input to the hybrid expansion and contraction controller and the pod controller.
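The List & Watch pattern described above can be illustrated with a small in-memory Python sketch. It does not use any Kubernetes client library; the class name and the dictionary-based objects are hypothetical, and only the event type names `ADDED`, `MODIFIED`, and `DELETED` mirror those of Kubernetes watch events.

```python
class LocalCache:
    """Local cache of API objects, seeded by a list call and kept fresh by watch events."""

    def __init__(self, listed_objects):
        # List step: take a full snapshot of the objects of interest.
        self._store = {obj["name"]: obj for obj in listed_objects}

    def handle(self, event_type, obj):
        # Watch step: apply one incremental change to the snapshot.
        if event_type in ("ADDED", "MODIFIED"):
            self._store[obj["name"]] = obj
        elif event_type == "DELETED":
            self._store.pop(obj["name"], None)
        else:
            raise ValueError(f"unknown event type: {event_type}")

    def get(self, name):
        return self._store.get(name)


cache = LocalCache([{"name": "web", "replicas": 3}])
cache.handle("MODIFIED", {"name": "web", "replicas": 4})
cache.handle("ADDED", {"name": "api", "replicas": 1})
```

Controllers that read from such a cache avoid querying the API server on every reconciliation, which is the reason both controllers in this design take their input from the Informer.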
The Cache stores the information exchanged between the hybrid expansion and contraction controller and the pod controller; intermediate scheduling results are held in it temporarily.
The hybrid expansion and contraction controller obtains the current hybrid expansion and contraction object through the Informer, obtains the data predicted by machine learning and the state of the current object, calculates the difference between the target quota and the current quota, and uses an algorithm to determine which expansion and contraction strategy the pod controller should adopt; the pod controller watches for this strategy.
The pod controller lists and watches resource creation, update, and deletion events through the Informer. Depending on which hybrid expansion and contraction action is required, it adjusts itself according to the strategy and issues the operation to the copy sets. The pod controller supports more than one quota, and each quota is controlled by one copy set: if a copy set of the quota to be expanded or contracted exists in the pod controller, that copy set is scaled horizontally; if no such copy set exists, a new copy set is created.
In some embodiments, the machine learning-based container hybrid expansion and contraction method is applied to a machine learning-based container hybrid expansion and contraction system, which comprises a monitoring module, a machine learning prediction module, a hybrid expansion and contraction controller, a pod controller, a cache, and an Informer module.
The monitoring module is used for querying resource usage. Optionally, it uses the monitoring components currently supported by k8s, and the resource usage includes core metrics and custom metrics. Core metrics are mainly obtained from components such as kubelet and are provided to the machine learning prediction algorithm by metrics-server. Custom metrics are the metrics collected by Prometheus, obtained mainly through the API provided by Prometheus Adapter; the collected data serve as input to the machine learning prediction algorithm.
The machine learning prediction module is used for predicting the quota the pod controller will need in the future. Optionally, it periodically collects monitoring data through metrics-api and stores it in a database; it cleans, analyzes, and processes the historical data, predicts the resource target quota of the pod controller for a future period, and passes that value to the hybrid expansion and contraction controller. Its main concern is the strategy for combining the monitoring data into an accurate prediction.
The hybrid expansion and contraction controller is used for calculating the difference between the target quota and the current quota. Optionally, it obtains the current hybrid expansion and contraction object through the Informer, obtains the data predicted by the machine learning module and the state of the current object, and uses an algorithm to determine which expansion and contraction strategy the pod controller should adopt; the pod controller watches for this strategy.
The pod controller is used for listing and watching resource creation, update, and deletion events through the Informer. Optionally, depending on which hybrid expansion and contraction action is required, as obtained from the Informer, it adjusts itself according to the strategy and issues the operation to the copy sets. The pod controller supports more than one quota, and each quota is controlled by one copy set: if a copy set of the quota to be expanded or contracted exists in the pod controller, that copy set is scaled horizontally; if no such copy set exists, a new copy set is created.
The cache stores the information exchanged between the hybrid expansion and contraction controller and the pod controller; intermediate scheduling results are held in it temporarily.
The Informer module is used for obtaining state changes of the objects, updating the local cache, and placing the data in the cache after some processing. The collected data also serve as input to the hybrid expansion and contraction controller and the pod controller.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present application, and are intended to be included within the scope of the present application.

Claims (10)

1. A machine learning-based container hybrid expansion and contraction method, characterized by comprising the following steps:
the metrics server collects system resource usage data through kubelet, the data comprising the CPU, memory, GPU, and disk IO usage of each pod;
Prometheus Adapter collects custom data types through Prometheus, including: requests per second, concurrency, PPS, delay, and interface response success rate;
each pod controller collects real-time data to form an n-dimensional array whose entries are the n factors influencing resource usage, with the following content: [pod name, timestamp, CPU usage of pod, memory usage of pod, GPU usage of pod, disk IO usage of pod, requests per second of pod, concurrency of pod, PPS of pod, delay of pod, interface response success rate of pod]; the data are sorted by time and stored in a database;
based on the historical data, two columns are added to the data table, namely the CPU quota set in the pod controller and the memory quota set in the pod controller, recording the quota setting at the current scale.
2. The method as recited in claim 1, further comprising:
adding a manual label column to the data table, in which 0 denotes abnormal, 1 denotes normal, and 2 denotes undeterminable; erroneous data are removed by labeling, thereby eliminating outliers from the data.
3. The method as recited in claim 1, further comprising:
loading the historical data, plotting it, and repairing distorted data to improve data quality; combining data from the same period in history to repair the data adjacent to the current day;
collecting the resource usage of the pod controller to be predicted, and calculating the expected quota of the pod controller for the next x hours with a machine learning prediction algorithm, the expected quota being bounded by a minimum quota MinDesiredValue and a maximum quota MaxDesiredValue, both of which change with the resources currently remaining in the cluster, so that a prediction that is too small or too large does not affect the operation of other components.
4. The method as recited in claim 3, further comprising:
taking the n-dimensional array x collected for the pod controller to be predicted and an n-dimensional array y in the historical data, and calculating the distance between them with the Euclidean distance formula:
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where i = 1 to n, $x_i$ is an element of array x, and $y_i$ is an element of array y.
5. The method as recited in claim 4, further comprising:
defining the m samples closest to the sample to be predicted as its neighbors; tuning the value of m through repeated training until the value with the minimum error rate is obtained; determining the resource quota required by the sample to be predicted from the quota required by the majority of its neighbors; and outputting the basic expected quota DesiredBasicValue.
6. The method as recited in claim 5, further comprising: starting from the basic expected quota DesiredBasicValue, applying the time-dependent periodic variation coefficient s(t), the holiday and special-event coefficient h(t), and the coefficient for other influencing factors ε(t), and calculating the real target quota: DesiredValue = DesiredBasicValue * s(t) * h(t) * ε(t), where t is time.
7. The method as recited in claim 6, further comprising: calculating the difference between the target index and the actual index, and determining a strategy with the hybrid expansion and contraction controller.
8. The method as recited in claim 7, further comprising:
calculating the ratio of the current quota CurrentMetricValue to the real target quota DesiredValue: ratio = CurrentMetricValue / DesiredValue;
keeping the quota unchanged when ratio is approximately 1; otherwise expanding capacity when CurrentMetricValue < DesiredValue and contracting capacity when CurrentMetricValue > DesiredValue.
9. The method as recited in claim 8, further comprising calculating the quota to be adjusted: AdjustValue = DesiredValue - CurrentMetricValue;
when AdjustValue < MinDesiredValue, setting AdjustValue = MinDesiredValue;
when AdjustValue > MaxDesiredValue, setting AdjustValue = MaxDesiredValue;
when MinDesiredValue < AdjustValue < MaxDesiredValue, if the remaining resources of the nodes in the cluster cannot schedule AdjustValue, attempting to divide AdjustValue into x quotas according to the remaining resources of the schedulable nodes in the cluster, the x quotas summing to AdjustValue;
when MinDesiredValue < AdjustValue < MaxDesiredValue, if the cluster can successfully schedule the quota AdjustValue, checking whether a copy set of the specification given by AdjustValue exists in the pod controller; if it does, scaling that copy set horizontally; if it does not, updating the pod controller to create a copy set of the corresponding specification, or creating a copy set for the corresponding controller; and notifying the pod controller of the event to be performed, the event comprising: the expected adjustment timestamp, the adjustment action, and the target amount of resources to adjust.
10. A machine learning-based container hybrid expansion and contraction system, comprising an electronic device for performing the method according to any one of claims 1-9.
CN202211554550.9A 2022-12-06 2022-12-06 Container mixed expansion and contraction method and system based on machine learning Pending CN116450280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211554550.9A CN116450280A (en) 2022-12-06 2022-12-06 Container mixed expansion and contraction method and system based on machine learning

Publications (1)

Publication Number Publication Date
CN116450280A true CN116450280A (en) 2023-07-18

Family

ID=87120846

Country Status (1)

Country Link
CN (1) CN116450280A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination