WO2024007426A1 - Method based on k8s combining disaster recovery drill fault prediction and Pod scheduling - Google Patents

Method based on k8s combining disaster recovery drill fault prediction and Pod scheduling

Info

Publication number
WO2024007426A1
WO2024007426A1 PCT/CN2022/114446 CN2022114446W WO2024007426A1 WO 2024007426 A1 WO2024007426 A1 WO 2024007426A1 CN 2022114446 W CN2022114446 W CN 2022114446W WO 2024007426 A1 WO2024007426 A1 WO 2024007426A1
Authority
WO
WIPO (PCT)
Prior art keywords
disaster recovery
drill
data
node
pod
Prior art date
Application number
PCT/CN2022/114446
Other languages
English (en)
French (fr)
Inventor
满新宇
陈世亮
杨梅
王震
朱庭俊
黄嘉伟
Original Assignee
中电信数智科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中电信数智科技有限公司
Publication of WO2024007426A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1458Management of the backup or restore process
    • G06F11/1464Management of the backup or restore process for networked environments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1458Management of the backup or restore process
    • G06F11/1461Backup scheduling policy

Definitions

  • The invention belongs to the technical field of disaster recovery drills, and specifically relates to a k8s-based method that combines disaster recovery drill fault prediction and Pod scheduling.
  • The technical problem to be solved by the present invention is to address the shortcomings of the above-mentioned prior art and provide a k8s-based method that combines disaster recovery drill fault prediction and Pod scheduling, which solves the technical problem that managing a k8s cluster through a host cannot meet cross-k8s-cluster management needs and improves the flexibility of Pod scheduling.
  • A k8s-based method combining disaster recovery drill fault prediction and Pod scheduling, including:
  • Step 1: Create a k8s-based central scheduling cluster on the central cluster management server in the network, and establish a Node in each subnet of the network; the central scheduling cluster includes: the Master node, Node nodes, and the data collection service Pod;
  • Step 2: The Master creates a backup Node in a designated non-central subnet through the k8s API server and deploys the data collection service Pod, and then a request is sent to the Master through the k8s API server to obtain the Pod drill data under the Nodes of the various locations participating in the disaster recovery drill;
  • Step 3: Select different model data analysis methods according to business characteristics to analyze the training Pod drill data, build and train a Markov chain model, and obtain the probability of failure at each step of the next k8s cluster disaster recovery drill;
  • Step 4: The Markov chain model training results and the data involved in the analysis are stored in the historical disaster recovery drill database deployed on the central server.
  • In step 1 above, the network node where the central cluster management server is located is the cluster central management node.
  • The central scheduling cluster includes three objects: the Master node, Node nodes, and the data collection service Pod, deployed as follows:
  • The Master is deployed on the central cluster management server, and all Nodes in the network, including the central location and the various local locations, are created.
  • A computing Node and a computing program are deployed on the central cluster management server; the computing program is responsible for obtaining the required disaster recovery drill data from the backup Node through the k8s API server and for performing the analysis and calculation of the related disaster recovery drill business.
  • The central server deploys a historical disaster recovery drill database and is responsible for storing the analysis results, which include the calculation results of each Pod or Pod set participating in the disaster recovery drill business;
  • The calculation results include: the time the calculation occurred, the Pod name, the Pod IP, the Node to which the Pod belongs, and the result data of this analysis and calculation for the remotely participating Pods and their Nodes.
  • In step 2 above, after receiving the request command, the Master starts to issue data collection instructions to the Nodes participating in the disaster recovery drill in the various locations, until the data stored in the data collection service Pod under each participating Node has been transferred to the backup Node.
  • Step 3 above offers two data analysis methods. Method 1, off-site data analysis: place the disaster recovery drill data at different computing points for data training, and finally perform a set analysis on the training results to extract the data result closest to the real situation.
  • Method 2, centralized data analysis: the backup Node performs centralized computation. First, the Pod data of the various locations participating in the disaster recovery drill is obtained under the backup Node; then, after receiving the instruction, the backup Node sends the disaster recovery drill data of the Pods or Pod sets participating in the calculation and analysis to the computing Node through the k8s API server, so that the data result closest to the real situation can be extracted.
  • Step 3 above builds the Markov chain model as follows:
  • P(X_{n+1} = i | X_n = j, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P_{ij}, n ≥ 0;
  • P_{ij} represents the probability of moving from the given current disaster recovery drill step j to disaster recovery drill step i;
  • X_n represents the current disaster recovery drill step;
  • X_{n+1} represents the next disaster recovery drill step;
  • where i, j, i_0, i_1, ..., i_{n-1} ∈ M, and this random process is a Markov chain.
  • Step 3 above generates a Markov chain data set and trains the Markov chain model.
  • The Markov chain data set is generated as follows:
  • The disaster recovery drill steps include incident reporting, business warning, and disaster assessment;
  • Initial probability of failure at the incident reporting step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step);
  • Initial probability of failure at the business warning step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step);
  • Initial probability of failure at the disaster assessment step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step).
  • The non-initial probabilities described in 3) above are obtained by accessing the historical disaster recovery drill database using incident reporting as the query condition.
  • In step 3, the matrix data set generated from the initial and non-initial probabilities is fed into the Markov chain model for training, and the probability of failure at each step of the next k8s cluster disaster recovery drill is finally obtained.
  • Step 4 above uses the computing program to store the collected Markov chain model training results and the data involved in the analysis into the historical disaster recovery drill database deployed on the central server;
  • The stored data includes the drill time, participating Pod names, participating Pod IPs, participating Nodes, the disaster recovery drill failure probability value, the disaster recovery drill abnormality flag, and the drill sequence number.
  • The invention highlights the advantages of k8s and artificial intelligence in the disaster recovery drill process. By adopting distributed data storage and analysis, different methods are used in the disaster recovery drill scenario to achieve a more efficient, intelligent data analysis, AI training, and storage process that is closer to reality. Private disaster recovery data from multiple locations can jointly participate in computation, disaster recovery data from multiple locations can be flexibly scheduled into analysis and computation, and computation and data are stored separately. This improves the efficiency of data calculation and analysis for off-site disaster recovery while reducing the resource consumption of the central server, and solves the technical problem in existing IT business systems that private disaster recovery data from multiple locations cannot jointly participate in computation and cannot be flexibly scheduled into analysis and computation.
  • Figure 1 is a flow chart of the method of the present invention.
  • The method of the present invention addresses the problems of large business scale, complex application relationships, many dependency levels, and difficult troubleshooting in data center operation and maintenance scenarios, where existing approaches cannot meet the cluster's requirements for operation and maintenance management, efficient scheduling, and data backup. It provides a feasible k8s-based method combining disaster recovery drill fault prediction and Pod scheduling for cross-cluster backup of cluster Pods, business data exchange between clusters, and flexible configuration and scheduling of cluster resources. See Figure 1.
  • The method of the present invention includes:
  • Step 1: Create a k8s-based central scheduling cluster on the central cluster management server in the network, and establish a Node in each subnet of the network.
  • The network node where the central cluster management server is located is referred to as the cluster central management node.
  • The central scheduling cluster mainly includes three objects: the Master node, Node nodes, and the data collection service Pod.
  • The computing program is responsible for obtaining the required disaster recovery drill data from the backup Node through the k8s API server and for performing the analysis and calculation of the related disaster recovery drill business.
  • The central server deploys a historical disaster recovery drill database and is responsible for storing the analysis results, including the calculation results of each Pod or Pod set participating in the disaster recovery drill business;
  • The calculation results include: the time the calculation occurred, the Pod name, the Pod IP, the Node to which the Pod belongs, and the result data of this analysis and calculation for the remotely participating Pods and their Nodes.
  • Step 2: The Master creates a backup Node in a designated non-central subnet through the k8s API server and deploys the data collection service Pod, and then a request is sent to the Master through the k8s API server to obtain the Pod drill data under the Nodes of the various locations participating in the disaster recovery drill.
  • Step 3: Adopt different model data analysis methods according to business characteristics and build the Markov chain model.
  • Method 1, off-site data analysis: the data of a joint multi-site disaster recovery drill may suffer network delays during transmission, and, because business characteristics differ between locations and network security policies are inconsistent, the data in the packets of the joint drill may be intercepted by network policies, which reduces the authenticity of the computed data. The disaster recovery drill data is therefore trained at different computing points, and a set analysis is finally performed on the training results to extract the data result closest to the real situation.
  • During a joint multi-site drill, any local Node, referred to as a computing point, sends a command to the Master (master node) through the k8s API server to obtain the disaster recovery data on the Pods participating in the computation at other locations; the data is transmitted to the computing point, and the [Markov chain model] is then built.
  • Method 2, centralized data analysis: the [computing program] deployed on the [computing Node] under the Master sends a request to the Master by calling the k8s API server to obtain the Pod data of the various locations participating in the disaster recovery drill under the [backup Node]; the [backup Node] then sends the disaster recovery drill data of the Pods or Pod sets participating in the calculation and analysis to the [computing Node] through the k8s API server, after which the [Markov chain model] is built.
  • The model formula is P(X_{n+1} = i | X_n = j, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P_{ij}, n ≥ 0;
  • P_{ij} represents the probability of moving from the given current disaster recovery drill step j to disaster recovery drill step i;
  • X_n represents the current disaster recovery drill step;
  • X_{n+1} represents the next disaster recovery drill step, which depends only on the current drill step;
  • where i, j, i_0, i_1, ..., i_{n-1} ∈ M; this random process is called a Markov chain.
  • The [computing program] deployed on the computing Node of the central server is executed to aggregate the Pod data participating in the disaster recovery drill.
  • Using the step name as the query condition, the [historical disaster recovery drill database] deployed on the central server is accessed for the drill data in which an abnormality occurred at each step of the disaster recovery drill process.
  • The initial probability of failure at each step is obtained through the calculation formula of each step of the drill process, together with the non-initial training data generated in order from the earliest to the most recent drill time (this prepares the data for obtaining the probability of failure at each step of each drill).
  • The initial probabilities are obtained as follows:
  • Initial probability of failure at the incident reporting step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step).
  • Initial probability of failure at the business warning step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step).
  • Initial probability of failure at the disaster assessment step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step).
  • Assume, for example, that the initial probabilities for the first month's drills are [x1 = 0.6, x2 = 0.2, x3 = 0.2]; within a month, drills may be conducted on a daily or hourly basis.
  • The non-initial probabilities are obtained as follows:
  • The initial probability of the second drill refers to the first drill's values [x1 = 0.6, x2 = 0.2, x3 = 0.2].
  • First, using x1 (incident reporting) as the query condition, the [historical disaster recovery drill database] is accessed to obtain, for the second disaster recovery drill, the proportion of failures moving from incident reporting to each of the other steps; that is, the change in incident-reporting faults among the total number of drills in the second month compared with the first month.
  • x2 and x3 are obtained in the same way.
  • The matrix data set is:
  • X1 [incident reporting] is mainly responsible for initiating a disaster recovery drill event and determining the set of business Pods participating in this drill, including local drill events and joint multi-site drill events.
  • X2 [business warning] is mainly responsible for screening the drill events that occur, to determine whether they are false-alarm drill events or non-fault drill events caused by other reasons.
  • X3 [disaster assessment] is mainly responsible for obtaining and storing the disaster level in the [historical disaster recovery drill database] (divided into five levels: disaster, urgent, important, minor, and slight).
  • The result of the second model training, [0.22, 0.4, 0.38], is the finally obtained probability of failure at each drill step of the k8s cluster's local and off-site disaster recovery drills in the second month.
  • Following the characteristics of the Markov chain model, each training round is based on the probabilities of the previous round.
  • Training ends when the values for the three steps approach or reach the average value, i.e. when they fluctuate within 10% of the average probability of 0.33 or reach [0.34, 0.33, 0.33]; this indicates that the probabilities have stabilized and no further model training is needed for the probability prediction, so training is terminated. Otherwise, training continues until all entries in the matrix data set have been used.
  • The probability of failure at each step of the next k8s cluster disaster recovery drill is thus obtained. In this way, artificial intelligence safeguards off-site disaster recovery drills and provides a reference for healthy business operation.
  • Step 4: The [computing program] stores the collected Markov chain model training results and the data involved in the analysis into the [historical disaster recovery drill database] deployed on the central server, including the drill time, participating Pod names, participating Pod IPs, participating Nodes, the disaster recovery drill failure probability value, the disaster recovery drill abnormality flag, and the drill sequence number. This completes a full cycle of Pod data retrieval, analysis, model training, and result storage for a k8s-based disaster recovery drill.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Train Traffic Observation, Control, And Security (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method based on k8s combining disaster recovery drill fault prediction and Pod scheduling, including: creating a k8s-based central scheduling cluster on the central cluster management server in the network while establishing a Node in each subnet of the network; within the network, the Master creates a backup Node in a designated non-central subnet and deploys the data collection service Pod, and a request is then sent to the Master to obtain the Pod drill data under the Nodes of the various locations participating in the disaster recovery drill; different model data analysis methods are selected to build and train a Markov chain model and obtain the probability of failure at each step of the next k8s cluster disaster recovery drill; and the Markov chain model training results and the data involved in the analysis are stored. The method solves the technical problem that managing a k8s cluster through a host cannot meet cross-k8s-cluster management needs, and improves the flexibility of Pod scheduling.

Description

Method based on k8s combining disaster recovery drill fault prediction and Pod scheduling
Technical Field
The present invention belongs to the technical field of disaster recovery drills, and specifically relates to a k8s-based method that combines disaster recovery drill fault prediction and Pod scheduling.
Background Art
With the gradual development of digital technology, network security has become an important guarantee of social development, which makes off-site disaster recovery increasingly valuable as a reference. Collecting and processing disaster recovery data is an important link in a disaster recovery drill: comprehensive information collection and accurate data ensure that every task of the drill executes normally. The drill process closely approximates the handling of a real disaster, which ensures that the drill is useful in practice and allows automated disaster recovery drills to serve as a reference for data maintenance.
Traditional methods for converging and analyzing disaster recovery data all have certain dimensional limitations, waste resources, and make it inconvenient for the private data associated with disaster recovery at different locations to be analyzed and computed jointly. This is unfavorable for operation and maintenance personnel who need to analyze the state of the disaster recovery systems at the various locations of a large network and process their data.
Summary of the Invention
The technical problem to be solved by the present invention is to address the shortcomings of the above-mentioned prior art and provide a k8s-based method that combines disaster recovery drill fault prediction and Pod scheduling, which solves the technical problem that managing a k8s cluster through a host cannot meet cross-k8s-cluster management needs, and improves the flexibility of Pod scheduling.
To achieve the above technical objective, the present invention adopts the following technical solution:
A method based on k8s combining disaster recovery drill fault prediction and Pod scheduling, including:
Step 1: Create a k8s-based central scheduling cluster on the central cluster management server in the network, and establish a Node in each subnet of the network; the central scheduling cluster includes: the Master node, Node nodes, and the data collection service Pod;
Step 2: Within the network, the Master creates a backup Node in a designated non-central subnet through the k8s API server and deploys the data collection service Pod, and then a request is sent to the Master through the k8s API server to obtain the Pod drill data under the Nodes of the various locations participating in the disaster recovery drill;
Step 3: Select different model data analysis methods according to business characteristics to analyze the training Pod drill data, build and train a Markov chain model, and obtain the probability of failure at each step of the next k8s cluster disaster recovery drill;
Step 4: Store the Markov chain model training results and the data involved in the analysis in the historical disaster recovery drill database deployed on the central server.
To optimize the above technical solution, the specific measures adopted further include:
In step 1 above, the network node where the central cluster management server is located is the cluster central management node, and the central scheduling cluster includes three objects: the Master node, Node nodes, and the data collection service Pod, deployed as follows:
The Master is deployed on the central cluster management server and all Nodes in the network, including the central location and the various local locations, are created; a computing Node and a computing program are deployed on the central cluster management server, and the computing program is responsible for obtaining the required disaster recovery drill data from the backup Node through the k8s API server and for performing the analysis and calculation of the related disaster recovery drill business;
The central server deploys a historical disaster recovery drill database and is responsible for storing the analysis results, which include the calculation results of each Pod or Pod set participating in the disaster recovery drill business;
The calculation results include: the time the calculation occurred, the Pod name, the Pod IP, the Node to which the Pod belongs, and the result data of this analysis and calculation for the remotely participating Pods and their Nodes.
In step 2 above, after receiving the request command, the Master starts to issue data collection instructions to the Nodes participating in the disaster recovery drill in the various locations, until the data stored in the data collection service Pod under each participating Node has been transferred to the backup Node.
Step 3 above offers the following two data analysis methods:
Method 1, off-site data analysis: place the disaster recovery drill data at different computing points for data training, and finally perform a set analysis on the training results to extract the data result closest to the real situation;
Method 2, centralized data analysis: centralized computation on the backup Node:
First, the Pod data of the various locations participating in the disaster recovery drill is obtained under the backup Node;
Second, after receiving the instruction, the backup Node sends the disaster recovery drill data of the Pods or Pod sets participating in the calculation and analysis to the computing Node through the k8s API server, and the data result closest to the real situation is extracted.
Step 3 above builds the Markov chain model as follows:
P(X_{n+1} = i | X_n = j, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P_{ij}, n ≥ 0
P_{ij} represents the probability of moving from the given current disaster recovery drill step j to disaster recovery drill step i;
X_n represents the current disaster recovery drill step;
X_{n+1} represents the next disaster recovery drill step;
where i, j, i_0, i_1, ..., i_{n-1} ∈ M, and this random process is a Markov chain.
Step 3 above generates a Markov chain data set and trains the Markov chain model;
The Markov chain data set is generated as follows:
1) Obtain the initial probability of failure at each step of the disaster recovery drill process and the set of drill data in which an abnormality occurred during each disaster recovery drill;
2) The computing program deployed on the computing Node of the central server aggregates the Pod data participating in the disaster recovery drill and, following the drill steps and using the step name as the query condition, accesses the historical disaster recovery drill database deployed on the central server for the drill data in which an abnormality occurred at each step of the drill process;
The disaster recovery drill steps include incident reporting, business warning, and disaster assessment;
3) Obtain, through the calculation formula of each step of the drill process, the initial probability of failure at each step and the non-initial probabilities generated in order from the earliest to the most recent drill time, which together constitute the Markov chain data set.
The initial probability of failure at each step described in 3) above is obtained as follows:
Initial probability of failure at the incident reporting step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step);
Initial probability of failure at the business warning step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step);
Initial probability of failure at the disaster assessment step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step).
The non-initial probabilities described in 3) above are obtained by accessing the historical disaster recovery drill database using incident reporting as the query condition.
In step 3 above, the matrix data set generated from the initial and non-initial probabilities is fed into the Markov chain model for training, and the probability of failure at each step of the next k8s cluster disaster recovery drill is finally obtained.
In step 4 above, the computing program stores the collected Markov chain model training results and the data involved in the analysis into the historical disaster recovery drill database deployed on the central server;
The stored data includes the drill time, participating Pod names, participating Pod IPs, participating Nodes, the disaster recovery drill failure probability value, the disaster recovery drill abnormality flag, and the drill sequence number.
The present invention has the following beneficial effects:
The present invention highlights the advantages of k8s and artificial intelligence in the disaster recovery drill process. By adopting distributed data storage and analysis, different methods are used in the disaster recovery drill scenario to achieve a more efficient, intelligent data analysis, AI training, and storage process that is closer to reality. Private disaster recovery data from multiple locations can jointly participate in computation, disaster recovery data from multiple locations can be flexibly scheduled into analysis and computation, and computation and data are stored separately. This improves the efficiency of data calculation and analysis for off-site disaster recovery while reducing the resource consumption of the central server, and solves the technical problem in existing IT business systems that private disaster recovery data from multiple locations cannot jointly participate in computation and cannot be flexibly scheduled into analysis and computation.
Brief Description of the Drawings
Figure 1 is a flow chart of the method of the present invention.
Detailed Description of the Embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The method of the present invention addresses the problems of large business scale, complex application relationships, many dependency levels, and difficult troubleshooting in data center operation and maintenance scenarios, where existing approaches cannot meet the cluster's requirements for operation and maintenance management, efficient scheduling, and data backup. It provides a feasible k8s-based method combining disaster recovery drill fault prediction and Pod scheduling for cross-cluster backup of cluster Pods, business data exchange between clusters, and flexible configuration and scheduling of cluster resources; see Figure 1. The method of the present invention includes:
Step 1: Create a k8s-based central scheduling cluster on the central cluster management server in the network, and establish a Node in each subnet of the network.
The network node where the central cluster management server is located is referred to as the cluster central management node.
The central scheduling cluster mainly includes three objects: the Master node, Node nodes, and the data collection service Pod.
The deployment and analysis method is characterized as follows:
Master:
First, deploy the Master on the central cluster management server and create all Nodes in the network, including the central location and the various local locations.
Second, deploy a computing Node and a computing program on the central cluster management server; the computing program is responsible for obtaining the required disaster recovery drill data from the backup Node through the k8s API server and for performing the analysis and calculation of the related disaster recovery drill business.
The central server deploys a historical disaster recovery drill database and is responsible for storing the analysis results, including the calculation results of each Pod or Pod set participating in the disaster recovery drill business;
The calculation results include: the time the calculation occurred, the Pod name, the Pod IP, the Node to which the Pod belongs, and the result data of this analysis and calculation for the remotely participating Pods and their Nodes.
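A minimal sketch, using the official Python kubernetes client, of how a computing program of this kind could reach the central scheduling cluster through the k8s API server and enumerate the Nodes and data collection Pods; the namespace and the app=drill-data-collector label are illustrative assumptions, not names defined by the patent.

```python
from kubernetes import client, config

# Load credentials for the central scheduling cluster's API server
# (inside a Pod this would normally be config.load_incluster_config()).
config.load_kube_config()
v1 = client.CoreV1Api()

# Enumerate every Node registered with the central scheduling cluster,
# i.e. the central computing Node plus the Nodes created for each subnet.
for node in v1.list_node().items:
    print("node:", node.metadata.name)

# Locate the data collection service Pods that hold drill data.
# The label selector below is a hypothetical naming convention.
pods = v1.list_namespaced_pod(
    namespace="disaster-drill",
    label_selector="app=drill-data-collector",
)
for pod in pods.items:
    print("collector:", pod.metadata.name, pod.status.pod_ip, pod.spec.node_name)
```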
Step 2: Within the network, the Master creates a backup Node in a designated non-central subnet through the k8s API server and deploys the data collection service Pod, and then a request is sent to the Master through the k8s API server to obtain the Pod drill data under the Nodes of the various locations participating in the disaster recovery drill.
Specific description:
A data collection service Pod is deployed under each local Node and is responsible for collecting and storing the data of all Pods under that Node that participate in the disaster recovery drill. After receiving the request command, the Master starts to issue data collection instructions to the Nodes participating in the drill in the various locations, until the data stored in the data collection service Pod under each participating Node has been transferred to the backup Node, thereby completing the collection and storage of the distributed disaster recovery drill data.
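The data collection service Pod described above could be pinned to a specific subnet Node through the k8s API server roughly as in the following sketch; the image, namespace, and port are placeholder values rather than components specified by the patent.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def deploy_collector(node_name: str) -> None:
    """Create one data collection service Pod bound to the given Node."""
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"drill-data-collector-{node_name}",
            labels={"app": "drill-data-collector"},
        ),
        spec=client.V1PodSpec(
            node_name=node_name,  # schedule directly onto this subnet Node
            containers=[
                client.V1Container(
                    name="collector",
                    image="registry.example.com/drill/collector:latest",  # placeholder image
                    ports=[client.V1ContainerPort(container_port=8080)],
                )
            ],
        ),
    )
    v1.create_namespaced_pod(namespace="disaster-drill", body=pod)

# For example, one collector per participating subnet Node:
for node in ("node-subnet-a", "node-subnet-b"):
    deploy_collector(node)
```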
Step 3: Adopt different model data analysis methods according to business characteristics, and build the Markov chain model;
Method 1: off-site data analysis
Considering that the data of a joint multi-site disaster recovery drill may experience network delays during transmission, and that, because business characteristics differ between locations and network security policies are inconsistent, the data in the packets of the joint drill may be intercepted by network policies and the authenticity of the computed data thereby reduced, the disaster recovery drill data is trained at different computing points. Finally, a set analysis is performed on the training results to extract the data result closest to the real situation.
Specific description of joint off-site training of disaster recovery drill data:
During a joint multi-site disaster recovery drill, any local Node, referred to as a computing point, sends a command to the Master (master node) through the k8s API server; the disaster recovery data on the Pods participating in the computation at other locations during this joint drill is transmitted to the computing point, and the [Markov chain model] is then built.
Method 2: centralized data analysis
Centralized computation on the [backup Node] is used when the off-site disaster recovery drill data needs to be analyzed and computed from a global perspective and the possibility of losing joint drill data does not need to be considered.
Specific description of joint centralized training of disaster recovery drill data:
First, the [computing program] on the [computing Node] deployed under the Master (master node) calls the k8s API server to send a request to the Master for the Pod data of the various locations participating in the disaster recovery drill under the [backup Node].
Second, after receiving the instruction, the [backup Node] sends the disaster recovery drill data of the Pods or Pod sets participating in the calculation and analysis to the [computing Node] through the k8s API server, after which the [Markov chain model] is built.
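In both methods the requesting side ultimately locates the relevant data collection Pods through the API server and pulls their stored drill data. A sketch under the assumption that each collector Pod exposes an HTTP endpoint /drill-data (a hypothetical interface, not one specified by the patent):

```python
import requests
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def fetch_drill_data(node_name: str, namespace: str = "disaster-drill") -> list:
    """Pull drill data from every collector Pod running on one Node."""
    pods = v1.list_namespaced_pod(
        namespace=namespace,
        label_selector="app=drill-data-collector",
        field_selector=f"spec.nodeName={node_name}",
    )
    records = []
    for pod in pods.items:
        # Hypothetical endpoint served by the data collection service Pod.
        url = f"http://{pod.status.pod_ip}:8080/drill-data"
        records.extend(requests.get(url, timeout=10).json())
    return records

# The computing point (Method 1) or the computing Node (Method 2) aggregates
# the drill data held under the backup Node before building the Markov chain model.
drill_records = fetch_drill_data("backup-node")
```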
Build the [Markov chain model]
The formula is as follows:
P(X_{n+1} = i | X_n = j, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P_{ij}, n ≥ 0
P_{ij} represents the probability of moving from the given current disaster recovery drill step j to disaster recovery drill step i;
X_n represents the current disaster recovery drill step;
X_{n+1} represents the next disaster recovery drill step, which depends only on the current drill step;
where i, j, i_0, i_1, ..., i_{n-1} ∈ M; this random process is called a Markov chain.
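The training rounds computed below are simply repeated applications of this transition rule in vector form; restated compactly in standard Markov chain notation, with π denoting the vector of per-step failure probabilities:

```latex
P\bigl(X_{n+1}=i \mid X_n=j,\, X_{n-1}=i_{n-1}, \dots, X_0=i_0\bigr)
    = P\bigl(X_{n+1}=i \mid X_n=j\bigr) = P_{ij}, \qquad n \ge 0,
\qquad\text{and therefore}\qquad
\pi^{(n+1)}_i = \sum_{j \in M} \pi^{(n)}_j \, P_{ij}.
```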
Further, a Markov chain data set is generated and the Markov chain model is trained;
Specific description of generating the Markov chain data set:
First, obtain the initial probability of failure at each step of the disaster recovery drill process and the set of drill data in which an abnormality occurred during each drill. The data comes from the [historical disaster recovery drill database].
Second, the [computing program] deployed on the computing Node of the central server aggregates the Pod data participating in the disaster recovery drill and, following the drill steps (incident reporting, business warning, disaster assessment) and using the step name as the query condition, accesses the [historical disaster recovery drill database] deployed on the central server for the drill data in which an abnormality occurred at each step of the drill process.
Finally, obtain, through the calculation formula of each step of the drill process, the initial probability of failure at each step and the non-initial training data generated in order from the earliest to the most recent drill time (this prepares the data for obtaining the probability of failure at each step of each drill).
The initial probabilities are obtained as follows:
Initial probability of failure at the incident reporting step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step).
Initial probability of failure at the business warning step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step).
Initial probability of failure at the disaster assessment step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step).
Assume the initial probabilities for the first month's disaster recovery drills are [x1 = 0.6, x2 = 0.2, x3 = 0.2].
(Within a month, drills may be conducted on a daily or hourly basis.)
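A minimal sketch of the initial-probability formula above; the counts are hypothetical and chosen only so that the result reproduces the assumed first-month values.

```python
def initial_failure_probability(failed: int, total: int) -> float:
    """Drills that failed at this step / total drills that include this step."""
    return failed / total if total else 0.0

# Hypothetical counts chosen to match the assumed first month:
# x1 = 0.6 (incident reporting), x2 = 0.2 (business warning), x3 = 0.2 (disaster assessment)
counts = {
    "incident_reporting": (18, 30),
    "business_warning": (6, 30),
    "disaster_assessment": (6, 30),
}
x1, x2, x3 = (initial_failure_probability(f, t) for f, t in counts.values())
print(x1, x2, x3)   # 0.6 0.2 0.2
```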
The non-initial probabilities are obtained as follows:
The initial probability of the second drill refers to the first drill's values
[x1 = 0.6, x2 = 0.2, x3 = 0.2].
First, using x1 (incident reporting) as the query condition, the [historical disaster recovery drill database] is accessed to obtain, for the second disaster recovery drill, the proportion of failures moving from incident reporting to each of the other steps; that is, the change in incident-reporting faults among the total number of drills in the second month compared with the first month. x2 and x3 are obtained in the same way.
That is:
If the first drill had x1 = 0.6, the probabilities for the second drill are
[x1 = 0.2, x2 = 0.3, x3 = 0.5];
If the first drill had x2 = 0.2, the probabilities for the second drill are
[x1 = 0.1, x2 = 0.6, x3 = 0.3];
If the first drill had x3 = 0.2, the probabilities for the second drill are
[x1 = 0.4, x2 = 0.5, x3 = 0.1].
Finally, the matrix data set generated from the initial and non-initial probabilities is fed into the [Markov chain model] for training.
The specific steps are described as follows:
The matrix data set is:
[Figure PCTCN2022114446-appb-000001: the initial probability vector and the three transition rows listed below]
X1 = [incident reporting] is mainly responsible for initiating a disaster recovery drill event and determining the set of business Pods participating in this drill, including local drill events and joint multi-site drill events.
X2 = [business warning] is mainly responsible for screening the drill events that occur, to determine whether they are false-alarm drill events or non-fault drill events caused by other reasons.
X3 = [disaster assessment] is mainly responsible for obtaining and storing the disaster level in the [historical disaster recovery drill database] (divided into five levels: disaster, urgent, important, minor, and slight).
The first (initial) vector is [X1 = 0.6, X2 = 0.2, X3 = 0.2];
The transition row for X1 = 0.6 is [X1 = 0.2, X2 = 0.3, X3 = 0.5];
The transition row for X2 = 0.2 is [X1 = 0.1, X2 = 0.6, X3 = 0.3];
The transition row for X3 = 0.2 is [X1 = 0.4, X2 = 0.5, X3 = 0.1].
Training is performed according to the model formula:
P(X_{n+1} = i | X_n = j, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P_{ij}, n ≥ 0
Calculation step 1:
initial X1 = 0.6 multiplied by 0.2 (transition from X1 to X1), plus
initial X2 = 0.2 multiplied by 0.1 (transition from X2 to X1), plus
initial X3 = 0.2 multiplied by 0.4 (transition from X3 to X1),
giving the incident reporting probability for the second drill: X1 = 0.22.
Calculation step 2:
initial X1 = 0.6 multiplied by 0.3 (transition from X1 to X2), plus
initial X2 = 0.2 multiplied by 0.6 (transition from X2 to X2), plus
initial X3 = 0.2 multiplied by 0.5 (transition from X3 to X2),
giving the business warning probability for the second drill: X2 = 0.4.
Calculation step 3:
initial X1 = 0.6 multiplied by 0.5 (transition from X1 to X3), plus
initial X2 = 0.2 multiplied by 0.3 (transition from X2 to X3), plus
initial X3 = 0.2 multiplied by 0.1 (transition from X3 to X3),
giving the disaster assessment probability for the second drill: X3 = 0.38.
First (initial) probabilities: [0.6, 0.2, 0.2]
Result of the second model training: [0.22, 0.4, 0.38]; that is, the finally obtained probability of failure at each drill step of the k8s cluster's local and off-site disaster recovery drills in the second month.
In step 3, following the characteristics of the Markov chain model, each training round is based on the probabilities of the previous round, and training ends when the values for the three steps approach or reach the average value; that is, when they fluctuate within 10% of the average probability of 0.33 or reach [0.34, 0.33, 0.33], the probabilities are considered stable, no further model training is needed for the probability prediction, and training is terminated; otherwise, training continues until all entries in the matrix data set have been used. The probability of failure at each step of the next k8s cluster disaster recovery drill is finally obtained. In this way, artificial intelligence safeguards off-site disaster recovery drills and provides a reference for healthy business operation.
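A minimal numpy sketch of the training round and stopping rule described above, using the example initial vector and transition rows from this embodiment; the 10-round cap stands in for the size of the matrix data set and is not a value taken from the patent.

```python
import numpy as np

# Initial fault probabilities for the three drill steps
# (incident reporting, business warning, disaster assessment).
pi = np.array([0.6, 0.2, 0.2])

# Row j holds the probabilities of moving from step j to each step.
T = np.array([
    [0.2, 0.3, 0.5],   # from incident reporting
    [0.1, 0.6, 0.3],   # from business warning
    [0.4, 0.5, 0.1],   # from disaster assessment
])

def is_stable(p: np.ndarray, avg: float = 1 / 3, tol: float = 0.10) -> bool:
    """True once every step's probability is within 10% of the average."""
    return bool(np.all(np.abs(p - avg) <= tol * avg))

# One training round reproduces the second-drill result [0.22, 0.40, 0.38].
pi = pi @ T
print(np.round(pi, 2))

# Keep training until the values stabilize or the data set is exhausted.
max_rounds = 10   # stand-in for the number of remaining rows in the data set
for _ in range(max_rounds):
    if is_stable(pi):
        break
    pi = pi @ T
print(np.round(pi, 3), "stable:", is_stable(pi))
```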
Step 4: Finally, the [computing program] stores the collected Markov chain model training results and the data involved in the analysis into the [historical disaster recovery drill database] deployed on the central server, including the drill time, participating Pod names, participating Pod IPs, participating Nodes, the disaster recovery drill failure probability value, the disaster recovery drill abnormality flag, and the drill sequence number. This completes a full cycle of Pod data retrieval, analysis, model training, and result storage for a k8s-based disaster recovery drill.
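For concreteness, one stored record could look like the following sketch; the field names and values are illustrative only, not a schema defined by the patent.

```python
# A hedged example of a single row written to the historical
# disaster recovery drill database after one training cycle.
drill_record = {
    "drill_time": "2022-08-24T10:00:00Z",            # when the drill ran
    "pod_names": ["business-pod-1", "business-pod-2"],
    "pod_ips": ["10.244.3.17", "10.244.5.9"],
    "nodes": ["node-subnet-a", "node-subnet-b"],
    "failure_probability": [0.22, 0.40, 0.38],       # per drill step
    "abnormal": False,                                # abnormality flag
    "drill_seq": 2,                                   # drill sequence number
}
```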
The above are only preferred embodiments of the present invention; the scope of protection of the present invention is not limited to the above embodiments, and all technical solutions falling under the idea of the present invention belong to its scope of protection. It should be pointed out that, for those of ordinary skill in the art, several improvements and refinements made without departing from the principles of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

  1. A method based on k8s combining disaster recovery drill fault prediction and Pod scheduling, characterized by including:
    Step 1: creating a k8s-based central scheduling cluster on the central cluster management server in the network, and establishing a Node in each subnet of the network, the central scheduling cluster including: a Master node, Node nodes, and a data collection service Pod;
    Step 2: within the network, the Master creating a backup Node in a designated non-central subnet through the k8s API server and deploying the data collection service Pod, and then sending a request to the Master through the k8s API server to obtain the Pod drill data under the Nodes of the various locations participating in the disaster recovery drill;
    Step 3: selecting different model data analysis methods according to business characteristics to analyze the training Pod drill data, so as to build and train a Markov chain model and obtain the probability of failure at each step of the next k8s cluster disaster recovery drill;
    Step 4: storing the Markov chain model training results and the data involved in the analysis in a historical disaster recovery drill database deployed on the central server.
  2. The method based on k8s combining disaster recovery drill fault prediction and Pod scheduling according to claim 1, characterized in that, in step 1, the network node where the central cluster management server is located is the cluster central management node, and the central scheduling cluster includes three objects: the Master node, Node nodes, and the data collection service Pod, deployed as follows:
    the Master is deployed on the central cluster management server and all Nodes in the network, including the central location and the various local locations, are created; a computing Node and a computing program are deployed on the central cluster management server, and the computing program is responsible for obtaining the required disaster recovery drill data from the backup Node through the k8s API server and for performing the analysis and calculation of the related disaster recovery drill business;
    the central server deploys a historical disaster recovery drill database and is responsible for storing the analysis results, which include the calculation results of each Pod or Pod set participating in the disaster recovery drill business;
    the calculation results include: the time the calculation occurred, the Pod name, the Pod IP, the Node to which the Pod belongs, and the result data of this analysis and calculation for the remotely participating Pods and their Nodes.
  3. The method based on k8s combining disaster recovery drill fault prediction and Pod scheduling according to claim 1, characterized in that, in step 2, after receiving the request command, the Master starts to issue data collection instructions to the Nodes participating in the disaster recovery drill in the various locations, until the data stored in the data collection service Pod under each participating Node has been transferred to the backup Node.
  4. The method based on k8s combining disaster recovery drill fault prediction and Pod scheduling according to claim 1, characterized in that step 3 offers the following two data analysis methods:
    method 1, off-site data analysis: placing the disaster recovery drill data at different computing points for data training, and finally performing a set analysis on the training results to extract the data result closest to the real situation;
    method 2, centralized data analysis: centralized computation on the backup Node:
    first, obtaining the Pod data of the various locations participating in the disaster recovery drill under the backup Node;
    second, after receiving the instruction, the backup Node sending the disaster recovery drill data of the Pods or Pod sets participating in the calculation and analysis to the computing Node through the k8s API server to extract the data result closest to the real situation.
  5. The method based on k8s combining disaster recovery drill fault prediction and Pod scheduling according to claim 1, characterized in that step 3 builds the Markov chain model as follows:
    P(X_{n+1} = i | X_n = j, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P_{ij}, n ≥ 0
    P_{ij} represents the probability of moving from the given current disaster recovery drill step j to disaster recovery drill step i;
    X_n represents the current disaster recovery drill step;
    X_{n+1} represents the next disaster recovery drill step;
    where i, j, i_0, i_1, ..., i_{n-1} ∈ M, and this random process is a Markov chain.
  6. The method based on k8s combining disaster recovery drill fault prediction and Pod scheduling according to claim 1, characterized in that step 3 generates a Markov chain data set and trains the Markov chain model;
    the Markov chain data set is generated as follows:
    1) obtaining the initial probability of failure at each step of the disaster recovery drill process and the set of drill data in which an abnormality occurred during each drill;
    2) the computing program deployed on the computing Node of the central server aggregating the Pod data participating in the disaster recovery drill and, following the drill steps and using the step name as the query condition, accessing the historical disaster recovery drill database deployed on the central server for the drill data in which an abnormality occurred at each step of the drill process;
    the disaster recovery drill steps including incident reporting, business warning, and disaster assessment;
    3) obtaining, through the calculation formula of each step of the drill process, the initial probability of failure at each step and the non-initial probabilities generated in order from the earliest to the most recent drill time, which together constitute the Markov chain data set.
  7. The method based on k8s combining disaster recovery drill fault prediction and Pod scheduling according to claim 6, characterized in that, in 3), the initial probability of failure at each step is obtained as follows:
    initial probability of failure at the incident reporting step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step);
    initial probability of failure at the business warning step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step);
    initial probability of failure at the disaster assessment step = (number of drills in the historical disaster recovery drill database that failed at this step) / (total number of drills that include this step).
  8. The method based on k8s combining disaster recovery drill fault prediction and Pod scheduling according to claim 6, characterized in that, in 3), the non-initial probabilities are obtained by accessing the historical disaster recovery drill database using incident reporting as the query condition.
  9. The method based on k8s combining disaster recovery drill fault prediction and Pod scheduling according to claim 6, characterized in that, in step 3, the matrix data set generated from the initial and non-initial probabilities is fed into the Markov chain model for training, and the probability of failure at each step of the next k8s cluster disaster recovery drill is finally obtained.
  10. The method based on k8s combining disaster recovery drill fault prediction and Pod scheduling according to claim 1, characterized in that, in step 4, the computing program stores the collected Markov chain model training results and the data involved in the analysis into the historical disaster recovery drill database deployed on the central server;
    the stored data includes the drill time, participating Pod names, participating Pod IPs, participating Nodes, the disaster recovery drill failure probability value, the disaster recovery drill abnormality flag, and the drill sequence number.
PCT/CN2022/114446 2022-07-06 2022-08-24 Method based on k8s combining disaster recovery drill fault prediction and Pod scheduling WO2024007426A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210787530.X 2022-07-06
CN202210787530.XA CN115220961A (zh) 2022-07-06 2022-07-06 Method based on k8s combining disaster recovery drill fault prediction and Pod scheduling

Publications (1)

Publication Number Publication Date
WO2024007426A1 (zh) 2024-01-11

Family

ID=83609323

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/114446 WO2024007426A1 (zh) 2022-07-06 2022-08-24 Method based on k8s combining disaster recovery drill fault prediction and Pod scheduling

Country Status (2)

Country Link
CN (1) CN115220961A (zh)
WO (1) WO2024007426A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160246467A1 (en) * 2015-02-25 2016-08-25 Salesforce.Com, Inc. Automatically generating a walkthrough of an application or an online service
  • CN111193782A (zh) * 2019-12-18 2020-05-22 北京航天智造科技发展有限公司 PaaS cloud cluster construction method and apparatus, electronic device, and storage medium
  • CN113672350A (zh) * 2021-08-20 2021-11-19 深信服科技股份有限公司 Application processing method and apparatus, and related device
  • CN114138549A (zh) * 2021-10-29 2022-03-04 苏州浪潮智能科技有限公司 Data backup and recovery method based on a Kubernetes system
  • CN114185679A (zh) * 2021-12-15 2022-03-15 中国工商银行股份有限公司 Container resource scheduling method and apparatus, computer device, and storage medium

Also Published As

Publication number Publication date
CN115220961A (zh) 2022-10-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22949990

Country of ref document: EP

Kind code of ref document: A1