WO2024001080A1 - AIOps-based fault location method spanning the database and infrastructure - Google Patents

AIOps-based fault location method spanning the database and infrastructure

Info

Publication number
WO2024001080A1
Authority
WO
WIPO (PCT)
Prior art keywords
alarm
transaction
database
key performance
performance indicator
Prior art date
Application number
PCT/CN2022/139853
Other languages
English (en)
French (fr)
Inventor
刘睿民
林秀峰
Original Assignee
北京柏睿数据技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京柏睿数据技术股份有限公司
Publication of WO2024001080A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The invention belongs to the field of information technology, and specifically relates to a fault location method, based on artificial intelligence operation and maintenance (AIOps), that spans the database and its infrastructure.
  • The present invention provides an AIOps-based fault location method spanning the database and infrastructure, which can effectively solve the above problems.
  • The method includes the following steps:
  • Step 1 Build an intelligent operation and maintenance big data distributed platform.
  • the intelligent operation and maintenance big data distributed platform includes a distributed storage unit and a distributed computing platform;
  • Step 2 Within a preset time period, collect the key performance indicator vectors of the IaaS infrastructure layer and the alarm information of the database operation; wherein, each key performance indicator vector is an n-dimensional vector, including n key performance indicators;
  • Step 3 Perform standardized preprocessing on the key performance indicator vectors of the IaaS infrastructure layer to obtain standardized key performance indicator vectors;
  • Step 4 Jointly analyze the standardized key performance indicator vectors collected at different times and the alarm information generated at different times to obtain the source of the alarm information;
  • Step 5 Divide a set of alarm information within a continuous period into an alarm transaction, thereby obtaining multiple alarm transactions; mark the alarm source of each alarm transaction; the alarm source of each alarm transaction is the vector combination formed by the standardized key performance indicator vectors collected in the time period corresponding to that alarm transaction;
  • Step 6 Use the alarm source marked by each alarm transaction as the label of the alarm transaction, use the alarm transaction as the input, and use the probability that each alarm transaction belongs to each type of alarm source as the output to train the CNN convolutional neural network.
  • the trained CNN convolutional neural network is the fault location and root cause analysis classification model;
  • Step 7 real-time data fault diagnosis and root cause analysis:
  • During real-time operation of the database, when alarm information is generated, the alarm information within a continuous period is treated as an alarm transaction and input into the fault location and root cause analysis classification model; the model outputs the probability corresponding to each type of alarm source, and the alarm source with the highest probability is taken, which completes the database alarm root cause analysis.
  • The key performance indicator vector includes 6 key performance indicators, which are: server IP address, server CPU occupancy, server memory occupancy, server hard disk read and write rate, server hard disk space occupancy and network real-time rate.
  • The alarm information of database operation includes 39 categories, which are: general alarm information, no-data alarm, SQL statement not yet complete, connection exception, triggered action exception, unsupported feature, invalid transaction initiation, locator exception, invalid role specification, diagnostics exception, cardinality violation, data exception, integrity constraint violation, invalid cursor state, invalid transaction state, invalid SQL statement name, triggered data change violation, invalid authorization specification, dependent privilege descriptors still exist, invalid transaction termination, SQL routine exception, invalid cursor name, external routine exception, external routine invocation exception, savepoint exception, invalid catalog name, invalid schema name, transaction rollback, syntax error or access rule violation, check option violation, insufficient resources, program limit exceeded, object not in prerequisite state, operator intervention, system error, snapshot failure, configuration file error, foreign data wrapper error, internal error alarm.
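  • Step 3 is specifically:
  • The key performance indicator vector is expressed as $X(t) = (X_1, X_2, \ldots, X_n)$, meaning the key performance indicator vector collected at collection time $t$, which includes n key performance indicators $X_1, X_2, \ldots, X_n$;
  • Assume that u key performance indicator vectors are collected within the preset time period: $X(t_1) = (X_{11}, X_{12}, \ldots, X_{1n})$, $X(t_2) = (X_{21}, X_{22}, \ldots, X_{2n})$, ..., $X(t_u) = (X_{u1}, X_{u2}, \ldots, X_{un})$, collected at times $t_1, t_2, \ldots, t_u$ respectively;
  • For the key performance indicator $X_{11}$, the following method is used for standardization to obtain the standardized key performance indicator $X'_{11}$: $X'_{11} = (X_{11} - \bar{X}) / \sigma$ (1), where $\bar{X}$ is the mean of $X_{11}, X_{21}, \ldots, X_{u1}$ and $\sigma$ is the standard deviation of $X_{11}, X_{21}, \ldots, X_{u1}$;
  • The other key performance indicators are standardized in the same way.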
  • step 5 is specifically:
  • Step 5.1 for a certain alarm source Ga, its occurrence time is sa and its elimination time is fa;
  • Step 5.2 Preset the values x and y; select the set of alarm information from x minutes before alarm source Ga occurs to y minutes after alarm source Ga is eliminated as one alarm transaction, that is, treat all alarm information in the period [sa-x, fa+y] as alarm transaction S(1);
  • Step 5.3 Set thresholds y_max and T_max in advance for the time interval of alarm transaction S(1) so that they satisfy the constraints of formula (2) and formula (3): fa-sa+y < y_max (2); x+y_max < T_max (3);
  • Step 5.4 If the period [sa, fa+y] contains alarm information marked with another alarm source Gb, merge the alarm information from x minutes before alarm source Gb occurs and y minutes after alarm source Gb is eliminated into alarm transaction S(1), that is, treat the alarm information in the time interval [sa-x, min(max(fa, fb)+y, sa-x+T_max)] as one alarm transaction.
  • Figure 1 is a schematic flow chart of the AIOps-based fault location method spanning the database and infrastructure provided by the present invention;
  • Figure 2 is a schematic diagram of alarm transaction segmentation provided by the present invention;
  • Figure 3 is a schematic diagram of the alarm transactions of Ga and Gb being merged into one alarm transaction, as provided by the present invention;
  • Figure 4 is a schematic diagram of the CNN (convolutional neural network) provided by the present invention.
  • The technical solution closest to this application is the invention patent with application number CN201610922085.8, a performance fault location method for distributed databases.
  • That invention locates the performance fault node whose execution is slow and determines whether the SQL execution plan of that node has changed. If so, performance fault location is complete and the node's SQL execution plan is optimized. If not, the system resource load, coordinator performance and user network status are checked in turn until the performance fault is located.
  • That patent uses only the information of whether the SQL execution plan of the faulty node has changed to identify where the database performance fault lies.
  • By comparison, the present invention, based on 6 types of key performance indicators of the IaaS infrastructure layer and 39 types of database operation alarm information, creatively aggregates the various types of alarm information, builds an artificial intelligence model to analyze the root cause of faults, and uses correlation analysis to dig deeply into the root cause of problems.
  • The present invention focuses specifically on the database, makes fuller analysis and use of database alarms, and is more practical for improving the processing performance of the database.
  • The present invention can fully exploit the data processing capabilities of the database, improve the stability and efficiency of database operation, fundamentally improve data processing capabilities in the enterprise environment, and further enhance the value of intelligent operation and maintenance work.
  • Given the state of the prior art, this application aims to apply artificial intelligence technology to database operation and maintenance, link information from the database through to the equipment of the IaaS infrastructure layer, and quickly perform fault location and root cause analysis based on database alarm information.
  • The present invention provides an AIOps-based fault location method spanning the database and infrastructure. Referring to Figure 1, it includes the following steps:
  • Step 1 Build an intelligent operation and maintenance big data distributed platform.
  • the intelligent operation and maintenance big data distributed platform includes a distributed storage unit and a distributed computing platform;
  • This intelligent operation and maintenance big data distributed platform is based on open-source Hadoop ecosystem components such as HDFS, YARN, ZooKeeper, Hive and HBase, and on computing engines such as Spark and Python; the distributed storage unit is used to collect key operating indicator vectors and system operation log data.
  • Step 2 Within a preset time period, collect the key performance indicator vectors of the IaaS infrastructure layer and the alarm information of database operation; wherein, each key performance indicator vector is an n-dimensional vector, including n key performance indicators;
  • the key performance indicator vector includes but is not limited to the following 6 key performance indicators, which are: server IP address, server CPU occupancy, server memory occupancy, server hard disk read and write rate, server hard disk space occupancy and network real-time rate.
  • Alarm information for database operation includes but is not limited to the following 39 categories: general alarm information, no-data alarm, SQL statement not yet complete, connection exception, triggered action exception, unsupported feature, invalid transaction initiation, locator exception, invalid role specification, diagnostics exception, cardinality violation, data exception, integrity constraint violation, invalid cursor state, invalid transaction state, invalid SQL statement name, triggered data change violation, invalid authorization specification, dependent privilege descriptors still exist, invalid transaction termination, SQL routine exception, invalid cursor name, external routine exception, external routine invocation exception, savepoint exception, invalid catalog name, invalid schema name, transaction rollback, syntax error or access rule violation, check option violation, insufficient resources, program limit exceeded, object not in prerequisite state, operator intervention, system error, snapshot failure, configuration file error, foreign data wrapper error, internal error alarm.
  • Step 3 Perform standardized preprocessing on the key performance indicator vectors of the IaaS infrastructure layer to obtain standardized key performance indicator vectors; the purpose of this step is to facilitate subsequent steps to accurately extract key information and avoid invalid data interference.
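  • Specifically, the key performance indicator vector is expressed as $X(t) = (X_1, X_2, \ldots, X_n)$, meaning the key performance indicator vector collected at collection time $t$, which includes n key performance indicators $X_1, X_2, \ldots, X_n$;
  • Assume that u key performance indicator vectors are collected within the preset time period: $X(t_1) = (X_{11}, X_{12}, \ldots, X_{1n})$, $X(t_2) = (X_{21}, X_{22}, \ldots, X_{2n})$, ..., $X(t_u) = (X_{u1}, X_{u2}, \ldots, X_{un})$, collected at times $t_1, t_2, \ldots, t_u$ respectively;
  • For the key performance indicator $X_{11}$, the following method is used for standardization to obtain the standardized key performance indicator $X'_{11}$: $X'_{11} = (X_{11} - \bar{X}) / \sigma$ (1), where $\bar{X}$ is the mean of $X_{11}, X_{21}, \ldots, X_{u1}$ and $\sigma$ is the standard deviation of $X_{11}, X_{21}, \ldots, X_{u1}$;
  • The other key performance indicators are standardized in the same way.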
  • Step 4 Jointly analyze the standardized key performance indicator vectors collected at different times and the alarm information generated at different times to obtain the source of the alarm that caused the alarm information;
  • Step 5 Divide a set of alarm information within a continuous period into an alarm transaction, thereby obtaining multiple alarm transactions;
  • the alarm source of each alarm transaction is a vector combination formed by the standardized key performance indicator vectors collected in the corresponding time period of the alarm transaction;
  • Specifically, the intelligent operation and maintenance big data distributed platform established in Step 1 is used to preprocess and manually label the database alarm information, so that subsequent steps can accurately extract key information and avoid interference from invalid data.
  • Referring to Figure 2, a set of alarm information within a continuous period is treated as one alarm transaction, and the alarm source marked in the alarm transaction is used to classify the source of that alarm transaction. This effectively aggregates alarm information, extracts key information and avoids interference.
  • Step 5 is specifically as follows:
  • Step 5.1 for a certain alarm source Ga, its occurrence time is sa and its elimination time is fa;
  • Step 5.2 preset x and y values
  • Step 5.3 Set thresholds y_max and T_max for the time interval of alarm transaction S(1) in advance so that they satisfy the constraints of formula (2) and formula (3):
  • Step 5.4 Referring to Figure 3, if the period [sa, fa+y] contains alarm information marked with another alarm source Gb, merge the alarm information from x minutes before alarm source Gb occurs and y minutes after alarm source Gb is eliminated into alarm transaction S(1), that is, treat the alarm information in the time interval [sa-x, min(max(fa, fb)+y, sa-x+T_max)] as one alarm transaction.
  • Step 5.5 Repeat steps 5.2 to 5.4 until the alarm transaction with Ga as the time center is determined.
  • Step 5.6 Sort all alarm sources marked in the alarm information according to the alarm generation time to extract alarm transactions.
  • Step 5.7 Starting from the first alarm source Ga marked in the alarm information, follow the above steps to determine the alarm transaction with Ga as the time center.
  • Step 5.8 In chronological order, determine the alarm transaction that has the next alarm source Gb as its time center. If alarm source Gb is already included in the previous alarm transaction, ignore Gb and continue with the next alarm source, until all alarm sources are included in alarm transactions.
  • Step 6 Use the alarm source marked by each alarm transaction as the label of the alarm transaction, use the alarm transaction as the input, and use the probability that each alarm transaction belongs to each type of alarm source as the output to train the CNN convolutional neural network.
  • the trained CNN convolutional neural network is the fault location and root cause analysis classification model;
  • the probability that each alarm transaction belongs to each alarm source is calculated through the CNN convolutional neural network. Among them, the greater the probability that an alarm transaction belongs to a certain alarm source, the greater the probability that this type of alarm is the source of this alarm transaction.
  • the CNN convolutional neural network structure is shown in Figure 4.
  • Step 7 real-time data fault diagnosis and root cause analysis:
  • During real-time operation of the database, when alarm information is generated, the alarm information within a continuous period is treated as an alarm transaction and input into the fault location and root cause analysis classification model; the model outputs the probability corresponding to each type of alarm source, and the alarm source with the highest probability is taken, which completes the database alarm root cause analysis.
  • Therefore, based on the intelligent operation and maintenance big data distributed platform established in Step 1, classifying the real-time data yields a database alarm root cause analysis covering the alarm transaction, server location, server CPU, server memory, server hard disk and network.
  • The present invention uses a specific method to define alarm transactions, with original calculation steps that extract key information and avoid interference from invalid data, so that a CNN (convolutional neural network) can be applied to the resulting data and calculation efficiency is improved. This is one of the key points of this application.
  • Compared with traditional manual handling of database alarm information, the present invention applies the CNN algorithm and improves on it. Applying the CNN makes it possible to quickly determine the root cause of a large volume of database alarm information.
  • The improvement of this patent is to introduce manual expert weighting to correct the calculation results of the CNN, which effectively avoids deviations in the results caused by an insufficient volume of collected database alarm information. This is one of the key points of this application.
  • The AIOps-based fault location method spanning the database and infrastructure provided by the invention applies artificial intelligence technology to database operation and maintenance, links information from the database through to the equipment of the IaaS infrastructure layer, and quickly performs fault location and root cause analysis based on database alarm information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Debugging And Monitoring (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

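A fault location method, based on artificial intelligence operation and maintenance (AIOps), that spans the database and its infrastructure, comprising the following steps: building an intelligent operation and maintenance big data distributed platform; collecting the key performance indicator vectors of the IaaS infrastructure layer and the alarm information of database operation; labeling alarm sources and dividing alarm transactions; training a CNN (convolutional neural network); and performing real-time fault diagnosis and root cause analysis. The method has the following advantage: it applies artificial intelligence technology to database operation and maintenance, links information from the database through to the equipment of the IaaS infrastructure layer, and quickly performs fault location and root cause analysis based on database alarm information.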

Description

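AIOps-based fault location method spanning the database and infrastructure
Technical Field
The present invention belongs to the field of information technology, and specifically relates to a fault location method, based on artificial intelligence operation and maintenance (AIOps), that spans the database and its infrastructure.
Background
With the rapid development of IT, modern networked, interactive business systems have replaced traditional ways of distributing business information that rely on physical media, such as paper books, magnetic tapes and optical discs, and networking has greatly improved business efficiency across industries. As these industries build out their business systems and deploy large numbers of computer and network hardware devices, operation and maintenance problems, and the demand for IT operation and maintenance work, have multiplied.
In the course of informatization and digital-intelligent transformation of large enterprises, traditional operation and maintenance methods are increasingly unable to meet the big-data-era demand for automated, efficient and intelligent operations. The traditional, passive approach of solving problems through manual intervention suffers from runaway costs, low efficiency and many other drawbacks, and can today cause enterprises losses that are hard to estimate. This bottleneck in operation and maintenance methods needs to be broken through. Databases, as the core of information systems and a basic data processing technology, are widely used and have become a core component of enterprise informatization. At present, however, enterprises and operations staff typically understand database operation and maintenance as little more than routine parameter settings and device-initiated alarms. Key indicator data across the infrastructure layer is underused, and joint analysis and correlation analysis are lacking. Once a database raises an alarm, operations staff usually maintain it only at a shallow database level, cannot dig into the root cause of the problem, and cannot fully exploit the potential of the overall network.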
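Summary of the Invention
In view of the defects of the prior art, the present invention provides an AIOps-based fault location method spanning the database and infrastructure, which can effectively solve the above problems.
The technical solution adopted by the present invention is as follows:
The present invention provides an AIOps-based fault location method spanning the database and infrastructure, comprising the following steps:
Step 1: build an intelligent operation and maintenance big data distributed platform, which includes a distributed storage unit and a distributed computing platform;
Step 2: within a preset time period, collect the key performance indicator vectors of the IaaS infrastructure layer and the alarm information of database operation; each key performance indicator vector is an n-dimensional vector containing n key performance indicators;
Step 3: perform standardized preprocessing on the key performance indicator vectors of the IaaS infrastructure layer to obtain standardized key performance indicator vectors;
Step 4: jointly analyze the standardized key performance indicator vectors collected at different times and the alarm information generated at different times to obtain the alarm source that caused the alarm information;
Step 5: divide a set of alarm information within a continuous period into an alarm transaction, thereby obtaining multiple alarm transactions; mark the alarm source of each alarm transaction; the alarm source of each alarm transaction is the vector combination formed by the standardized key performance indicator vectors collected in the time period corresponding to that alarm transaction;
Step 6: use the alarm source marked for each alarm transaction as the label of that alarm transaction, take the alarm transaction as input and the probability that each alarm transaction belongs to each type of alarm source as output, and train the CNN (convolutional neural network); the trained CNN is the fault location and root cause analysis classification model;
Step 7: real-time data fault diagnosis and root cause analysis:
During real-time operation of the database, when alarm information is generated, the alarm information within a continuous period is treated as an alarm transaction and input into the fault location and root cause analysis classification model; the model outputs the probability corresponding to each type of alarm source, and the alarm source with the highest probability is taken, which completes the database alarm root cause analysis.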
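Preferably, the key performance indicator vector includes 6 key performance indicators, which are: server IP address, server CPU occupancy, server memory occupancy, server hard disk read and write rate, server hard disk space occupancy and network real-time rate.
Preferably, the alarm information of database operation includes 39 categories, which are: general alarm information, no-data alarm, SQL statement not yet complete, connection exception, triggered action exception, unsupported feature, invalid transaction initiation, locator exception, invalid role specification, diagnostics exception, cardinality violation, data exception, integrity constraint violation, invalid cursor state, invalid transaction state, invalid SQL statement name, triggered data change violation, invalid authorization specification, dependent privilege descriptors still exist, invalid transaction termination, SQL routine exception, invalid cursor name, external routine exception, external routine invocation exception, savepoint exception, invalid catalog name, invalid schema name, transaction rollback, syntax error or access rule violation, check option violation, insufficient resources, program limit exceeded, object not in prerequisite state, operator intervention, system error, snapshot failure, configuration file error, foreign data wrapper error, internal error alarm.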
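Preferably, Step 3 is specifically:
The key performance indicator vector is expressed as $X(t) = (X_1, X_2, \ldots, X_n)$, meaning the key performance indicator vector collected at collection time $t$, which includes n key performance indicators $X_1, X_2, \ldots, X_n$;
Assume that u key performance indicator vectors are collected within the preset time period: $X(t_1) = (X_{11}, X_{12}, \ldots, X_{1n})$, $X(t_2) = (X_{21}, X_{22}, \ldots, X_{2n})$, ..., $X(t_u) = (X_{u1}, X_{u2}, \ldots, X_{un})$, collected at times $t_1, t_2, \ldots, t_u$ respectively;
For the key performance indicator $X_{11}$, the following method is used for standardization to obtain the standardized key performance indicator $X'_{11}$:
$$X'_{11} = \frac{X_{11} - \bar{X}}{\sigma} \qquad (1)$$
where:
$\bar{X}$ is the mean of $X_{11}, X_{21}, \ldots, X_{u1}$;
$\sigma$ is the standard deviation of $X_{11}, X_{21}, \ldots, X_{u1}$;
The other key performance indicators are standardized in the same way.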
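Preferably, Step 5 is specifically:
Step 5.1: for a certain alarm source Ga, its occurrence time is sa and its elimination time is fa;
Step 5.2: preset the values x and y;
Select the set of alarm information from x minutes before alarm source Ga occurs to y minutes after alarm source Ga is eliminated as one alarm transaction, that is, treat all alarm information in the period [sa-x, fa+y] as alarm transaction S(1);
Step 5.3: set thresholds y_max and T_max in advance for the time interval of alarm transaction S(1) so that they satisfy the constraints of formula (2) and formula (3):
fa-sa+y < y_max  (2)
x+y_max < T_max  (3)
Step 5.4: if the period [sa, fa+y] contains alarm information marked with another alarm source Gb, merge the alarm information from x minutes before alarm source Gb occurs and y minutes after alarm source Gb is eliminated into alarm transaction S(1), that is, treat the alarm information in the time interval [sa-x, min(max(fa, fb)+y, sa-x+T_max)] as one alarm transaction.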
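The AIOps-based fault location method spanning the database and infrastructure provided by the present invention has the following advantages:
It applies artificial intelligence technology to database operation and maintenance, links information from the database through to the equipment of the IaaS infrastructure layer, and quickly performs fault location and root cause analysis based on database alarm information.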
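Brief Description of the Drawings
Figure 1 is a schematic flow chart of the AIOps-based fault location method spanning the database and infrastructure provided by the present invention;
Figure 2 is a schematic diagram of alarm transaction segmentation provided by the present invention;
Figure 3 is a schematic diagram of the alarm transactions of Ga and Gb being merged into one alarm transaction, as provided by the present invention;
Figure 4 is a schematic diagram of the CNN (convolutional neural network) provided by the present invention.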
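Detailed Description
To make the technical problems solved by the present invention, its technical solutions and its beneficial effects clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and not to limit it.
With the rapid development of IT, modern networked, interactive business systems have replaced traditional ways of distributing business information that rely on physical media, such as paper books, magnetic tapes and optical discs, and networking has greatly improved business efficiency across industries. As these industries build out their business systems and deploy large numbers of computer and network hardware devices, operation and maintenance problems, and the demand for IT operation and maintenance work, have multiplied. In recent years, with the development of artificial intelligence, enterprises have gradually adopted AI algorithms, which have given many of them solutions to long-standing industry problems; combining IT operation and maintenance with artificial intelligence gave rise to intelligent operation and maintenance, AIOps (Artificial Intelligence for IT Operations). Using machine learning and big data tools to model and analyze the key performance indicator (KPI) data and log data of specific classes of equipment, studying algorithm models for fault prediction, diagnosis and root cause analysis, improving the efficiency of fault discovery and handling, and helping large enterprises make IT operation and maintenance more fine-grained and intelligent is an important direction for the future development of intelligent operation and maintenance.
The technical solution closest to this application is the invention patent with application number CN201610922085.8, a performance fault location method for distributed databases. That invention locates the performance fault node whose execution is slow and determines whether the SQL execution plan of that node has changed; if so, performance fault location is complete and the node's SQL execution plan is optimized; if not, the system resource load, coordinator performance and user network status are checked in turn until the performance fault is located. That patent uses only the information of whether the SQL execution plan of the faulty node has changed to identify where the database performance fault lies. By comparison, the present invention, based on 6 types of key performance indicators of the IaaS infrastructure layer and 39 types of database operation alarm information, creatively aggregates the various types of alarm information, builds an artificial intelligence model to analyze the root cause of faults, and uses correlation analysis to dig deeply into the root cause of problems. The present invention focuses specifically on the database, makes fuller analysis and use of database alarms, and is more practical for improving the processing performance of the database. The present invention can fully exploit the data processing capabilities of the database, improve the stability and efficiency of database operation, fundamentally improve data processing capabilities in the enterprise environment, and further enhance the value of intelligent operation and maintenance work.
Given the state of the prior art, this application aims to apply artificial intelligence technology to database operation and maintenance, link information from the database through to the equipment of the IaaS infrastructure layer, and quickly perform fault location and root cause analysis based on database alarm information.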
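The present invention provides an AIOps-based fault location method spanning the database and infrastructure which, referring to Figure 1, comprises the following steps:
Step 1: build an intelligent operation and maintenance big data distributed platform, which includes a distributed storage unit and a distributed computing platform;
This intelligent operation and maintenance big data distributed platform is based on open-source Hadoop ecosystem components such as HDFS, YARN, ZooKeeper, Hive and HBase, and on computing engines such as Spark and Python; the distributed storage unit is used to collect key operating indicator vectors and system operation log data.
Step 2: within a preset time period, collect the key performance indicator vectors of the IaaS infrastructure layer and the alarm information of database operation; each key performance indicator vector is an n-dimensional vector containing n key performance indicators;
As one specific implementation, the key performance indicator vector includes but is not limited to the following 6 key performance indicators: server IP address, server CPU occupancy, server memory occupancy, server hard disk read and write rate, server hard disk space occupancy and network real-time rate.
The alarm information of database operation includes but is not limited to the following 39 categories: general alarm information, no-data alarm, SQL statement not yet complete, connection exception, triggered action exception, unsupported feature, invalid transaction initiation, locator exception, invalid role specification, diagnostics exception, cardinality violation, data exception, integrity constraint violation, invalid cursor state, invalid transaction state, invalid SQL statement name, triggered data change violation, invalid authorization specification, dependent privilege descriptors still exist, invalid transaction termination, SQL routine exception, invalid cursor name, external routine exception, external routine invocation exception, savepoint exception, invalid catalog name, invalid schema name, transaction rollback, syntax error or access rule violation, check option violation, insufficient resources, program limit exceeded, object not in prerequisite state, operator intervention, system error, snapshot failure, configuration file error, foreign data wrapper error, internal error alarm.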
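Step 3: perform standardized preprocessing on the key performance indicator vectors of the IaaS infrastructure layer to obtain standardized key performance indicator vectors; the purpose of this step is to let subsequent steps accurately extract key information and avoid interference from invalid data.
Specifically, the following method can be used for standardization:
The key performance indicator vector is expressed as $X(t) = (X_1, X_2, \ldots, X_n)$, meaning the key performance indicator vector collected at collection time $t$, which includes n key performance indicators $X_1, X_2, \ldots, X_n$;
Assume that u key performance indicator vectors are collected within the preset time period: $X(t_1) = (X_{11}, X_{12}, \ldots, X_{1n})$, $X(t_2) = (X_{21}, X_{22}, \ldots, X_{2n})$, ..., $X(t_u) = (X_{u1}, X_{u2}, \ldots, X_{un})$, collected at times $t_1, t_2, \ldots, t_u$ respectively;
For the key performance indicator $X_{11}$, the following method is used for standardization to obtain the standardized key performance indicator $X'_{11}$:
$$X'_{11} = \frac{X_{11} - \bar{X}}{\sigma} \qquad (1)$$
where:
$\bar{X}$ is the mean of $X_{11}, X_{21}, \ldots, X_{u1}$;
$\sigma$ is the standard deviation of $X_{11}, X_{21}, \ldots, X_{u1}$;
The other key performance indicators are standardized in the same way.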
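The column-wise standardization of Step 3 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes that only numeric indicators are standardized (the server IP address would be handled separately), that the population standard deviation is intended, and that a constant column maps to 0. The function name and sample values are hypothetical.

```python
from statistics import mean, pstdev

def standardize_kpis(samples):
    """Z-score each indicator (column) across the u collected KPI vectors (rows)."""
    columns = list(zip(*samples))             # one tuple per indicator
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]   # population std: an assumption
    return [
        # A constant column has std 0; map it to 0.0 instead of dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in samples
    ]

# Three made-up samples of (CPU occupancy %, memory occupancy %) at t1, t2, t3.
samples = [[10.0, 50.0], [20.0, 50.0], [30.0, 50.0]]
standardized = standardize_kpis(samples)
```

After standardization each varying indicator has zero mean and unit variance over the collection period, so indicators measured in different units become comparable.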
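Step 4: jointly analyze the standardized key performance indicator vectors collected at different times and the alarm information generated at different times to obtain the alarm source that caused the alarm information;
Step 5: divide a set of alarm information within a continuous period into an alarm transaction, thereby obtaining multiple alarm transactions;
Mark the alarm source of each alarm transaction; the alarm source of each alarm transaction is the vector combination formed by the standardized key performance indicator vectors collected in the time period corresponding to that alarm transaction;
Specifically, the intelligent operation and maintenance big data distributed platform established in Step 1 is used to preprocess and manually label the database alarm information, so that subsequent steps can accurately extract key information and avoid interference from invalid data.
Referring to Figure 2, a set of alarm information within a continuous period is treated as one alarm transaction, and the alarm source marked in the alarm transaction is used to classify the source of that alarm transaction; this effectively aggregates alarm information, extracts key information and avoids interference.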
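Step 5 is specifically:
Step 5.1: for a certain alarm source Ga, its occurrence time is sa and its elimination time is fa;
Step 5.2: preset the values x and y;
Select the set of alarm information from x minutes before alarm source Ga occurs to y minutes after alarm source Ga is eliminated as one alarm transaction, that is, treat all alarm information in the period [sa-x, fa+y] as alarm transaction S(1);
Step 5.3: set thresholds y_max and T_max in advance for the time interval of alarm transaction S(1) so that they satisfy the constraints of formula (2) and formula (3):
fa-sa+y < y_max  (2)
x+y_max < T_max  (3)
Step 5.4: referring to Figure 3, if the period [sa, fa+y] contains alarm information marked with another alarm source Gb, merge the alarm information from x minutes before alarm source Gb occurs and y minutes after alarm source Gb is eliminated into alarm transaction S(1), that is, treat the alarm information in the time interval [sa-x, min(max(fa, fb)+y, sa-x+T_max)] as one alarm transaction.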
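Steps 5.1 to 5.4 amount to a small interval computation, sketched below. Note that formulas (2) and (3) together give (fa - sa + y) + x < y_max + x < T_max, so the unmerged window [sa-x, fa+y] is always shorter than T_max; the cap only matters after a merge. The sketch assumes times are numbers in minutes and reads "the period contains alarm information marked with Gb" as "Gb's occurrence time sb lies in [sa, fa+y]"; the function name and the `gb` parameter are hypothetical.

```python
def transaction_window(sa, fa, x, y, y_max, t_max, gb=None):
    """Return the (start, end) interval of the alarm transaction around Ga.

    sa, fa -- occurrence and elimination time of alarm source Ga, in minutes
    gb     -- optional (sb, fb) pair for another alarm source Gb
    """
    if not fa - sa + y < y_max:   # formula (2)
        raise ValueError("formula (2) requires fa - sa + y < y_max")
    if not x + y_max < t_max:     # formula (3)
        raise ValueError("formula (3) requires x + y_max < T_max")
    start, end = sa - x, fa + y   # step 5.2: [sa - x, fa + y]
    if gb is not None:
        sb, fb = gb
        if sa <= sb <= fa + y:    # assumed reading of "contains Gb"
            # Step 5.4: extend to cover Gb, capped at T_max after the start.
            end = min(max(fa, fb) + y, start + t_max)
    return start, end

alone = transaction_window(100, 110, x=5, y=5, y_max=20, t_max=30)
merged = transaction_window(100, 110, x=5, y=5, y_max=20, t_max=30, gb=(112, 114))
capped = transaction_window(100, 110, x=5, y=5, y_max=20, t_max=30, gb=(112, 130))
```

In the third call Gb is eliminated so late that max(fa, fb)+y would exceed the allowed length, so the window ends at sa-x+T_max instead.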
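In a specific implementation, the following steps may also be performed:
Step 5.5: repeat steps 5.2 to 5.4 until the alarm transaction with Ga as the time center is determined.
Step 5.6: sort all alarm sources marked in the alarm information by alarm generation time, for use in extracting alarm transactions.
Step 5.7: starting from the first alarm source Ga marked in the alarm information, follow the steps above to determine the alarm transaction with Ga as the time center.
Step 5.8: in chronological order, determine the alarm transaction that has the next alarm source Gb as its time center. If alarm source Gb is already included in the previous alarm transaction, ignore Gb and continue with the next alarm source, until all alarm sources are included in alarm transactions.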
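The extraction loop of steps 5.5 to 5.8 might look like the following sketch. It is one possible reading rather than the patent's procedure: repeated merging is interpreted as testing each later source against the current (already extended) window end, and the threshold checks of formulas (2) and (3) are omitted for brevity. The function name and the tuple layout are hypothetical.

```python
def extract_transactions(sources, x, y, t_max):
    """Group labeled alarm sources into alarm transactions.

    sources -- iterable of (name, occurrence_time, elimination_time), in minutes
    Returns a list of (window_start, window_end, member_names).
    """
    ordered = sorted(sources, key=lambda s: s[1])    # step 5.6: sort by time
    covered = set()
    transactions = []
    for i, (name, sa, fa) in enumerate(ordered):
        if name in covered:                          # step 5.8: already merged
            continue
        start, end, latest = sa - x, fa + y, fa      # step 5.2 window
        members = [name]
        for other, sb, fb in ordered[i + 1:]:        # step 5.5: repeat merging
            if other not in covered and sa <= sb <= end:
                latest = max(latest, fb)
                end = min(latest + y, start + t_max)  # step 5.4 with the cap
                members.append(other)
        covered.update(members)
        transactions.append((start, end, members))
    return transactions

# Made-up sources, deliberately given out of order.
sources = [("Gc", 200, 205), ("Ga", 100, 110), ("Gb", 112, 114)]
transactions = extract_transactions(sources, x=5, y=5, t_max=30)
```

Here Gb starts inside Ga's window and is absorbed into it, while Gc is far enough away to become the center of its own transaction.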
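Step 6: use the alarm source marked for each alarm transaction as the label of that alarm transaction, take the alarm transaction as input and the probability that each alarm transaction belongs to each type of alarm source as output, and train the CNN (convolutional neural network); the trained CNN is the fault location and root cause analysis classification model;
The CNN calculates the probability that each alarm transaction belongs to each alarm source. The greater the probability that an alarm transaction belongs to a certain alarm source, the more likely it is that this type of alarm is the source of the alarm transaction. The CNN structure is shown in Figure 4.
After this step, the method may further include:
After the probability that each alarm transaction belongs to each alarm source is obtained, the probabilities are corrected, according to their magnitude, by multiplying them by coefficients between 0 and 1 assigned manually by experts, finally giving the probability that each alarm transaction belongs to its highest-probability alarm source.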
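The expert correction can be illustrated with a short sketch. The text does not say how coefficients are keyed, so this example assumes one coefficient per alarm source, defaulting to 1.0 (no correction) when the expert gives none; the function name, source names and numbers are all hypothetical.

```python
def apply_expert_coefficients(probabilities, coefficients):
    """Scale model probabilities by expert coefficients in [0, 1], then pick the top source."""
    corrected = {}
    for source, p in probabilities.items():
        c = coefficients.get(source, 1.0)   # no expert input: leave unchanged
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"coefficient for {source!r} must lie in [0, 1]")
        corrected[source] = p * c
    best = max(corrected, key=corrected.get)
    return best, corrected[best]

# Made-up CNN output for one alarm transaction and made-up expert weights.
model_output = {"server CPU": 0.5, "server memory": 0.3, "server hard disk": 0.2}
expert = {"server CPU": 0.5, "server memory": 1.0}
best_source, best_probability = apply_expert_coefficients(model_output, expert)
```

Because the expert halves the weight of the CPU source, the corrected ranking puts server memory first even though the raw model output favored the CPU.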
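Step 7: real-time data fault diagnosis and root cause analysis:
During real-time operation of the database, when alarm information is generated, the alarm information within a continuous period is treated as an alarm transaction and input into the fault location and root cause analysis classification model; the model outputs the probability corresponding to each type of alarm source, and the alarm source with the highest probability is taken, which completes the database alarm root cause analysis.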
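A rough sketch of the real-time path follows. Two things here are assumptions rather than part of the source: live alarms are split into transactions wherever the gap between consecutive alarms exceeds `max_gap` minutes (at inference time no labeled sources exist to center a window on), and `stub_classifier` is a hypothetical stand-in for the trained CNN that returns made-up probabilities.

```python
def group_live_alarms(events, max_gap):
    """Split (timestamp, category) events into runs whose gaps are <= max_gap."""
    transactions = []
    for ts, category in sorted(events):
        if transactions and ts - transactions[-1][-1][0] <= max_gap:
            transactions[-1].append((ts, category))
        else:
            transactions.append([(ts, category)])
    return transactions

def diagnose(transaction, classify):
    """Return the alarm source with the highest probability, and that probability."""
    probabilities = classify(transaction)
    source = max(probabilities, key=probabilities.get)
    return source, probabilities[source]

def stub_classifier(transaction):
    # Hypothetical placeholder for the trained model: favors "server memory"
    # when the transaction contains many "insufficient resources" alarms.
    hits = sum(1 for _, c in transaction if c == "insufficient resources")
    p_memory = (1 + hits) / (2 + len(transaction))
    return {"server memory": p_memory, "network": 1 - p_memory}

events = [(0, "connection exception"), (1, "insufficient resources"),
          (2, "insufficient resources"), (30, "system error")]
live = group_live_alarms(events, max_gap=5)
source, probability = diagnose(live[0], stub_classifier)
```

Passing the classifier in as a callable keeps the grouping and arg-max logic independent of whichever model framework actually hosts the CNN.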
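Therefore, based on the intelligent operation and maintenance big data distributed platform established in Step 1, classifying the real-time data yields a database alarm root cause analysis covering the alarm transaction, server location, server CPU, server memory, server hard disk and network.
The AIOps-based fault location method spanning the database and infrastructure provided by the present invention has the following characteristics:
1. The present invention uses a specific method to define alarm transactions, with original calculation steps that extract key information and avoid interference from invalid data, so that a CNN (convolutional neural network) can be applied to the resulting data and calculation efficiency is improved. This is one of the key points of this application.
2. Compared with traditional manual handling of database alarm information, the present invention applies the CNN algorithm and improves on it. Applying the CNN makes it possible to quickly determine the root cause of a large volume of database alarm information; the improvement of this patent is to introduce manual expert weighting to correct the calculation results of the CNN, which effectively avoids deviations in the results caused by an insufficient volume of collected database alarm information. This is one of the key points of this application.
The AIOps-based fault location method spanning the database and infrastructure provided by the present invention applies artificial intelligence technology to database operation and maintenance, links information from the database through to the equipment of the IaaS infrastructure layer, and quickly performs fault location and root cause analysis based on database alarm information.
The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the scope of protection of the present invention.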

Claims (5)

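  1. A fault location method, based on artificial intelligence operation and maintenance (AIOps), that spans the database and its infrastructure, characterized by comprising the following steps:
    Step 1: build an intelligent operation and maintenance big data distributed platform, which includes a distributed storage unit and a distributed computing platform;
    Step 2: within a preset time period, collect the key performance indicator vectors of the IaaS infrastructure layer and the alarm information of database operation; each key performance indicator vector is an n-dimensional vector containing n key performance indicators;
    Step 3: perform standardized preprocessing on the key performance indicator vectors of the IaaS infrastructure layer to obtain standardized key performance indicator vectors;
    Step 4: jointly analyze the standardized key performance indicator vectors collected at different times and the alarm information generated at different times to obtain the alarm source that caused the alarm information;
    Step 5: divide a set of alarm information within a continuous period into an alarm transaction, thereby obtaining multiple alarm transactions; mark the alarm source of each alarm transaction; the alarm source of each alarm transaction is the vector combination formed by the standardized key performance indicator vectors collected in the time period corresponding to that alarm transaction;
    Step 6: use the alarm source marked for each alarm transaction as the label of that alarm transaction, take the alarm transaction as input and the probability that each alarm transaction belongs to each type of alarm source as output, and train the CNN (convolutional neural network); the trained CNN is the fault location and root cause analysis classification model;
    Step 7: real-time data fault diagnosis and root cause analysis:
    During real-time operation of the database, when alarm information is generated, the alarm information within a continuous period is treated as an alarm transaction and input into the fault location and root cause analysis classification model; the model outputs the probability corresponding to each type of alarm source, and the alarm source with the highest probability is taken, which completes the database alarm root cause analysis.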
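  2. The AIOps-based fault location method spanning the database and infrastructure according to claim 1, characterized in that the key performance indicator vector includes 6 key performance indicators, which are: server IP address, server CPU occupancy, server memory occupancy, server hard disk read and write rate, server hard disk space occupancy and network real-time rate.
  3. The AIOps-based fault location method spanning the database and infrastructure according to claim 1, characterized in that the alarm information of database operation includes 39 categories, which are: general alarm information, no-data alarm, SQL statement not yet complete, connection exception, triggered action exception, unsupported feature, invalid transaction initiation, locator exception, invalid role specification, diagnostics exception, cardinality violation, data exception, integrity constraint violation, invalid cursor state, invalid transaction state, invalid SQL statement name, triggered data change violation, invalid authorization specification, dependent privilege descriptors still exist, invalid transaction termination, SQL routine exception, invalid cursor name, external routine exception, external routine invocation exception, savepoint exception, invalid catalog name, invalid schema name, transaction rollback, syntax error or access rule violation, check option violation, insufficient resources, program limit exceeded, object not in prerequisite state, operator intervention, system error, snapshot failure, configuration file error, foreign data wrapper error, internal error alarm.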
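  4. The AIOps-based fault location method spanning the database and infrastructure according to claim 1, wherein Step 3 is specifically:
    The key performance indicator vector is expressed as $X(t) = (X_1, X_2, \ldots, X_n)$, meaning the key performance indicator vector collected at collection time $t$, which includes n key performance indicators $X_1, X_2, \ldots, X_n$;
    Assume that u key performance indicator vectors are collected within the preset time period: $X(t_1) = (X_{11}, X_{12}, \ldots, X_{1n})$, $X(t_2) = (X_{21}, X_{22}, \ldots, X_{2n})$, ..., $X(t_u) = (X_{u1}, X_{u2}, \ldots, X_{un})$, collected at times $t_1, t_2, \ldots, t_u$ respectively;
    For the key performance indicator $X_{11}$, the following method is used for standardization to obtain the standardized key performance indicator $X'_{11}$:
    $$X'_{11} = \frac{X_{11} - \bar{X}}{\sigma} \qquad (1)$$
    where:
    $\bar{X}$ is the mean of $X_{11}, X_{21}, \ldots, X_{u1}$;
    $\sigma$ is the standard deviation of $X_{11}, X_{21}, \ldots, X_{u1}$;
    The other key performance indicators are standardized in the same way.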
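  5. The AIOps-based fault location method spanning the database and infrastructure according to claim 1, wherein Step 5 is specifically:
    Step 5.1: for a certain alarm source Ga, its occurrence time is sa and its elimination time is fa;
    Step 5.2: preset the values x and y;
    Select the set of alarm information from x minutes before alarm source Ga occurs to y minutes after alarm source Ga is eliminated as one alarm transaction, that is, treat all alarm information in the period [sa-x, fa+y] as alarm transaction S(1);
    Step 5.3: set thresholds y_max and T_max in advance for the time interval of alarm transaction S(1) so that they satisfy the constraints of formula (2) and formula (3):
    fa-sa+y < y_max  (2)
    x+y_max < T_max  (3)
    Step 5.4: if the period [sa, fa+y] contains alarm information marked with another alarm source Gb, merge the alarm information from x minutes before alarm source Gb occurs and y minutes after alarm source Gb is eliminated into alarm transaction S(1), that is, treat the alarm information in the time interval [sa-x, min(max(fa, fb)+y, sa-x+T_max)] as one alarm transaction.
PCT/CN2022/139853 2022-06-29 2022-12-19 AIOps-based fault location method spanning the database and infrastructure WO2024001080A1 (zh)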

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210746736.8 2022-06-29
CN202210746736.8A CN114968727B (zh) 2022-06-29 2022-06-29 AIOps-based fault location method spanning the database and infrastructure

Publications (1)

Publication Number Publication Date
WO2024001080A1 true WO2024001080A1 (zh) 2024-01-04

Family

ID=82965428

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/139853 WO2024001080A1 (zh) 2022-06-29 2022-12-19 AIOps-based fault location method spanning the database and infrastructure

Country Status (2)

Country Link
CN (1) CN114968727B (zh)
WO (1) WO2024001080A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114968727B (zh) * 2022-06-29 2023-02-10 北京柏睿数据技术股份有限公司 AIOps-based fault location method spanning the database and infrastructure
CN116016120A (zh) * 2023-01-05 2023-04-25 中国联合网络通信集团有限公司 Fault handling method, terminal device and readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110932899A (zh) * 2019-11-28 2020-03-27 杭州东方通信软件技术有限公司 Research method and system for intelligent fault compression using AI
CN110943857A (zh) * 2019-11-20 2020-03-31 国网湖北省电力有限公司信息通信公司 Convolutional-neural-network-based fault analysis and location method for power communication networks
CN111897673A (zh) * 2020-07-31 2020-11-06 平安科技(深圳)有限公司 Operation and maintenance fault root cause identification method, apparatus, computer device and storage medium
CN112003718A (zh) * 2020-09-25 2020-11-27 南京邮电大学 Deep-learning-based network alarm location method
US20220070050A1 (en) * 2020-08-28 2022-03-03 Ciena Corporation Aggregating alarms into clusters to display service-affecting events on a graphical user interface
US20220086036A1 (en) * 2019-05-25 2022-03-17 Huawei Technologies Co., Ltd. Alarm Analysis Method and Related Device
CN114968727A (zh) * 2022-06-29 2022-08-30 北京柏睿数据技术股份有限公司 AIOps-based fault location method spanning the database and infrastructure

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102098175B (zh) * 2011-01-26 2015-07-01 浪潮通信信息系统有限公司 Method for obtaining alarm association rules for the mobile Internet
CN107196804B (zh) * 2017-06-01 2020-07-10 国网山东省电力公司信息通信公司 System and method for centralized alarm monitoring of power system terminal communication access networks
US10977154B2 (en) * 2018-08-03 2021-04-13 Dynatrace Llc Method and system for automatic real-time causality analysis of end user impacting system anomalies using causality rules and topological understanding of the system to effectively filter relevant monitoring data
CN111342997B (zh) * 2020-02-06 2022-08-09 烽火通信科技股份有限公司 Method for constructing a deep neural network model, and fault diagnosis method and system
CN112395170A (zh) * 2020-12-07 2021-02-23 平安普惠企业管理有限公司 Intelligent fault analysis method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN114968727B (zh) 2023-02-10
CN114968727A (zh) 2022-08-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22949156

Country of ref document: EP

Kind code of ref document: A1