JP2024521357A - Detecting datacenter mass outage with near real-time/offline data using ML models - Google Patents

Detecting datacenter mass outage with near real-time/offline data using ML models

Info

Publication number
JP2024521357A
Authority
JP
Japan
Prior art keywords
data
fault
data center
input data
sources
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2023574401A
Other languages
English (en)
Japanese (ja)
Other versions
JP2024521357A5
JPWO2022256330A5
Inventor
Monga, Amarpal Singh
Chen, Bin
Hamilton, Alex Edward
Original Assignee
Oracle International Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oracle International Corporation
Publication of JP2024521357A
Publication of JP2024521357A5
Publication of JPWO2022256330A5
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0769 Readable error formats, e.g. cross-platform generic formats, human understandable formats
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2252 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using fault dictionaries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2257 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using expert systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • G06F11/3062 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations where the monitored property is the power consumption
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2379 Updates performed during online database operations; commit processing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/301 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is a virtual computing platform, e.g. logically partitioned systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)
JP2023574401A 2021-06-03 2022-05-31 Detecting datacenter mass outage with near real-time/offline data using ML models Pending JP2024521357A (ja)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US17/338,478 2021-06-03
US17/338,478 US11397634B1 (en) 2021-06-03 2021-06-03 Detecting datacenter mass outage with near real-time/offline using ML models
PCT/US2022/031614 WO2022256330A1 (en) 2021-06-03 2022-05-31 Detecting datacenter mass outage with near real-time/offline data using ml models

Publications (3)

Publication Number Publication Date
JP2024521357A (ja) 2024-05-31
JP2024521357A5 2025-05-30
JPWO2022256330A5 2025-05-30

Family

ID=82399474

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2023574401A Pending JP2024521357A (ja) 2021-06-03 2022-05-31 Detecting datacenter mass outage with near real-time/offline data using ML models

Country Status (5)

Country Link
US (3) US11397634B1
EP (1) EP4348429B1
JP (1) JP2024521357A
CN (1) CN117280327B
WO (1) WO2022256330A1

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115211168B (zh) * 2020-08-14 2025-06-17 ZTE Corporation AI-based load prediction method
US11397634B1 (en) 2021-06-03 2022-07-26 Oracle International Corporation Detecting datacenter mass outage with near real-time/offline using ML models
US11706130B2 (en) * 2021-07-19 2023-07-18 Cisco Technology, Inc. Root-causing user experience anomalies to coordinate reactive policies in application-aware routing
CN119604834A (zh) 2022-09-23 2025-03-11 Oracle International Corporation Datacenter level power management with reactive power capping
US12007734B2 (en) 2022-09-23 2024-06-11 Oracle International Corporation Datacenter level power management with reactive power capping
US12155210B2 (en) 2022-11-08 2024-11-26 Oracle International Corporation Techniques for orchestrated load shedding
CN120167057A (zh) * 2022-11-08 2025-06-17 Oracle International Corporation Techniques for orchestrated load shedding
US12045125B2 (en) * 2022-11-15 2024-07-23 Sap Se Alert aggregation and health issues processing in a cloud environment
US20250053472A1 (en) * 2023-08-09 2025-02-13 Siemens Aktiengesellschaft System and corresponding computer-implemented method for identifying exception in behavior of a computer system or of an application executed on the computer system
US20250245630A1 (en) * 2024-01-27 2025-07-31 Charter Communications Operating, Llc Automated power outage detection, reporting and mitigation
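CN120386650A (zh) * 2024-01-29 2025-07-29 Hangzhou Alibaba Cloud Feitian Information Technology Co., Ltd. Downtime fault analysis method, apparatus, electronic device, and medium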
US20260037397A1 (en) * 2024-08-05 2026-02-05 Dell Products L.P. Failover and synchronization management for databases

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6889908B2 (en) * 2003-06-30 2005-05-10 International Business Machines Corporation Thermal analysis in a data processing system
WO2008052982A1 (en) * 2006-10-30 2008-05-08 Thomson Licensing Method for indicating a service interruption source
US9843948B2 (en) * 2015-03-18 2017-12-12 T-Mobile Usa, Inc. Pathway-based data interruption detection
US10055277B1 (en) * 2015-09-30 2018-08-21 Amdocs Development Limited System, method, and computer program for performing health checks on a system including a plurality of heterogeneous system components
US10761921B2 (en) * 2017-11-30 2020-09-01 Optumsoft, Inc. Automatic root cause analysis using ternary fault scenario representation
US10623273B2 (en) * 2018-01-02 2020-04-14 Cisco Technology, Inc. Data source modeling to detect disruptive changes in data dynamics
US20190228296A1 (en) * 2018-01-19 2019-07-25 EMC IP Holding Company LLC Significant events identifier for outlier root cause investigation
CN108469987B (zh) * 2018-02-26 2020-12-29 East China Normal University Interrupt verification system based on an interrupt control flow graph
US10977154B2 (en) * 2018-08-03 2021-04-13 Dynatrace Llc Method and system for automatic real-time causality analysis of end user impacting system anomalies using causality rules and topological understanding of the system to effectively filter relevant monitoring data
EP3637261A1 (en) * 2018-10-10 2020-04-15 Schneider Electric IT Corporation Systems and methods for automatically generating a data center network mapping for automated alarm consolidation
US11126493B2 (en) * 2018-11-25 2021-09-21 Aloke Guha Methods and systems for autonomous cloud application operations
US11347576B2 (en) * 2019-07-23 2022-05-31 Vmware, Inc. Root cause analysis of non-deterministic performance anomalies
US11061393B2 (en) * 2019-08-28 2021-07-13 International Business Machines Corporation Consolidating anomaly root causes and alarms using graphical granger models
US11252014B2 (en) * 2019-09-30 2022-02-15 Dynatrace Llc Forming root cause groups of incidents in clustered distributed system through horizontal and vertical aggregation
US20220027257A1 (en) * 2020-07-23 2022-01-27 Vmware, Inc. Automated Methods and Systems for Managing Problem Instances of Applications in a Distributed Computing Facility
US11176016B1 (en) * 2020-09-22 2021-11-16 International Business Machines Corporation Detecting and managing anomalies in underground sensors for agricultural applications
US12326777B2 (en) * 2021-04-16 2025-06-10 Workspot, Inc. Method and system for real-time identification of root cause of a fault in a globally distributed virtual desktop fabric
US11675648B2 (en) * 2021-04-27 2023-06-13 Microsoft Technology Licensing, Llc Automatic triaging of diagnostics failures
US12204431B2 (en) * 2021-05-07 2025-01-21 Dynatrace Llc Method and system for the on-demand generation of graph-like models out of multidimensional observation data
US11397634B1 (en) * 2021-06-03 2022-07-26 Oracle International Corporation Detecting datacenter mass outage with near real-time/offline using ML models

Also Published As

Publication number Publication date
US20220391278A1 (en) 2022-12-08
US11397634B1 (en) 2022-07-26
EP4348429A1 (en) 2024-04-10
US12045123B2 (en) 2024-07-23
US20230251920A1 (en) 2023-08-10
WO2022256330A1 (en) 2022-12-08
CN117280327B (zh) 2024-04-05
US11656928B2 (en) 2023-05-23
CN117280327A (zh) 2023-12-22
EP4348429B1 (en) 2025-05-07

Similar Documents

Publication Publication Date Title
EP4348429B1 (en) Detecting datacenter mass outage with near real-time/offline data using ml models
JP7818616B2 (ja) Techniques for verifying network policies in container frameworks
US11843510B2 (en) Automatically inferring software-defined network policies from the observed workload in a computing environment
US11797414B2 (en) Method and system for failure prediction in cloud computing platforms
US12242332B2 (en) Identifying root cause anomalies in time series
US12135991B2 (en) Management plane orchestration across service cells
US10185614B2 (en) Generic alarm correlation by means of normalized alarm codes
US20240061939A1 (en) Threat change analysis system
EP3239840B1 (en) Fault information provision server and fault information provision method
US11381451B2 (en) Methods, systems, and computer readable mediums for selecting and configuring a computing system to support a replicated application
US20220394107A1 (en) Techniques for managing distributed computing components
US20230342125A1 (en) Enforcement of environmental conditions for cloud applications
JP2024546424A (ja) Edge attestation for authorization of a computing node in a cloud infrastructure system
US20180060987A1 (en) Identification of abnormal behavior in human activity based on internet of things collected data
US11563628B1 (en) Failure detection in cloud-computing systems
US20230403291A1 (en) Framework for anomaly detection in a cloud environment
JP2025523447A (ja) Instruction monitoring for dynamic cloud workload reallocation based on ransomware attacks
US11366651B2 (en) Framework for hardware-specific analytic plugins
US20250272601A1 (en) Artificial intelligence training using accessibility data
US12210400B2 (en) Techniques for performing fault tolerance validation for a data center
US20250298650A1 (en) Watchdog daemons for self-recovering hypervisors
US20250355780A1 (en) Large scale event fault simulator
US20140040447A1 (en) Management system and program product

Legal Events

Date Code Title Description
2025-05-21 A521 Request for written amendment filed Free format text: JAPANESE INTERMEDIATE CODE: A523
2025-05-21 A621 Written request for application examination Free format text: JAPANESE INTERMEDIATE CODE: A621