CN111869163A - Fault detection method, apparatus, and system - Google Patents

Fault detection method, apparatus, and system

Info

Publication number
CN111869163A
Authority
CN
China
Prior art keywords
node
nodes
delay data
heartbeat
evaluation values
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201880091411.2A
Other languages
English (en)
Other versions
CN111869163B (zh)
Inventor
樊航宇
李勇
方首朔
侯杰
林程勇
何成成
董雯霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Huawei Technologies Co Ltd
Original Assignee
Tsinghua University
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-03-19
Filing date
2018-03-19
Publication date
2020-10-30
Application filed by Tsinghua University, Huawei Technologies Co Ltd
Publication of CN111869163A
Application granted
Publication of CN111869163B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757 Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00 Arrangements for detecting or preventing errors in the information received
    • H04L1/20 Arrangements for detecting or preventing errors in the information received using signal quality detector
    • H04L1/205 Arrangements for detecting or preventing errors in the information received using signal quality detector jitter monitoring
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0852 Delays
    • H04L43/0864 Round trip delays
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/805 Real-time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/81 Threshold
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0823 Errors, e.g. transmission errors
    • H04L43/0829 Packet loss
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0852 Delays
    • H04L43/087 Jitter

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Theoretical Computer Science (AREA)
  • Environmental & Geological Engineering (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

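A fault detection method applied to a distributed node cluster that comprises a plurality of nodes. The method is performed by any one of the plurality of nodes, referred to as the first node, and comprises: the first node determines whether a health evaluation trigger condition is met; when the trigger condition is met, the first node evaluates the health of each of the other nodes in the node cluster based on the heartbeat delay data between the first node and those other nodes, and obtains health evaluation results for the other nodes in the cluster.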
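The flow in the abstract (check a trigger condition, then score every other node from heartbeat delay data) can be illustrated with a minimal Python sketch. This is not the patented method: the abstract does not state the trigger condition or the scoring rule, so the periodic trigger, the linear score, the thresholds, and all names below (`trigger_met`, `evaluate_peers`, `healthy_below_ms`, `timeout_ms`) are assumptions made only for illustration.

```python
from statistics import mean


def trigger_met(now, last_eval, interval=5.0):
    """Hypothetical trigger condition: evaluate once every `interval` seconds."""
    return now - last_eval >= interval


def evaluate_peers(delays_ms, healthy_below_ms=50.0, timeout_ms=500.0):
    """Score every other node from the first node's heartbeat delay data.

    delays_ms maps a peer id to its recent heartbeat delays in milliseconds;
    None marks a heartbeat that never arrived. Returns peer id -> (score, label),
    where score is an assumed evaluation value in [0, 1].
    """
    results = {}
    for peer, samples in delays_ms.items():
        if not samples:
            results[peer] = (0.0, "unknown")
            continue
        # Assumption: a missed heartbeat counts as a delay equal to the timeout,
        # and any delay above the timeout is clamped to it.
        filled = [timeout_ms if s is None else min(s, timeout_ms) for s in samples]
        avg = mean(filled)
        # Linear score: 1.0 at or below healthy_below_ms, 0.0 at timeout_ms.
        excess = max(avg, healthy_below_ms) - healthy_below_ms
        score = 1.0 - excess / (timeout_ms - healthy_below_ms)
        if score >= 0.8:
            label = "healthy"
        elif score > 0.0:
            label = "suspect"
        else:
            label = "faulty"
        results[peer] = (round(score, 3), label)
    return results


if __name__ == "__main__":
    heartbeat_delays = {"n2": [10, 20, 30], "n3": [None, None, None], "n4": [200, 350]}
    if trigger_met(now=10.0, last_eval=4.0):
        for peer, (score, label) in evaluate_peers(heartbeat_delays).items():
            print(peer, score, label)
```

Because each node runs the same evaluation against its own delay measurements, no central monitor is needed in this sketch; how the per-node results are combined or acted on is outside what the abstract describes.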

Description

PCT national-phase application; the description has been published.

Claims (34)

  1. PCT national-phase application; the claims have been published.
CN201880091411.2A 2018-03-19 2018-03-19 Fault detection method, apparatus, and system Active CN111869163B (zh)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/079422 WO2019178714A1 (zh) 2018-03-19 2018-03-19 Fault detection method, apparatus, and system

Publications (2)

Publication Number Publication Date
CN111869163A (zh) 2020-10-30
CN111869163B (zh) 2022-05-24

Family

ID=67988268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880091411.2A Active CN111869163B (zh) 2018-03-19 2018-03-19 Fault detection method, apparatus, and system

Country Status (4)

Country Link
US (1) US20210006484A1 (zh)
EP (1) EP3761559A4 (zh)
CN (1) CN111869163B (zh)
WO (1) WO2019178714A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
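CN113312234A (zh) * 2021-05-18 2021-08-27 福建天泉教育科技有限公司 Optimization method and terminal for health detection
CN114285602A (zh) * 2021-11-26 2022-04-05 成都安恒信息技术有限公司 Distributed service security detection method
CN115225775A (zh) * 2022-09-19 2022-10-21 苏州华兴源创科技股份有限公司 Multi-channel delay correction method and apparatus, and computer device
CN115348157A (zh) * 2021-05-14 2022-11-15 中国移动通信集团浙江有限公司 Fault locating method, apparatus, device, and storage medium for a distributed storage cluster
CN115550144A (zh) * 2022-11-30 2022-12-30 季华实验室 Distributed faulty node prediction method and apparatus, electronic device, and storage medium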

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IT201900010362A1 (it) * 2019-06-28 2020-12-28 Telecom Italia Spa Enabling round-trip packet loss measurement in a packet-switched communication network
CN111556345B (zh) * 2020-03-19 2023-08-29 视联动力信息技术股份有限公司 Network quality detection method and apparatus, electronic device, and storage medium
US11811641B1 (en) * 2020-03-20 2023-11-07 Juniper Networks, Inc. Secure network topology
WO2022085260A1 (ja) * 2020-10-22 2022-04-28 パナソニックIpマネジメント株式会社 Anomaly detection device, anomaly detection method, and program
US11584382B2 (en) * 2021-02-12 2023-02-21 Fca Us Llc System and method for malfunction operation machine stability determination
CN112988463B (zh) * 2021-02-23 2022-08-30 新华三大数据技术有限公司 Faulty node isolation method and apparatus
CN112804113A (zh) * 2021-04-15 2021-05-14 北京全路通信信号研究设计院集团有限公司 Fault determination method and system
CN113760592B (zh) * 2021-07-30 2024-02-27 郑州云海信息技术有限公司 Node kernel detection method and related apparatus
CN116127149B (zh) * 2023-04-14 2023-07-04 杭州悦数科技有限公司 Method and system for quantifying the health of a graph database cluster

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012173996A (ja) * 2011-02-22 2012-09-10 Nec Corp Cluster system, cluster management method, and cluster management program
CN103023716A (zh) * 2012-11-26 2013-04-03 中怡(苏州)科技有限公司 Network quality monitoring system and monitoring method with zero traffic consumption
US20140297845A1 (en) * 2013-03-29 2014-10-02 Fujitsu Limited Information processing system, computer-readable recording medium having stored therein control program for information processing device, and control method of information processing system
WO2017008698A1 (zh) * 2015-07-10 2017-01-19 努比亚技术有限公司 Multi-channel routing method and apparatus
CN106998302A (zh) * 2016-01-26 2017-08-01 华为技术有限公司 Service traffic distribution method and apparatus
CN107204879A (zh) * 2017-06-05 2017-09-26 浙江大学 Adaptive fault detection method for distributed systems based on an exponential moving average
US20170366436A1 (en) * 2016-06-16 2017-12-21 Hitachi, Ltd. Computer system and method of controlling computer system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
PL1627316T3 (pl) * 2003-05-27 2018-10-31 Vringo Infrastructure Inc. Data collection in a computer cluster
US7284147B2 (en) * 2003-08-27 2007-10-16 International Business Machines Corporation Reliable fault resolution in a cluster
CN101795234B (zh) * 2010-03-10 2012-02-01 北京航空航天大学 Streaming media transmission scheme based on an application-layer multicast algorithm
CN102355369B (zh) * 2011-09-27 2014-01-08 华为技术有限公司 Virtualized cluster system and processing method and device thereof

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
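CN115348157A (zh) * 2021-05-14 2022-11-15 中国移动通信集团浙江有限公司 Fault locating method, apparatus, device, and storage medium for a distributed storage cluster
CN115348157B (zh) * 2021-05-14 2023-09-05 中国移动通信集团浙江有限公司 Fault locating method, apparatus, device, and storage medium for a distributed storage cluster
CN113312234A (zh) * 2021-05-18 2021-08-27 福建天泉教育科技有限公司 Optimization method and terminal for health detection
CN114285602A (zh) * 2021-11-26 2022-04-05 成都安恒信息技术有限公司 Distributed service security detection method
CN114285602B (zh) * 2021-11-26 2024-02-02 成都安恒信息技术有限公司 Distributed service security detection method
CN115225775A (zh) * 2022-09-19 2022-10-21 苏州华兴源创科技股份有限公司 Multi-channel delay correction method and apparatus, and computer device
CN115225775B (zh) * 2022-09-19 2022-12-09 苏州华兴源创科技股份有限公司 Multi-channel delay correction method and apparatus, and computer device
CN115550144A (zh) * 2022-11-30 2022-12-30 季华实验室 Distributed faulty node prediction method and apparatus, electronic device, and storage medium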

Also Published As

Publication number Publication date
CN111869163B (zh) 2022-05-24
EP3761559A1 (en) 2021-01-06
US20210006484A1 (en) 2021-01-07
WO2019178714A1 (zh) 2019-09-26
EP3761559A4 (en) 2021-03-17

Similar Documents

Publication Publication Date Title
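CN111869163B (zh) Fault detection method, apparatus, and system
US10447561B2 (en) BFD method and apparatus
KR101881409B1 (ko) Multi-master selection in a software-defined network
CN1794651B (zh) System and method for problem resolution in a communication network
CN108076019B (zh) Abnormal traffic detection method and apparatus based on traffic mirroring
CN102404170B (zh) Packet loss detection method, apparatus, and system
WO2002046928A1 (en) Fault detection and prediction for management of computer networks
US9253029B2 (en) Communication monitor, occurrence prediction method, and recording medium
JP4857226B2 (ja) Fault monitoring device and fault monitoring method for a radio base station
CN106302001B (zh) Service fault detection method in a data communication network, related apparatus, and system
EP2432193A2 (en) Method of data replication in a distributed data storage system and corresponding device
WO2011154024A1 (en) Enhancing accuracy of service level agreements in ethernet networks
US8971871B2 (en) Radio base station, control apparatus, and abnormality detection method
US11652682B2 (en) Operations management apparatus, operations management system, and operations management method
CN113543246B (zh) Network switching method and device
US8788735B2 (en) Interrupt control apparatus, interrupt control system, interrupt control method, and interrupt control program
CN110475244B (zh) Terminal management method, system, and apparatus, terminal, and storage medium
CN114172796A (zh) Fault locating method for a communication network and related apparatus
CN110138657B (zh) Method, apparatus, device, and storage medium for aggregated link switching between switches
CN115242610A (zh) Link quality monitoring method and apparatus, electronic device, and computer-readable storage medium
US20160234344A1 (en) Message log removal apparatus and message log removal method
JP5937955B2 (ja) Packet transfer delay measurement device, method, and program
JP2021120827A (ja) Control system and control method
JP2021120827A5 (zh)
CN111200520A (zh) Network monitoring method, server, and computer-readable storage medium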

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant