WO2018233630A1 - Fault discovery - Google Patents

Fault discovery

Publication number
WO2018233630A1
WO2018233630A1 PCT/CN2018/091997 CN2018091997W WO2018233630A1 WO 2018233630 A1 WO2018233630 A1 WO 2018233630A1 CN 2018091997 W CN2018091997 W CN 2018091997W WO 2018233630 A1 WO2018233630 A1 WO 2018233630A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
host
configuration file
name
type
Prior art date
Application number
PCT/CN2018/091997
Other languages
French (fr)
Chinese (zh)
Inventor
黄雷
洪福成
Original Assignee
新华三大数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 新华三大数据技术有限公司
Publication of WO2018233630A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery

Definitions

  • Big data, also known as massive data, has the following characteristics: large data volume (for example, more than 10 terabytes, usually in large data sets); wide variety (data comes from multiple data sources in rich types and formats, such as structured, semi-structured, and unstructured data); fast processing (data can be processed in real time even at very large volumes); and high veracity (with the rise of social data, enterprise content, transaction data, and application data, effective information is needed to ensure the authenticity and security of the data).
  • big data brings convenience to users, and it also poses new challenges to operation and maintenance management.
  • a large number of hosts need to be deployed in a big data cluster. How to efficiently and conveniently discover the faults of these hosts becomes a problem of operation and maintenance management.
  • FIG. 1 is a schematic diagram of an application scenario in an embodiment of the present disclosure
  • FIG. 2 is a flow chart of a fault finding method in an embodiment of the present disclosure
  • FIG. 3 is a functional block diagram of a fault finding apparatus in an embodiment of the present disclosure.
  • FIG. 4 is a hardware configuration diagram of a fault finding apparatus in an embodiment of the present disclosure.
  • a fault discovery method is proposed, which may be applied to a big data cluster (also referred to as a big data system), and the big data cluster may include multiple hosts for processing big data services. Each host deploys a service component and processes big data services through the service component.
  • For example, the big data cluster includes host 11, host 12, and host 13; in an actual deployment the number of hosts is much larger.
  • each host can deploy service components for handling big data services, and the service components of different hosts can be the same or different.
  • Host 11 deploys the NameNode component of the HDFS (Hadoop Distributed File System) service. Based on the NameNode component, host 11 can implement the following big data services: managing data block mappings, processing client read and write requests, configuring replica policies, managing the HDFS namespace, and so on.
  • the host 12 deploys a DataNode component of the HDFS service. Based on the DataNode component, the host 12 can implement a big data service: storing a data block of the client, performing a data block read and write operation, and periodically sending heartbeat information to the NameNode.
  • A host can also deploy the split component, the sort component, the composite component, and so on of the MapReduce service, or the resource manager component, the application management component, and so on of the YARN (Yet Another Resource Negotiator) service. The service components are not limited to these.
  • a fault finding device is also provided.
  • the fault discovery device can be deployed on any host in a big data cluster or on any device outside the big data cluster.
  • the fault finding device communicates with the host in the big data cluster to enable the fault finding device to perform fault finding and fault recovery on the host.
  • A plurality of configuration files may be pre-stored on the fault discovery device itself or on any other device accessible to it. Each configuration file may include, but is not limited to, one or any combination of the following: identifier (ID), file name, description information, cluster name, service name, component name, fault type, and alarm mode.
  • The configuration file can be created manually by the user, or generated by machine learning from, for example, the past fault history of the various service components. For example, if the fault discovery device learns that fault B has occurred N times for service component A, it may automatically add a configuration file that includes the service name and component name of service component A and the fault type of fault B.
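The automatic generation described above can be illustrated with a very small sketch. The source only says "machine learning"; the frequency count below is a stand-in chosen for illustration, and the function name `learn_config_files`, the tuple layout of the history, and the dictionary keys are all assumptions rather than anything specified in the disclosure.

```python
from collections import Counter


def learn_config_files(failure_history, threshold):
    """Return one config dict for each (service, component, fault type)
    combination that appears at least `threshold` times in the history.

    A plain frequency count stands in for the unspecified learning method.
    """
    counts = Counter(failure_history)
    return [
        {"service_name": service, "component_name": component, "fault_type": fault}
        for (service, component, fault), seen in counts.items()
        if seen >= threshold
    ]


# Hypothetical failure history: (service name, component name, fault type).
history = [
    ("HDFS", "NameNode", "PORT"),
    ("HDFS", "NameNode", "PORT"),
    ("HDFS", "DataNode", "METRICS"),
    ("HDFS", "NameNode", "PORT"),
]

learned = learn_config_files(history, threshold=3)
print(learned)
```

Here `threshold` plays the role of N: only the NameNode port fault reaches it, so only one configuration entry is produced.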
  • the identifier may be a unique identifier of the configuration file.
  • the identifier of the first configuration file is 1, and the configuration file may be referred to as configuration file 1, and the identifier of the second configuration file is 2, and the configuration file can be referred to as configuration file 2 later.
  • the file name is the name of the configuration file and can be selected according to actual needs.
  • the names of different configuration files may be the same or different.
  • the name of the configuration file may be Chinese, English, or other types of languages.
  • the language of the name is not limited.
  • For example, the name of configuration file 1 is Failure-finding_A and the name of configuration file 2 is Failure-finding_B.
  • the description information is a brief description of the configuration file, and can describe the function of the configuration file, the generation time of the configuration file, and the validity period of the configuration file.
  • the description information is not limited.
  • the cluster name is the name of the big data cluster.
  • the cluster name may be "crs".
  • the service name is a service name corresponding to a service component for processing a big data service, such as an HDFS service, a MapReduce service, and a YARN service.
  • the service name of the configuration file 1 is the HDFS service
  • the service name of the configuration file 2 is the HDFS service.
  • the component name is a component name corresponding to a service component for processing a big data service, such as a NameNode component, a DataNode component, a split component, a sort component, a composite component, a resource manager component, an application management component, and the like.
  • the component name of the configuration file 1 is the NameNode component
  • the component name of the configuration file 2 is the DataNode component.
  • the fault type may include but is not limited to one or any combination of the following: port type (PORT), network type (WEB), performance indicator type (METRICS), and custom type (CUSTOM).
  • the port type indicates whether the port of the host is faulty, such as whether the port is Down or not.
  • the network type indicates whether the network of the host is faulty, such as whether the network is reachable.
  • The performance indicator type indicates whether a performance indicator of the host is faulty, such as whether the CPU usage or the memory usage reaches its threshold. The custom type is a fault type defined by the user, that is, the user can choose the fault type to be detected according to actual needs.
  • the alarm mode may include but is not limited to one or any combination of the following: WEB, EMAIL, SNMP (Simple Network Management Protocol), and the like.
  • The above configuration file may be a file in JSON (JavaScript Object Notation) format or in another format; the format is not limited.
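As a rough illustration of what such a JSON configuration file might look like, the sketch below builds configuration file 1 from the values given in the examples above (ID 1, name Failure-finding_A, cluster "crs", HDFS service, NameNode component). The key names, the description text, and the chosen fault type and alarm modes are assumptions; the disclosure lists the fields but not their JSON spelling.

```python
import json

# Hypothetical key names and description text; values mirror the running example.
config_file_1 = {
    "id": 1,
    "file_name": "Failure-finding_A",
    "description": "Port fault detection for the HDFS NameNode",
    "cluster_name": "crs",
    "service_name": "HDFS",
    "component_name": "NameNode",
    "fault_types": ["PORT"],           # one or more of PORT, WEB, METRICS, CUSTOM
    "alarm_modes": ["WEB", "EMAIL"],   # one or more of WEB, EMAIL, SNMP
}

# Serialize to JSON text and read it back to show the round trip.
serialized = json.dumps(config_file_1, indent=2)
restored = json.loads(serialized)
print(serialized)
```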
  • The fault discovery device can provide a RESTful API (Representational State Transfer Application Programming Interface) that allows a third party to create, modify, and delete configuration files.
  • the fault finding method of the embodiment of the present disclosure may include steps 201 to 203.
  • Step 201: The fault discovery device acquires a service name and a component name of a service component deployed on a host in the big data cluster.
  • In one example, the host obtains the service name and component name of the service component deployed on it and actively sends them to the fault discovery device, so that the fault discovery device obtains the service name and the component name.
  • In another example, the fault discovery device sends a request message to the host, the request message being used to request the service name and the component name.
  • After receiving the request message, the host sends the service name and component name of the service component deployed on it to the fault discovery device, so that the fault discovery device obtains the service name and the component name.
  • For example, the service name corresponding to the big data service handled by host 11 is the HDFS service and the component name is the NameNode component. Host 11 can send its service name (the HDFS service) and component name (the NameNode component) to the fault discovery device, which thereby learns that the service name of host 11 is the HDFS service and the component name is the NameNode component.
  • Likewise, the service name corresponding to the big data service handled by host 12 is the HDFS service and the component name is the DataNode component. Host 12 can send its service name (the HDFS service) and component name (the DataNode component) to the fault discovery device, which thereby learns that the service name of host 12 is the HDFS service and the component name is the DataNode component.
  • the step 201 may be set to be performed periodically, or may be set to be executed when a predetermined condition is met, or may be set to be executed in response to a user request, which is not limited in the present disclosure.
  • Step 202: The fault discovery device determines a target configuration file including the service name and the component name from a plurality of configuration files stored in advance.
  • The fault discovery device may query the plurality of pre-stored configuration files by using the service name and the component name corresponding to the host, and determine, from the plurality of configuration files, the target configuration file that includes the service name and the component name.
  • the fault finding device queries the plurality of configuration files by using the HDFS service and the NameNode component corresponding to the host 11, and can determine the configuration file 1 including the HDFS service and the NameNode component, that is, the configuration file 1 is the target configuration file.
  • the fault finding device queries the plurality of configuration files by using the HDFS service and the DataNode component of the host 12, and can determine the configuration file 2 including the HDFS service and the DataNode component, that is, the configuration file 2 is the target configuration file.
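The lookup in step 202 can be sketched as a simple scan over the stored configuration files. This is only an illustration: the function name `find_target_config` and the dictionary keys are hypothetical, and the fault types attached to the two files follow the later host 11 and host 12 examples.

```python
def find_target_config(config_files, service_name, component_name):
    """Return the first configuration file that contains both the given
    service name and component name, or None if there is no such file."""
    for config in config_files:
        if (config["service_name"] == service_name
                and config["component_name"] == component_name):
            return config
    return None


# Configuration files 1 and 2 from the running example (keys are assumed).
config_files = [
    {"id": 1, "service_name": "HDFS", "component_name": "NameNode", "fault_type": "PORT"},
    {"id": 2, "service_name": "HDFS", "component_name": "DataNode", "fault_type": "METRICS"},
]

target_for_host_11 = find_target_config(config_files, "HDFS", "NameNode")
target_for_host_12 = find_target_config(config_files, "HDFS", "DataNode")
print(target_for_host_11["id"], target_for_host_12["id"])
```

The fault type of the returned file is what step 203 would then send to the host.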
  • Step 203: The fault discovery device sends the fault type included in the target configuration file to the host, so that the host performs fault discovery according to the fault discovery policy corresponding to the fault type.
  • the host may perform the following steps A to C.
  • Step A: The host receives the fault type included in the target configuration file sent by the fault discovery device.
  • the fault finding device may transmit the fault type included in the configuration file 1 to the host 11, and the host receives the fault type included in the configuration file 1.
  • the fault finding device may transmit the fault type included in the configuration file 2 to the host 12, and the host receives the fault type included in the configuration file 2.
  • For example, the fault discovery device can generate a fault probing plan 1 that carries the fault type in configuration file 1.
  • The fault discovery device sends fault probing plan 1 to host 11, and after receiving it, host 11 can parse the fault type from it.
  • In addition to the fault type, fault probing plan 1 can carry other content of configuration file 1, such as the identifier, the file name, the description information, the cluster name, the service name, the component name, and the alarm mode; the content of fault probing plan 1 is not limited.
  • Similarly, the fault discovery device may generate a fault probing plan 2 that carries the fault type in configuration file 2 and send it to host 12; after receiving fault probing plan 2, host 12 can parse the fault type from it.
  • the fault finding device may periodically send the fault probing plan 1 / the fault probing plan 2, such as sending the fault probing plan 1 / the fault probing plan 2 every 10 seconds, and there is no restriction on the sending period.
  • Step B: The host queries a fault discovery policy corresponding to the fault type.
  • Step C: The host performs fault discovery according to the fault discovery policy.
  • For example, the correspondence between fault types and fault discovery policies, such as the correspondence between the port type and fault discovery policy 1 and the correspondence between the performance indicator type and fault discovery policy 2, can be configured on host 11. Assuming the fault type obtained by host 11 is the port type, host 11 can query fault discovery policy 1 corresponding to the port type and perform fault discovery according to it, that is, detect whether a port of host 11 is faulty, for example whether the port is DOWN.
  • Similarly, the correspondence between fault types and fault discovery policies, such as the correspondence between the port type and fault discovery policy 1 and the correspondence between the performance indicator type and fault discovery policy 3 (which differs from fault discovery policy 2 above), can be configured on host 12. Assuming the fault type obtained by host 12 is the performance indicator type, host 12 can query fault discovery policy 3 corresponding to the performance indicator type and perform fault discovery according to it, that is, detect whether a performance indicator of host 12 is faulty, for example whether the CPU usage or the memory usage reaches its threshold.
  • the content of the fault discovery strategy 1 is not limited as long as the host 11 can perform fault discovery according to the fault discovery policy 1, and the host 12 can perform fault discovery according to the fault discovery policy 1.
  • the fault discovery policy 1 includes configuration information for detecting whether a host port has a fault, a detection flow, and the like, and based on the content, it is possible to detect whether the host port has a fault.
  • the content of the fault finding policy 2 and the fault finding policy 3 are not limited, as long as the fault finding of the host can be performed according to the fault finding policy, and details are not described herein again.
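Steps B and C amount to a lookup from fault type to policy followed by running that policy. The sketch below shows one way this might look on a host. The disclosure deliberately leaves policy content open, so the two check functions, the 0.9 thresholds, the port numbers, and the shape of the `state` dictionary are all illustrative assumptions.

```python
def check_port(state):
    """Illustrative port policy: a fault exists if any monitored port is DOWN."""
    return any(status == "DOWN" for status in state["ports"].values())


def check_metrics(state, cpu_threshold=0.9, memory_threshold=0.9):
    """Illustrative metrics policy: a fault exists if CPU or memory usage
    reaches its (assumed) threshold."""
    return (state["cpu_usage"] >= cpu_threshold
            or state["memory_usage"] >= memory_threshold)


# Correspondence between fault types and fault discovery policies on a host.
POLICIES = {"PORT": check_port, "METRICS": check_metrics}


def discover_fault(fault_type, state):
    policy = POLICIES[fault_type]   # step B: query the policy for the fault type
    return policy(state)            # step C: perform fault discovery with it


# Hypothetical snapshots of host state; port numbers are arbitrary.
host_11_state = {"ports": {8020: "DOWN"}, "cpu_usage": 0.35, "memory_usage": 0.50}
host_12_state = {"ports": {50010: "UP"}, "cpu_usage": 0.95, "memory_usage": 0.40}

print(discover_fault("PORT", host_11_state))
print(discover_fault("METRICS", host_12_state))
```

Keeping the mapping in a dictionary means different hosts can bind the same fault type to different policies, as host 11 and host 12 do for the performance indicator type.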
  • fault recovery steps D to F may also be involved:
  • Step D: Processing performed by the host when it finds that a fault has occurred.
  • the host may send a fault message to the fault finding device, the fault message is used to notify the host that the fault has occurred, and the fault message may carry the fault feature and the fault type.
  • the above fault features may include, but are not limited to, one or any combination of the following: hardware features, system features, service component features, and operational log features.
  • the hardware features may be: a CPU feature of the host (such as CPU usage), a memory feature (such as memory usage), and a disk feature (such as disk usage), and the hardware features are not limited.
  • the system features can be: operating system type (such as Windows, Linux, etc.), operating system version, etc., and there are no restrictions on this system feature.
  • The service component features may be features related to the service component, such as whether the port of the service component is enabled, whether the service component is in a running state, whether the network state of the service component is abnormal, and whether the service component can process requests; the service component features are not limited.
  • The running log features may be features extracted from the running log, such as the running time of the host, the programs running on the host, and the network behavior of the host.
  • the above process only gives a few examples of fault characteristics, and the fault features are not limited, and all fault-related features are within the scope of the present disclosure.
  • For example, when host 11 performs fault detection according to fault discovery policy 1 corresponding to the port type and finds that it has failed, it determines that the fault type corresponding to the fault is the port type and obtains, from its current state, the fault features corresponding to the fault, such as the current CPU, memory, and disk features of host 11, the operating system type and operating system version of host 11, the features related to the service component, and the running log features extracted from the running log of host 11.
  • Similarly, when host 12 performs fault detection according to fault discovery policy 3 corresponding to the performance indicator type and finds that it has failed, it determines that the fault type corresponding to the fault is the performance indicator type and obtains the fault features corresponding to the fault from the current state of host 12.
  • Step E: Processing performed by the fault discovery device when it finds that the host has failed.
  • When the fault discovery device finds that the host has failed, it may handle the failure in one of the following three manners.
  • the fault discovery device After receiving the fault message sent by the host, the fault discovery device sends an alarm message according to the alarm mode included in the target configuration file, where the alarm message may carry the service name and component name included in the target configuration file, and information about the host (such as the IP address of the host, the identity of the host, etc.).
  • the content of the alarm message is not limited to the foregoing content.
  • the alarm message may also carry the identifier, the file name, the description information, the cluster name, and the like included in the target configuration file, and the content is not limited.
  • the alarm mode included in the configuration file may be one or more of WEB, EMAIL, and SNMP. Therefore, the fault discovery device may send an alarm message by using an alarm manner included in the target configuration file.
  • For example, after the fault discovery device sends the fault type included in configuration file 1 to host 11, if it receives a fault message sent by host 11, it sends an alarm message according to the alarm mode included in configuration file 1; the alarm message carries the service name included in configuration file 1.
  • the fault finding device may also display the content of the service name, the component name, and the host information included in the target configuration file on the WEB page.
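The alarm handling described above can be sketched as building one message per configured alarm mode. The function name `build_alarm_messages`, the message layout, the dictionary keys, and the example IP address are assumptions made for illustration; actual delivery over WEB, EMAIL, or SNMP is left out.

```python
def build_alarm_messages(target_config, host_info):
    """Build one alarm message per alarm mode in the target configuration file.

    Each message carries the service name and component name from the
    configuration file plus information about the faulty host.
    """
    payload = {
        "service_name": target_config["service_name"],
        "component_name": target_config["component_name"],
        "host": host_info,
    }
    return [{"mode": mode, "payload": payload} for mode in target_config["alarm_modes"]]


# Hypothetical configuration file 1 and host 11 details.
config_file_1 = {
    "service_name": "HDFS",
    "component_name": "NameNode",
    "alarm_modes": ["WEB", "EMAIL"],
}
host_11_info = {"host_id": "host-11", "ip": "192.0.2.11"}

alarms = build_alarm_messages(config_file_1, host_11_info)
for alarm in alarms:
    print(alarm["mode"], alarm["payload"]["component_name"])
```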
  • After receiving the fault message sent by the host, the fault discovery device parses the fault feature and the fault type from the fault message and uses them to query the feature library. If the feature library contains a fault recovery policy that matches the fault feature and the fault type, the fault discovery device sends that fault recovery policy to the host. If no fault recovery policy matches the fault feature and the fault type, the user is prompted to recover the host from the fault.
  • the feature library may be located in the fault finding device local device or in any other device accessible by the fault finding device.
  • the fault discovery device can establish a feature library for recording fault characteristics, fault types, and fault recovery strategies in association with each other.
  • A fault recovery strategy can be understood as follows: when a fault of the given fault type exhibits the given fault feature, the fault recovery strategy can be used to recover from the fault.
  • For example, the feature library may record fault feature A, fault type A, and fault recovery strategy A in association with one another, record fault feature B, fault type B, and fault recovery strategy B in association with one another, and so on. When a fault of fault type A exhibits fault feature A, fault recovery strategy A can be used to recover from the fault.
  • For example, after the fault discovery device parses fault feature A and fault type A from the fault message, since fault recovery strategy A matching fault feature A and fault type A exists in the feature library, the fault discovery device sends fault recovery strategy A to the host. For another example, after the fault discovery device parses fault feature C and fault type C from the fault message, since no fault recovery strategy matching fault feature C and fault type C exists in the feature library, the fault discovery device prompts the user to recover the host from the fault.
  • the fault finding device may also acquire a fault recovery policy used by the user to recover the fault from the host, and record the obtained fault recovery policy in association with the fault feature and the fault type.
  • In this way, the content of the feature library is continuously updated.
  • the fault finding device prompts the user to recover the fault of the host, and assumes that the user uses the fault recovery policy C to recover the fault of the host. After the recovery is completed, the host can send the failure recovery policy C to the fault discovery device. After the fault finding device obtains the fault recovery policy C used by the user to recover the fault from the host, the fault recovery policy C is recorded in the feature database in association with the fault feature C and the fault type C.
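The feature library behavior described above (look up a strategy, fall back to prompting the user, then record what the user did) can be sketched as follows. The class and function names are hypothetical, and exact equality on the (fault feature, fault type) pair is an assumption; the disclosure only says the strategy "matches" and does not define the matching rule.

```python
class FeatureLibrary:
    """Records fault features, fault types, and recovery strategies in association."""

    def __init__(self):
        self._strategies = {}

    def lookup(self, fault_feature, fault_type):
        # Assumed matching rule: exact match on the (feature, type) pair.
        return self._strategies.get((fault_feature, fault_type))

    def record(self, fault_feature, fault_type, strategy):
        self._strategies[(fault_feature, fault_type)] = strategy


def handle_fault_message(library, fault_message):
    """Decide what the fault discovery device does with a fault message."""
    strategy = library.lookup(fault_message["fault_feature"], fault_message["fault_type"])
    if strategy is not None:
        return ("send_to_host", strategy)
    return ("prompt_user", None)


library = FeatureLibrary()
library.record("feature A", "type A", "strategy A")

known = handle_fault_message(library, {"fault_feature": "feature A", "fault_type": "type A"})
unknown = handle_fault_message(library, {"fault_feature": "feature C", "fault_type": "type C"})

# After the user recovers the host with strategy C, it is recorded for next time.
library.record("feature C", "type C", "strategy C")
learned = handle_fault_message(library, {"fault_feature": "feature C", "fault_type": "type C"})
print(known, unknown, learned)
```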
  • the fault discovery device sends the fault recovery policy to the host, and the host may perform the following steps:
  • Step F: Fault recovery processing performed by the host when it receives the fault recovery policy.
  • the host may receive a failure recovery policy sent by the fault discovery device and perform a failure recovery on the current failure of the host according to the failure recovery policy.
  • Because the host sends the fault type and the fault feature corresponding to the fault to the fault discovery device, the fault recovery policy that the fault discovery device returns to the host is the one for that fault feature and fault type. This fault recovery policy can therefore recover a fault that matches the fault feature and the fault type, that is, it can recover the current fault of the host.
  • The fault recovery strategy may include configuration information for fault recovery, a recovery process, recovery tools (for example, for deleting files, changing configurations, releasing resources, remounting, or restarting), and the like, based on which the fault can be recovered; details are not repeated here.
  • With the above technical solution, host faults can be discovered automatically, efficiently, and conveniently, which realizes automatic discovery of host faults in a big data cluster and addresses problems such as the high operation and maintenance complexity of big data clusters and the difficulty of finding faults.
  • the device includes:
  • the obtaining module 301 is configured to obtain a service name and a component name of a service component deployed on a host in the big data cluster;
  • a determining module 302 configured to determine, from a plurality of pre-stored configuration files, a target configuration file including the service name and the component name, where the configuration file includes a service name, a component name, and a fault type stored in association with each other ;
  • the sending module 303 is configured to send the fault type included in the target configuration file to the host, so that the host performs fault discovery according to a fault discovery policy corresponding to the fault type.
  • the fault finding apparatus further includes a receiving module (not shown in the figure) for receiving a fault message sent by the host, where the fault message is used to notify the host that a fault has occurred.
  • The sending module 303 is further configured to send, after the fault message sent by the host is received, an alarm message according to the alarm mode included in the target configuration file, where the alarm message includes the service name and component name included in the target configuration file and the information of the host.
  • the fault message carries a fault feature and a fault type;
  • the fault discovery device further includes a retrieval module (not shown in the figure) for retrieving a fault recovery strategy that matches the fault feature and the fault type.
  • the sending module 303 is further configured to: if the fault recovery policy is retrieved, send the fault recovery policy to the host, so that the host performs fault recovery according to the fault recovery policy.
  • The fault discovery device further includes a prompting module (not shown in the figure) for prompting the user to recover the host from the fault if the fault discovery device does not retrieve the fault recovery policy.
  • The fault discovery device further includes a recording module (not shown in the figure) configured to acquire the fault recovery policy used by the user to recover the host, and to record the acquired fault recovery policy in association with the fault feature and the fault type.
  • The hardware architecture of the fault discovery device provided by the embodiment of the present disclosure may be as shown in FIG. 4.
  • The fault discovery device can include a processor and a machine-readable storage medium storing machine-executable instructions. The processor can communicate with the machine-readable storage medium and, by reading and executing the machine-executable instructions stored in it, perform the fault discovery method described above.
  • a machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and so forth.
  • The machine-readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid state drive, any type of storage disc (such as an optical disc or DVD), a similar storage medium, or a combination thereof.
  • the system, device, module or unit illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function.
  • A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver, or a game console.
  • embodiments of the present disclosure can be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware aspects. Moreover, embodiments of the present disclosure may take the form of a computer program product embodied on one or more computer usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A fault discovery device obtains the service name and component name of a service component deployed on a host in a big data cluster; the fault discovery device determines a target configuration file comprising the service name and the component name from a plurality of pre-stored configuration files, wherein the configuration file comprises a service name, component name and fault type that are stored in association with each other; and the fault discovery device sends the fault type comprised in the target configuration file to the host so that the host carries out fault discovery according to a fault discovery policy corresponding to the fault type.

Description

Fault Discovery
Cross-Reference to Related Applications
The present disclosure is based on and claims priority to Chinese Patent Application No. 201710474280.3, filed on June 21, 2017, the entire contents of which are incorporated herein by reference.
Background
Big data, also referred to as massive data, has the following characteristics: large data volume, for example data exceeding 10 TB in scale, typically in the form of large data sets; wide data variety, with data coming from multiple sources in rich types and formats, such as structured, semi-structured, and unstructured data; fast data processing, so that data can be processed in real time even when the volume is huge; and high data veracity, because with the rise of social data, enterprise content, transaction data, and application data, effective information is needed to ensure the authenticity and security of the data.
With the arrival of the big data era, big data brings convenience to users but also poses new challenges for operation and maintenance management. For example, to implement big data functions, a large number of hosts must be deployed in a big data cluster, and discovering faults on these hosts efficiently and conveniently becomes a difficult operation and maintenance problem.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an application scenario in an embodiment of the present disclosure;
FIG. 2 is a flowchart of a fault discovery method in an embodiment of the present disclosure;
FIG. 3 is a functional block diagram of a fault discovery device in an embodiment of the present disclosure;
FIG. 4 is a hardware structure diagram of a fault discovery device in an embodiment of the present disclosure.
Detailed Description
The terms used in the embodiments of the present disclosure are for the purpose of describing particular embodiments only and are not intended to limit the present disclosure. The singular forms "a", "said", and "the" used in the present disclosure and the claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items. In addition, depending on the context, the word "if" may be interpreted as "at the time of", "when", or "in response to determining".
An embodiment of the present disclosure proposes a fault discovery method that may be applied to a big data cluster (also called a big data system). The big data cluster may include multiple hosts for processing big data services. Each host deploys a service component and processes big data services through that service component.
Referring to FIG. 1, the big data cluster includes host 11, host 12, and host 13; in practice the number of hosts is larger. Each host may deploy a service component for processing big data services, and the service components of different hosts may be the same or different.
For example, host 11 deploys the NameNode component of the HDFS (Hadoop Distributed File System) service. Based on this NameNode component, host 11 can implement the following big data services: managing data block mappings, handling client read and write requests, configuring replica policies, managing the HDFS namespace, and so on. As another example, host 12 deploys the DataNode component of the HDFS service. Based on this DataNode component, host 12 can implement the following big data services: storing client data blocks, performing data block read and write operations, and periodically sending heartbeat information to the NameNode.
Of course, the above are only a few examples of service components, and practical applications are not limited to them. For example, a host may deploy the splitting component, sorting component, or combining component of the MapReduce service, or the resource manager component or application manager component of the YARN (Yet Another Resource Negotiator) service. The service component is not limited here.
In the embodiments of the present disclosure, a fault discovery device is also provided. The fault discovery device may be deployed on any host in the big data cluster or on any device outside the big data cluster. The fault discovery device communicates with the hosts in the big data cluster so that it can perform fault discovery and fault recovery for those hosts.
In the embodiments of the present disclosure, multiple configuration files may be pre-stored on the fault discovery device itself or on any other device that the fault discovery device can access. Each configuration file may include, but is not limited to, one or any combination of the following: identifier, file name, description information, cluster name, service name, component name, fault type, and alarm mode. A configuration file may be created manually by a user, or generated by, for example, machine learning over the fault history of various service components. For example, if the fault discovery device learns that fault B has occurred N times for service component A, it may automatically add a configuration file that includes the service name and component name of service component A and the fault type of fault B.
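The automatic generation rule described above (fault B observed N times for service component A) can be sketched as follows. This is only an illustration: plain frequency counting stands in for whatever learning method an implementation would use, and the threshold value, the function name `generate_configs`, and the dictionary keys are all assumptions not taken from the disclosure.

```python
from collections import Counter

# Hypothetical value of N: how many occurrences trigger a new configuration file.
N_THRESHOLD = 3


def generate_configs(fault_history, existing_configs, threshold=N_THRESHOLD):
    """Return new configuration entries for (service, component, fault type)
    triples that occurred at least `threshold` times and are not yet configured."""
    counts = Counter(fault_history)
    known = {
        (c["service_name"], c["component_name"], c["fault_type"])
        for c in existing_configs
    }
    next_id = max((c["id"] for c in existing_configs), default=0) + 1
    new_configs = []
    for key in sorted(counts):
        if counts[key] >= threshold and key not in known:
            service, component, fault_type = key
            new_configs.append({
                "id": next_id,
                "service_name": service,
                "component_name": component,
                "fault_type": fault_type,
            })
            next_id += 1
    return new_configs


# Three PORT faults on the HDFS NameNode, one METRICS fault on the DataNode.
history = [("HDFS", "NameNode", "PORT")] * 3 + [("HDFS", "DataNode", "METRICS")]
new_configs = generate_configs(history, existing_configs=[])
```

Here only the NameNode triple reaches the assumed threshold, so a single configuration entry is produced.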
The identifier may be the unique identifier of a configuration file. For example, suppose there are two configuration files: the first has identifier 1 and is hereinafter called configuration file 1, and the second has identifier 2 and is hereinafter called configuration file 2.
The file name is the name of the configuration file and may be chosen as needed. Different configuration files may have the same name or different names, and the name may be in Chinese, English, or another language; the language of the name is not limited. For example, the name of configuration file 1 is Failure-finding_A and the name of configuration file 2 is Failure-finding_B.
The description information is a brief description of the configuration file and may state, for example, the function of the configuration file, its generation time, and its validity period. The description information is not limited here.
The cluster name is the name of the big data cluster. For example, the cluster name of the big data cluster composed of host 11, host 12, and host 13 may be "crs".
The service name is the name of the service to which a service component for processing big data services belongs, such as the HDFS service, the MapReduce service, or the YARN service. In the following, the service name in configuration file 1 and the service name in configuration file 2 are both taken to be the HDFS service.
The component name is the name of a service component for processing big data services, such as the NameNode component, DataNode component, splitting component, sorting component, combining component, resource manager component, or application manager component. In the following, the component name in configuration file 1 is taken to be the NameNode component, and the component name in configuration file 2 is taken to be the DataNode component.
The fault type may include, but is not limited to, one or any combination of the following: port type (PORT), network type (WEB), performance indicator type (METRICS), and custom type (CUSTOM). The port type indicates checking whether a port of the host is faulty, for example whether the port is down. The network type indicates checking whether the network of the host is faulty, for example whether the host is connected and whether the network is reachable. The performance indicator type indicates checking whether a performance indicator of the host is faulty, for example whether CPU usage or memory usage has reached a threshold. The custom type is a fault type that users may define freely, that is, users may choose the fault types to be checked according to their actual needs.
The alarm mode may include, but is not limited to, one or any combination of the following: WEB, EMAIL, and SNMP (Simple Network Management Protocol).
In one example, the configuration file may be a file in JSON (JavaScript Object Notation) format, or it may be in another format; this is not limited.
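To make the fields listed above concrete, the following sketch shows what configuration file 1 might look like when serialized as JSON. The disclosure names the fields but not their keys, so every key name here is an assumption, as are the description text and the choice of alarm modes; the file name, cluster name, service name, and component name follow the running example.

```python
import json

# Hypothetical JSON layout for configuration file 1; key names are assumptions.
config_1 = {
    "id": 1,
    "file_name": "Failure-finding_A",
    "description": "Port check for the HDFS NameNode component",
    "cluster_name": "crs",
    "service_name": "HDFS",
    "component_name": "NameNode",
    "fault_type": "PORT",             # one of PORT, WEB, METRICS, CUSTOM
    "alarm_modes": ["WEB", "EMAIL"],  # any of WEB, EMAIL, SNMP
}

# Serialize to JSON text and read it back, as a stored file would be.
config_text = json.dumps(config_1, indent=2)
restored = json.loads(config_text)
```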
In one example, the fault discovery device may provide a RESTful API (Representational State Transfer Application Programming Interface) that allows third parties to create, modify, and delete configuration files.
Based on the above application scenario, as shown in FIG. 2, the fault discovery method of the embodiments of the present disclosure may include steps 201 to 203.
Step 201: the fault discovery device obtains the service name and component name of the service component deployed on a host in the big data cluster.
In one example, the host may obtain the service name and component name of the service component deployed on it and proactively send them to the fault discovery device, which thereby obtains the service name and the component name. In another example, when the fault discovery device needs to perform fault discovery on a host, it may send the host a request message that requests the service name and component name. After receiving the request message, the host sends the service name and component name of its deployed service component to the fault discovery device, which thereby obtains them.
Because host 11 deploys the NameNode component of the HDFS service, the service name corresponding to the big data services handled by host 11 is the HDFS service and the component name is the NameNode component. Host 11 may send its service name (HDFS service) and component name (NameNode component) to the fault discovery device, which thus learns that the service name of host 11 is the HDFS service and its component name is the NameNode component. Likewise, because host 12 deploys the DataNode component of the HDFS service, the service name of the big data services handled by host 12 is the HDFS service and the component name is the DataNode component. Host 12 may send its service name (HDFS service) and component name (DataNode component) to the fault discovery device, which thus learns that the service name of host 12 is the HDFS service and its component name is the DataNode component.
Step 201 may be configured to run periodically, to run when a predetermined condition is met, or to run in response to a user request; the present disclosure places no limitation on this.
Step 202: the fault discovery device determines, from the multiple pre-stored configuration files, a target configuration file that includes the service name and the component name.
In one example, the fault discovery device may query the multiple pre-stored configuration files using the service name and component name corresponding to the host, and determine from them the target configuration file that includes that service name and that component name.
For example, the fault discovery device queries the configuration files using the HDFS service and NameNode component corresponding to host 11 and finds configuration file 1, which includes the HDFS service and the NameNode component; configuration file 1 is therefore the target configuration file. The fault discovery device queries the configuration files using the HDFS service and DataNode component corresponding to host 12 and finds configuration file 2, which includes the HDFS service and the DataNode component; configuration file 2 is therefore the target configuration file.
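The lookup in step 202 can be sketched as a simple match on both names. This is a minimal sketch: the function name `find_target_config` and the dictionary keys are hypothetical, and the fault types assigned to the two configuration files follow the later examples for host 11 and host 12.

```python
def find_target_config(configs, service_name, component_name):
    """Return the first configuration whose service name and component name
    both match, or None if no pre-stored configuration matches."""
    for config in configs:
        if (config["service_name"] == service_name
                and config["component_name"] == component_name):
            return config
    return None


# Pre-stored configuration files 1 and 2 from the running example.
configs = [
    {"id": 1, "service_name": "HDFS", "component_name": "NameNode", "fault_type": "PORT"},
    {"id": 2, "service_name": "HDFS", "component_name": "DataNode", "fault_type": "METRICS"},
]

target_for_host_11 = find_target_config(configs, "HDFS", "NameNode")
target_for_host_12 = find_target_config(configs, "HDFS", "DataNode")
```

Returning `None` for an unmatched pair is a design choice of this sketch; the disclosure does not say what happens when no configuration file matches.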
Step 203: the fault discovery device sends the fault type included in the target configuration file to the host, so that the host performs fault discovery according to the fault discovery policy corresponding to that fault type.
After receiving the fault type sent by the fault discovery device, the host may perform steps A to C below.
Step A: the host receives the fault type included in the target configuration file, as sent by the fault discovery device.
For example, the fault discovery device may send the fault type included in configuration file 1 to host 11, and host 11 receives that fault type.
As another example, the fault discovery device may send the fault type included in configuration file 2 to host 12, and host 12 receives that fault type.
In one example, the fault discovery device may generate fault probe plan 1, which carries the fault type in configuration file 1. The fault discovery device sends fault probe plan 1 to host 11, and after receiving it, host 11 parses the fault type out of fault probe plan 1. Besides the fault type, fault probe plan 1 may carry other content of configuration file 1, such as the identifier, file name, description information, cluster name, service name, component name, and alarm mode; the content of fault probe plan 1 is not limited. Similarly, the fault discovery device may generate fault probe plan 2, which carries the fault type in configuration file 2, and send it to host 12; after receiving fault probe plan 2, host 12 parses the fault type out of it.
In one example, the fault discovery device may send fault probe plan 1 or fault probe plan 2 periodically, for example once every 10 seconds; the sending period is not limited.
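The fault probe plan exchange described above can be sketched as a pair of functions, one on each side. The wire format is not specified in the disclosure, so JSON is assumed here, and the function names, key names, and the `PROBE_INTERVAL_SECONDS` constant are all hypothetical; no actual network transport or timer loop is shown.

```python
import json

# Example sending period from the text; the period itself is not limited.
PROBE_INTERVAL_SECONDS = 10

# Optional configuration fields a plan may carry besides the fault type.
OPTIONAL_FIELDS = ("id", "file_name", "description", "cluster_name",
                   "service_name", "component_name", "alarm_modes")


def build_probe_plan(config, include_extra=True):
    """Fault discovery device side: serialize a plan carrying the fault type
    and, optionally, the other fields of the configuration file."""
    plan = {"fault_type": config["fault_type"]}
    if include_extra:
        for key in OPTIONAL_FIELDS:
            if key in config:
                plan[key] = config[key]
    return json.dumps(plan)


def parse_fault_type(plan_text):
    """Host side: parse the fault type out of a received plan."""
    return json.loads(plan_text)["fault_type"]


config_1 = {"id": 1, "service_name": "HDFS", "component_name": "NameNode",
            "fault_type": "PORT", "alarm_modes": ["WEB"]}
plan_1 = build_probe_plan(config_1)
fault_type_on_host_11 = parse_fault_type(plan_1)
```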
Step B: the host looks up the fault discovery policy corresponding to the fault type.
Step C: the host performs fault discovery according to that fault discovery policy.
In one example, a correspondence between fault types and fault discovery policies may be configured on host 11, such as a correspondence between the port type and fault discovery policy 1 and a correspondence between the performance indicator type and fault discovery policy 2. If the fault type obtained by host 11 is the port type, host 11 looks up fault discovery policy 1, which corresponds to the port type, and performs fault discovery according to it, that is, it checks whether a port of host 11 is faulty, for example whether the port is down.
In another example, a correspondence between fault types and fault discovery policies may be configured on host 12, such as a correspondence between the port type and fault discovery policy 1 and a correspondence between the performance indicator type and fault discovery policy 3 (which differs from fault discovery policy 2 above). If the fault type obtained by host 12 is the performance indicator type, host 12 looks up fault discovery policy 3, which corresponds to the performance indicator type, and performs fault discovery according to it, that is, it checks whether a performance indicator of host 12 is faulty, for example whether CPU usage or memory usage has reached a threshold.
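Steps B and C amount to a per-host lookup table from fault type to policy. The sketch below models each policy as a function over a snapshot of host state; the snapshot layout, the threshold values, and all function names are assumptions for illustration, since the disclosure leaves the content of the policies open.

```python
def port_policy(state):
    """Fault discovery policy 1: a fault exists if any monitored port is not UP."""
    return any(status != "UP" for status in state["ports"].values())


def make_metrics_policy(cpu_threshold, memory_threshold):
    """Build a performance indicator policy with host-specific thresholds."""
    def policy(state):
        return (state["cpu_usage"] >= cpu_threshold
                or state["memory_usage"] >= memory_threshold)
    return policy


# Per-host correspondence between fault type and fault discovery policy.
# Hosts 11 and 12 share policy 1 but use different METRICS policies (2 and 3).
HOST_11_POLICIES = {"PORT": port_policy, "METRICS": make_metrics_policy(0.8, 0.8)}
HOST_12_POLICIES = {"PORT": port_policy, "METRICS": make_metrics_policy(0.9, 0.9)}


def discover_fault(policies, fault_type, state):
    """Step B: look up the policy for the fault type. Step C: run it."""
    policy = policies.get(fault_type)
    if policy is None:
        raise KeyError(f"no fault discovery policy configured for {fault_type!r}")
    return policy(state)


host_11_state = {"ports": {8020: "DOWN"}, "cpu_usage": 0.2, "memory_usage": 0.3}
host_12_state = {"ports": {9866: "UP"}, "cpu_usage": 0.95, "memory_usage": 0.4}
host_11_faulty = discover_fault(HOST_11_POLICIES, "PORT", host_11_state)
host_12_faulty = discover_fault(HOST_12_POLICIES, "METRICS", host_12_state)
```

Keeping the table on the host, as the text describes, lets two hosts react differently to the same fault type without any change on the fault discovery device.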
In one example, the content of fault discovery policy 1 is not limited, as long as host 11 and host 12 can each perform fault discovery according to it. For example, fault discovery policy 1 includes configuration information and a detection procedure for checking whether a host port is faulty, and on that basis the host can check whether its ports are faulty. Likewise, the content of fault discovery policy 2 and fault discovery policy 3 is not limited, as long as fault discovery can be performed on the host according to them; details are not repeated here.
After the host performs fault discovery according to the fault discovery policy, the following fault recovery steps D to F may also be involved:
Step D: processing performed by the host when it finds that a fault has occurred.
In one example, after performing fault discovery according to the fault discovery policy, if the host finds that it has failed, it determines the fault feature and fault type corresponding to the fault. The host may then send a fault message to the fault discovery device. The fault message notifies the fault discovery device that the host has failed, and it may carry the fault feature and the fault type.
The fault feature may include, but is not limited to, one or any combination of the following: hardware features, system features, service component features, and run log features. Hardware features may be the host's CPU features (such as CPU usage), memory features (such as memory usage), disk features (such as disk usage), and so on; the hardware features are not limited. System features may be the operating system type (such as Windows or Linux), the operating system version, and so on; the system features are not limited. Service component features are features related to the service component, such as whether the service component's port is open, whether the service component is running, whether the service component's network state is abnormal, and whether the service component can handle requests; the service component features are not limited. Run log features are features extracted from the run log, such as the host's uptime, the programs the host runs, and the host's network behavior; the run log features are not limited. Of course, these are only a few examples of fault features; the fault feature is not limited, and all fault-related features fall within the scope of protection of the present disclosure.
For example, suppose host 11 performs fault discovery according to fault discovery policy 1, which corresponds to the port type, and finds that host 11 has failed. Host 11 then determines that the fault type of the fault is the port type and obtains the fault features of the fault from its current state, such as its current CPU, memory, and disk features, its operating system type and version, features related to the service component, and run log features from its run log.
As another example, suppose host 12 performs fault discovery according to fault discovery policy 3, which corresponds to the performance indicator type, and finds that host 12 has failed. Host 12 then determines that the fault type of the fault is the performance indicator type and obtains the fault features of the fault from its current state.
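The fault message of step D can be sketched as a nested structure grouping the four feature categories. The message layout, key names, and function names below are assumptions, and the host state is passed in as a plain dictionary rather than measured; only the operating system type and version are read from the running machine through Python's standard `platform` module.

```python
import platform


def collect_fault_features(state):
    """Group the host's current state into the four feature categories."""
    return {
        "hardware": {
            "cpu_usage": state["cpu_usage"],
            "memory_usage": state["memory_usage"],
            "disk_usage": state["disk_usage"],
        },
        "system": {
            "os_type": platform.system(),
            "os_version": platform.release(),
        },
        "service_component": {
            "port_open": state["port_open"],
            "running": state["running"],
        },
        "run_log": {"uptime_seconds": state["uptime_seconds"]},
    }


def build_fault_message(host_id, fault_type, state):
    """Fault message carrying the fault type and the fault features."""
    return {
        "host": host_id,
        "fault_type": fault_type,
        "fault_features": collect_fault_features(state),
    }


# Illustrative snapshot for host 11 after a port fault was found.
host_11_state = {"cpu_usage": 0.42, "memory_usage": 0.61, "disk_usage": 0.70,
                 "port_open": False, "running": True, "uptime_seconds": 86400}
fault_message = build_fault_message("host-11", "PORT", host_11_state)
```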
Step E: processing performed by the fault discovery device when it learns that the host has failed. This processing may be carried out in one of the following three modes.
Mode 1: after receiving the fault message sent by the host, the fault discovery device sends an alarm message using the alarm mode included in the target configuration file. The alarm message may carry the service name and component name included in the target configuration file and information about the host (such as the host's IP address or identifier). Of course, the content carried by the alarm message is not limited to the above; for example, it may also carry the identifier, file name, description information, and cluster name included in the target configuration file. This is not limited.
The alarm mode included in the configuration file may be one or more of WEB, EMAIL, and SNMP, so the fault discovery device can send the alarm message using the alarm mode or modes included in the target configuration file.
For example, after the fault discovery device sends the fault type included in configuration file 1 to host 11, if it receives a fault message sent by host 11, it sends an alarm message using the alarm mode included in configuration file 1. The alarm message carries the service name and component name included in configuration file 1 and the information of host 11.
In one example, after receiving the fault message sent by the host, the fault discovery device may also display the service name and component name included in the target configuration file, together with the host information, on a web page.
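Mode 1 can be sketched as a dispatch over the alarm modes listed in the target configuration file. The sender functions here are stand-ins that only record the alarm in a list; real web, e-mail, or SNMP delivery is deliberately not shown, and the function names, key names, and the host address are illustrative assumptions.

```python
def build_alarm(config, host_info):
    """Alarm message carrying the service name, component name, and host info."""
    return {
        "service_name": config["service_name"],
        "component_name": config["component_name"],
        "host": host_info,
    }


def send_alarm(config, host_info, senders):
    """Send the alarm through every alarm mode listed in the configuration.
    Returns the modes that were actually used."""
    alarm = build_alarm(config, host_info)
    used = []
    for mode in config["alarm_modes"]:
        sender = senders.get(mode)
        if sender is not None:
            sender(alarm)
            used.append(mode)
    return used


# Stand-in senders that just collect alarms per mode.
delivered = {"WEB": [], "EMAIL": [], "SNMP": []}
senders = {mode: delivered[mode].append for mode in delivered}

config_1 = {"service_name": "HDFS", "component_name": "NameNode",
            "alarm_modes": ["WEB", "EMAIL"]}
used_modes = send_alarm(config_1, {"id": "host-11", "ip": "192.0.2.11"}, senders)
```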
Mode 2: after receiving the fault message sent by the host, the fault discovery device parses the fault feature and fault type out of the fault message. The fault discovery device then queries a feature library using the fault feature and the fault type. If the feature library contains a fault recovery policy that matches the fault feature and the fault type, the fault discovery device sends that fault recovery policy to the host; if the feature library contains no matching fault recovery policy, the fault discovery device prompts the user to recover the host from the fault. The feature library may be located on the fault discovery device itself or on any other device that the fault discovery device can access.
In one example, the fault discovery device may build the feature library, which records fault features, fault types, and fault recovery policies in association with one another. An entry can be understood as follows: when a fault of the given fault type has the given fault feature, the associated fault recovery policy can be used to recover from the fault. For example, the feature library may record fault feature A, fault type A, and fault recovery policy A in association, record fault feature B, fault type B, and fault recovery policy B in association, and so on. Thus, when a fault of fault type A has fault feature A, fault recovery policy A can be used to recover from it.
For example, after the fault discovery device parses fault feature A and fault type A out of a fault message, because the feature library contains fault recovery policy A matching fault feature A and fault type A, the fault discovery device sends fault recovery policy A to the host. As another example, after the fault discovery device parses fault feature C and fault type C out of a fault message, because the feature library contains no fault recovery policy matching fault feature C and fault type C, the fault discovery device prompts the user to recover the host from the fault.
After the user has recovered the host from the fault, the fault discovery device may also obtain the fault recovery policy that the user used, and record it in the feature library in association with the fault feature and the fault type, thereby continuously updating the content of the feature library.
For example, because the feature library contains no fault recovery policy matching fault feature C and fault type C, the fault discovery device prompts the user to recover the host from the fault. Suppose the user recovers the host using fault recovery policy C; after the recovery is complete, the host may send fault recovery policy C to the fault discovery device. After obtaining fault recovery policy C, the fault discovery device records it in the feature library in association with fault feature C and fault type C.
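The feature library behavior in Mode 2 (look up, fall back to the user, then learn the user's policy) can be sketched as a small keyed store. Exact matching on a (fault feature, fault type) pair is an assumption of this sketch, as are the class name, method names, and the string labels used for features and policies; the disclosure does not specify how matching is done.

```python
class FeatureLibrary:
    """Records fault feature, fault type, and recovery policy in association."""

    def __init__(self):
        self._entries = {}

    def lookup(self, fault_feature, fault_type):
        """Return the matching recovery policy, or None if there is none."""
        return self._entries.get((fault_feature, fault_type))

    def record(self, fault_feature, fault_type, recovery_policy):
        """Add or update an entry, e.g. after a manual recovery by the user."""
        self._entries[(fault_feature, fault_type)] = recovery_policy


def handle_fault_message(library, message):
    """Decide whether to send a recovery policy to the host or prompt the user."""
    policy = library.lookup(message["fault_feature"], message["fault_type"])
    if policy is not None:
        return ("send_policy_to_host", policy)
    return ("prompt_user", None)


library = FeatureLibrary()
library.record("feature A", "type A", "recovery policy A")

known = handle_fault_message(library, {"fault_feature": "feature A", "fault_type": "type A"})
unknown = handle_fault_message(library, {"fault_feature": "feature C", "fault_type": "type C"})

# After the user recovers the host with policy C, the library learns it.
library.record("feature C", "type C", "recovery policy C")
learned = handle_fault_message(library, {"fault_feature": "feature C", "fault_type": "type C"})
```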
Manner 3: after receiving the fault message, the fault discovery apparatus processes it using both Manner 1 and Manner 2.
For Manner 2 and Manner 3, if the feature library contains a fault recovery policy matching the fault feature and the fault type, then after the fault discovery apparatus sends that fault recovery policy to the host, the host may further perform the following step:
Step F: fault recovery processing performed when the host receives the fault recovery policy.
In one example, the host may receive the fault recovery policy sent by the fault discovery apparatus and recover from its current fault according to that policy. When a fault occurs on the host, the host may send the fault type and fault feature corresponding to the fault to the fault discovery apparatus, and the fault recovery policy that the fault discovery apparatus returns to the host is the policy for that fault feature and fault type. The policy can therefore recover faults matching that fault feature and fault type; in other words, it can recover the host from its current fault.
In one example, the content of the fault recovery policy is not limited, provided that the host can perform fault recovery according to it. For example, the fault recovery policy may include configuration information used for fault recovery, a recovery procedure, recovery tools (for example, deleting files, changing configuration, releasing resources, remounting, or restarting), and so on. The fault can be recovered based on this content, and details are not repeated here.
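Because the content of the fault recovery policy is left open, the following Python sketch shows only one possible representation, offered as an assumption: a policy is an ordered list of steps, each naming a recovery tool and a target, and the host dispatches every step to a registered handler. The names `apply_recovery_policy` and `make_recorder`, the `tool` and `target` keys, and the tool names are hypothetical. The handlers here merely record what they would do instead of touching the system.

```python
from typing import Callable, Dict, List


def make_recorder(name: str, log: List[str]) -> Callable[[str], None]:
    """Build a stand-in recovery tool that only records the requested action."""
    def handler(target: str) -> None:
        log.append(f"{name}:{target}")
    return handler


def apply_recovery_policy(policy: List[Dict[str, str]],
                          tools: Dict[str, Callable[[str], None]]) -> None:
    """Run the steps of a fault recovery policy in order on the host."""
    # Validate the whole policy first so that no step runs if any tool is unknown.
    for step in policy:
        if step["tool"] not in tools:
            raise ValueError(f"unknown recovery tool: {step['tool']}")
    for step in policy:
        tools[step["tool"]](step["target"])
```

A real host would replace the recording handlers with functions that actually delete a file, change a configuration item, release a resource, remount a file system, or restart a component.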
Based on the above technical solution, in the embodiments of the present disclosure, host faults can be discovered automatically, efficiently, and conveniently. This achieves automatic discovery of host faults in a big data cluster and addresses problems such as the high complexity of monitoring and operation and maintenance and the difficulty of fault discovery in a big data cluster.
In addition, host faults can be recovered automatically, efficiently, and conveniently. This achieves automatic recovery of host faults in a big data cluster, addresses problems such as the high complexity of monitoring and operation and maintenance and the difficulty of fault recovery in a big data cluster, and thereby improves the recovery efficiency of the host.
Based on the same concept as the above method, an embodiment of the present disclosure further provides a fault discovery apparatus. Referring to FIG. 3, the apparatus includes:
an obtaining module 301, configured to obtain the service name and the component name of a service component deployed on a host in a big data cluster;
a determining module 302, configured to determine, from a plurality of pre-stored configuration files, a target configuration file that includes the service name and the component name, where each configuration file includes a service name, a component name, and a fault type stored in association with one another; and
a sending module 303, configured to send the fault type included in the target configuration file to the host, so that the host performs fault discovery according to the fault discovery policy corresponding to the fault type.
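The cooperation of the determining module 302 and the sending module 303 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `ConfigFile` dataclass, its field names, and the two functions are hypothetical, and the pre-stored configuration files are modelled as an in-memory list, although the disclosure does not fix a file format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ConfigFile:
    """Hypothetical configuration file: three fields stored in association."""
    service_name: str
    component_name: str
    fault_type: str


def find_target_config(configs: List[ConfigFile],
                       service_name: str,
                       component_name: str) -> Optional[ConfigFile]:
    """Return the first configuration file that includes both names, if any."""
    for config in configs:
        if (config.service_name == service_name
                and config.component_name == component_name):
            return config
    return None


def fault_type_for_host(configs: List[ConfigFile],
                        service_name: str,
                        component_name: str) -> Optional[str]:
    """Return the fault type to send to the host, or None if nothing matches."""
    target = find_target_config(configs, service_name, component_name)
    return target.fault_type if target is not None else None
```

The host that receives the returned fault type would then select its own fault discovery policy for that type; that step runs on the host and is outside this sketch.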
In one example, the fault discovery apparatus further includes a receiving module (not shown in the figure), configured to receive a fault message sent by the host, where the fault message notifies the fault discovery apparatus that a fault has occurred on the host.
In one example, the sending module 303 is further configured to, when the fault message sent by the host is received, send an alarm message in the alarm mode included in the target configuration file, where the alarm message includes the service name and the component name included in the target configuration file and information about the host.
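As a hedged illustration of the alarm message described above, the Python sketch below assembles a message from the target configuration file and the host information. The dictionary keys, the function name `build_alarm_message`, and the example alarm mode are assumptions for illustration; the disclosure states only which pieces of information the alarm message contains.

```python
from typing import Dict


def build_alarm_message(target_config: Dict[str, str], host_info: str) -> Dict[str, str]:
    """Assemble an alarm message from the target configuration file and host info."""
    return {
        # The alarm mode stored in the configuration file selects the channel.
        "alarm_mode": target_config["alarm_mode"],
        "service_name": target_config["service_name"],
        "component_name": target_config["component_name"],
        "host": host_info,
    }
```

For instance, with a configuration file whose alarm mode is a hypothetical value such as `"email"`, the resulting dictionary could be handed to whatever channel implements that mode.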
In one example, the fault message carries a fault feature and a fault type, and the fault discovery apparatus further includes a retrieval module (not shown in the figure), configured to retrieve a fault recovery policy matching the fault feature and the fault type. The sending module 303 is further configured to, when the fault recovery policy is retrieved, send the fault recovery policy to the host, so that the host performs fault recovery according to the fault recovery policy.
In one example, the fault discovery apparatus further includes a prompting module (not shown in the figure), configured to prompt the user to recover the host from the fault when the fault discovery apparatus does not retrieve the fault recovery policy.
In one example, the fault discovery apparatus further includes a recording module (not shown in the figure), configured to obtain the fault recovery policy used by the user to recover the host from the fault, and to record the obtained fault recovery policy in association with the fault feature and the fault type.
In terms of hardware, a schematic diagram of the hardware architecture of the fault discovery apparatus provided by the embodiments of the present disclosure is shown in FIG. 4. The fault discovery apparatus may include a processor and a machine-readable storage medium storing machine-executable instructions. The processor may communicate with the machine-readable storage medium and, by reading and executing the machine-executable instructions stored in it, perform the fault discovery method described above.
Here, the machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disc (such as an optical disc or DVD), a similar storage medium, or a combination thereof.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email transceiver, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described in terms of functionally divided units. Of course, when the present disclosure is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art will appreciate that the embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Moreover, these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are merely embodiments of the present disclosure and are not intended to limit it. Those skilled in the art may make various modifications and variations to the present disclosure. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present disclosure shall fall within the scope of the claims of the present disclosure.

Claims (13)

  1. A fault discovery method, comprising: obtaining, by a fault discovery apparatus, a service name and a component name of a service component deployed on a host in a big data cluster; determining, by the fault discovery apparatus, from a plurality of pre-stored configuration files, a target configuration file that includes the service name and the component name, wherein each configuration file includes a service name, a component name, and a fault type stored in association with one another; and sending, by the fault discovery apparatus, the fault type included in the target configuration file to the host, so that the host performs fault discovery according to a fault discovery policy corresponding to the fault type.
  2. The fault discovery method according to claim 1, further comprising: receiving, by the fault discovery apparatus, a fault message sent by the host, wherein the fault message is used to notify that a fault has occurred on the host.
  3. The fault discovery method according to claim 2, wherein each configuration file includes a service name, a component name, a fault type, and an alarm mode stored in association with one another, and the fault discovery method further comprises: when the fault message sent by the host is received, sending, by the fault discovery apparatus, an alarm message in the alarm mode included in the target configuration file, wherein the alarm message includes the service name and the component name included in the target configuration file and information about the host.
  4. The fault discovery method according to claim 2, wherein the fault message includes a fault feature and a fault type, and the fault discovery method further comprises: retrieving, by the fault discovery apparatus, a fault recovery policy matching the fault feature and the fault type; and when the fault discovery apparatus retrieves the fault recovery policy, sending, by the fault discovery apparatus, the fault recovery policy to the host, so that the host performs fault recovery according to the fault recovery policy.
  5. The fault discovery method according to claim 4, further comprising: when the fault discovery apparatus does not retrieve the fault recovery policy, prompting, by the fault discovery apparatus, a user to recover the host from the fault.
  6. The fault discovery method according to claim 5, further comprising:
    obtaining, by the fault discovery apparatus, the fault recovery policy used by the user to recover the host from the fault, and recording the obtained fault recovery policy in association with the fault feature and the fault type.
  7. A fault discovery apparatus, comprising:
    a processor; and
    a machine-readable storage medium storing machine-executable instructions,
    wherein, by reading and executing the machine-executable instructions, the processor is caused to:
    obtain a service name and a component name of a service component deployed on a host in a big data cluster;
    determine, from a plurality of pre-stored configuration files, a target configuration file that includes the service name and the component name, wherein each configuration file includes a service name, a component name, and a fault type stored in association with one another; and
    send the fault type included in the target configuration file to the host, so that the host performs fault discovery according to a fault discovery policy corresponding to the fault type.
  8. The fault discovery apparatus according to claim 7, wherein the machine-executable instructions further cause the processor to:
    receive a fault message sent by the host, wherein the fault message is used to notify that a fault has occurred on the host.
  9. The fault discovery apparatus according to claim 8, wherein each configuration file includes a service name, a component name, a fault type, and an alarm mode stored in association with one another, and
    the machine-executable instructions further cause the processor to:
    when the fault message sent by the host is received, send an alarm message in the alarm mode included in the target configuration file, wherein the alarm message includes the service name and the component name included in the target configuration file and information about the host.
  10. The fault discovery apparatus according to claim 8, wherein the fault message includes a fault feature and a fault type, and
    the machine-executable instructions further cause the processor to:
    retrieve a fault recovery policy matching the fault feature and the fault type; and
    when the fault recovery policy is retrieved, send the fault recovery policy to the host, so that the host performs fault recovery according to the fault recovery policy.
  11. The fault discovery apparatus according to claim 10, wherein the machine-executable instructions further cause the processor to:
    when the fault recovery policy is not retrieved, prompt a user to recover the host from the fault.
  12. The fault discovery apparatus according to claim 11, wherein the machine-executable instructions further cause the processor to:
    obtain the fault recovery policy used by the user to recover the host from the fault, and record the obtained fault recovery policy in association with the fault feature and the fault type.
  13. A machine-readable storage medium, comprising instructions that, when executed by a machine, cause the machine to:
    obtain a service name and a component name of a service component deployed on a host in a big data cluster;
    determine, from a plurality of pre-stored configuration files, a target configuration file that includes the service name and the component name, wherein each configuration file includes a service name, a component name, and a fault type stored in association with one another; and
    send the fault type included in the target configuration file to the host, so that the host performs fault discovery according to a fault discovery policy corresponding to the fault type.
PCT/CN2018/091997 2017-06-21 2018-06-20 Fault discovery WO2018233630A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710474280.3 2017-06-21
CN201710474280.3A CN108289034B (en) 2017-06-21 2017-06-21 A kind of fault discovery method and apparatus

Publications (1)

Publication Number Publication Date
WO2018233630A1 true WO2018233630A1 (en) 2018-12-27

Family

ID=62831422

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/091997 WO2018233630A1 (en) 2017-06-21 2018-06-20 Fault discovery

Country Status (2)

Country Link
CN (1) CN108289034B (en)
WO (1) WO2018233630A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4221004A4 (en) * 2020-10-20 2024-02-21 Huawei Tech Co Ltd Method, apparatus, and system for determining fault recovery plan, and computer storage medium

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158962B (en) * 2018-11-07 2023-10-13 中移信息技术有限公司 Remote disaster recovery method, device and system, electronic equipment and storage medium
CN112804072B (en) * 2019-11-14 2023-05-16 深信服科技股份有限公司 Fault information collection method and device, target electronic equipment and storage medium
CN110880990B (en) * 2019-11-29 2022-08-23 绿盟科技集团股份有限公司 Configuration checking method and device for big data cluster component and computing equipment
CN113055203B (en) * 2019-12-26 2023-04-18 中国移动通信集团重庆有限公司 Method and device for recovering exception of SDN control plane
CN111258851B (en) * 2020-01-14 2024-03-01 广州虎牙科技有限公司 Cluster alarm method, device, setting and storage medium
CN111459749A (en) * 2020-03-18 2020-07-28 平安科技(深圳)有限公司 Prometous-based private cloud monitoring method and device, computer equipment and storage medium
CN113760634A (en) * 2020-09-04 2021-12-07 北京沃东天骏信息技术有限公司 Data processing method and device
CN113407374A (en) * 2021-06-22 2021-09-17 未鲲(上海)科技服务有限公司 Fault processing method and device, fault processing equipment and storage medium
CN115134212B (en) * 2022-06-29 2024-04-19 中国工商银行股份有限公司 Policy pushing method, device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103812699A (en) * 2014-02-17 2014-05-21 无锡华云数据技术服务有限公司 Monitoring management system based on cloud computing
US20150039735A1 (en) * 2012-02-07 2015-02-05 Cloudera, Inc. Centralized configuration of a distributed computing cluster
CN105337765A (en) * 2015-10-10 2016-02-17 上海新炬网络信息技术有限公司 Distributed hadoop cluster fault automatic diagnosis and restoration system
CN106789398A (en) * 2016-11-25 2017-05-31 中国传媒大学 A kind of method of media big data hadoop cluster monitoring
CN106844132A (en) * 2015-12-03 2017-06-13 北京国双科技有限公司 The fault repairing method and device of cluster server

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593135A (en) * 2008-05-29 2009-12-02 国际商业机器公司 In distributed integrated environment, focus on the apparatus and method of business process failure
CN102882909B (en) * 2011-07-15 2015-05-06 易云捷讯科技(北京)有限公司 Cloud computing service monitoring system and method thereof
CN102916830B (en) * 2012-09-11 2013-12-11 北京航空航天大学 Implement system for resource service optimization allocation fault-tolerant management
CN103368771A (en) * 2013-06-24 2013-10-23 华为技术有限公司 Collecting method and device for fault site information of multi-node server system
CN103778031B (en) * 2014-01-15 2017-01-18 华中科技大学 Distributed system multilevel fault tolerance method under cloud environment
CN105515812A (en) * 2014-10-15 2016-04-20 中兴通讯股份有限公司 Fault processing method of resources and device
CN105630647A (en) * 2014-11-28 2016-06-01 中兴通讯股份有限公司 Equipment detection method and detection equipment
CN106130778A (en) * 2016-07-18 2016-11-16 浪潮电子信息产业股份有限公司 A kind of method processing clustering fault and a kind of management node
CN106341281A (en) * 2016-11-10 2017-01-18 福州智永信息科技有限公司 Distributed fault detection and recovery method of linux server

Also Published As

Publication number Publication date
CN108289034B (en) 2019-04-09
CN108289034A (en) 2018-07-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18819693

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18819693

Country of ref document: EP

Kind code of ref document: A1