WO2018233630A1

WO2018233630A1 - Fault discovery

Info

Publication number: WO2018233630A1
Application number: PCT/CN2018/091997
Authority: WO
Inventors: 黄雷; 洪福成
Original assignee: 新华三大数据技术有限公司
Priority date: 2017-06-21
Filing date: 2018-06-20
Publication date: 2018-12-27
Also published as: CN108289034B; CN108289034A

Abstract

A fault discovery device obtains the service name and component name of a service component deployed on a host in a big data cluster; the fault discovery device determines a target configuration file comprising the service name and the component name from a plurality of pre-stored configuration files, wherein the configuration file comprises a service name, component name and fault type that are stored in association with each other; and the fault discovery device sends the fault type comprised in the target configuration file to the host so that the host carries out fault discovery according to a fault discovery policy corresponding to the fault type.

Description

Fault discovery

Cross-reference to related applications

The present disclosure is based on and claims the benefit of priority to an earlier patent application, the entire disclosure of which is incorporated herein by reference.

Background

Big data, also known as massive data, has the following characteristics: large data volume, for example more than 10 terabytes, usually forming a large data set; wide data variety, with data coming from multiple data sources in rich types and formats, such as structured data, semi-structured data, and unstructured data; fast data processing, with data processed in real time even when the data volume is large; and high data veracity, since with the rise of social data, enterprise content, transaction data, and application data, effective information is needed to ensure the authenticity and security of the data.

With the advent of the era of big data, big data brings convenience to users, and it also poses new challenges to operation and maintenance management. For example, in order to implement the related functions of big data, a large number of hosts need to be deployed in a big data cluster. How to efficiently and conveniently discover the faults of these hosts becomes a problem of operation and maintenance management.

Brief description of the drawings

FIG. 1 is a schematic diagram of an application scenario in an embodiment of the present disclosure;

FIG. 2 is a flow chart of a fault discovery method in an embodiment of the present disclosure;

FIG. 3 is a functional block diagram of a fault discovery device in an embodiment of the present disclosure;

FIG. 4 is a hardware configuration diagram of a fault discovery device in an embodiment of the present disclosure.

Detailed description

The terms used in the embodiments of the present disclosure are for the purpose of describing specific embodiments only and are not intended to limit the disclosure. The singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items. Moreover, depending on the context, the word "if" may be interpreted to mean "at the time of", "when", or "in response to determining".

In the embodiment of the present disclosure, a fault discovery method is proposed, which may be applied to a big data cluster (also referred to as a big data system), and the big data cluster may include multiple hosts for processing big data services. Each host deploys a service component and processes big data services through the service component.

Referring to FIG. 1, the big data cluster includes the host 11, the host 12, and the host 13; in practice, the number of hosts may be much larger. In addition, each host can deploy service components for handling big data services, and the service components of different hosts can be the same or different.

For example, the host 11 deploys a NameNode component of the HDFS (Hadoop Distributed File System) service. Based on the NameNode component, the host 11 can implement the following big data services: managing data block mappings, processing client read and write requests, configuring replication policies, managing the HDFS namespace, and so on. For another example, the host 12 deploys a DataNode component of the HDFS service. Based on the DataNode component, the host 12 can implement the following big data services: storing data blocks of the client, performing data block read and write operations, and periodically sending heartbeat information to the NameNode.

Of course, the above are only a few examples of service components, and actual applications are not limited to them. For example, the host can deploy the split component, the sort component, the composite component, and so on of the MapReduce service, or the resource manager component, the application management component, and so on of the YARN (Yet Another Resource Negotiator) service. The service components are not limited in the present disclosure.

In the embodiment of the present disclosure, a fault finding device is also provided. The fault discovery device can be deployed on any host in a big data cluster or on any device outside the big data cluster. In addition, the fault finding device communicates with the host in the big data cluster to enable the fault finding device to perform fault finding and fault recovery on the host.

In the embodiment of the present disclosure, a plurality of configuration files may be pre-stored locally on the fault discovery device or on any other device accessible by the fault discovery device. Each configuration file may include, but is not limited to, one or any combination of the following: identifier, file name, description information, cluster name, service name, component name, fault type, alarm mode, and so on. A configuration file can be written manually by the user, or can be generated by machine learning from, for example, the past fault history of various service components. For example, if the fault discovery device learns that fault B has occurred N times for service component A, a configuration file including the service name and component name of service component A and the fault type of fault B may be added automatically.
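
The automatic addition described above could be sketched as follows. This is a minimal illustration only: a plain occurrence count stands in for the learning step, and the threshold value, the tuple layout of the fault history, and the dictionary keys are all assumptions made for the example rather than anything the disclosure specifies.

```python
from collections import Counter

THRESHOLD_N = 3  # assumed value of N; the disclosure leaves N open


def learn_configs(fault_history, configs, threshold=THRESHOLD_N):
    """Return configs extended with one entry per (service, component,
    fault type) combination that occurs at least `threshold` times."""
    known = {(c["service_name"], c["component_name"], c["fault_type"])
             for c in configs}
    result = list(configs)
    for key, count in Counter(fault_history).items():
        if count >= threshold and key not in known:
            service, component, fault_type = key
            result.append({
                "id": len(result) + 1,  # hypothetical identifier scheme
                "service_name": service,
                "component_name": component,
                "fault_type": fault_type,
            })
    return result


# Hypothetical history: a PORT fault was seen three times on the NameNode,
# and a METRICS fault only once on the DataNode.
history = [("HDFS", "NameNode", "PORT")] * 3 + [("HDFS", "DataNode", "METRICS")]
learned = learn_configs(history, [])
print(learned)
```

Only the NameNode combination reaches the assumed threshold, so only one configuration is added; running the function again with the result as input adds nothing further.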

The identifier may be a unique identifier of the configuration file. For example, there are two configuration files. The identifier of the first configuration file is 1, and the configuration file may be referred to as configuration file 1, and the identifier of the second configuration file is 2, and the configuration file can be referred to as configuration file 2 later.

The file name is the name of the configuration file and can be selected according to actual needs. The names of different configuration files may be the same or different. Moreover, the name of the configuration file may be in Chinese, English, or another language; the language of the name is not limited. For example, the name of configuration file 1 is Failure-finding_A and the name of configuration file 2 is Failure-finding_B.

The description information is a brief description of the configuration file, and can describe the function of the configuration file, the generation time of the configuration file, and the validity period of the configuration file. The description information is not limited.

The cluster name is the name of the big data cluster. For example, for the big data cluster composed of the host 11, the host 12, and the host 13, the cluster name may be "crs".

The service name is a service name corresponding to a service component for processing a big data service, such as an HDFS service, a MapReduce service, and a YARN service. The service name of the configuration file 1 is the HDFS service, and the service name of the configuration file 2 is the HDFS service.

The component name is a component name corresponding to a service component for processing a big data service, such as a NameNode component, a DataNode component, a split component, a sort component, a composite component, a resource manager component, an application management component, and the like. The component name of the configuration file 1 is the NameNode component, and the component name of the configuration file 2 is the DataNode component.

The fault type may include, but is not limited to, one or any combination of the following: port type (PORT), network type (WEB), performance indicator type (METRICS), and custom type (CUSTOM). The port type indicates detecting whether a port of the host is faulty, such as whether the port is down. The network type indicates detecting whether the network of the host is faulty, such as whether the network is reachable. The performance indicator type indicates detecting whether a performance indicator of the host is faulty, such as whether the CPU usage or the memory usage reaches a threshold. The custom type is a fault type that the user can define, that is, the user can select the fault type to be detected according to actual needs.

The alarm mode may include but is not limited to one or any combination of the following: WEB, EMAIL, SNMP (Simple Network Management Protocol), and the like.

In one example, the above configuration file may be a file in JSON (JavaScript Object Notation) format, or may be in another format; the format is not limited.
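
As an illustration only, configuration file 1 and configuration file 2 might look like the following when expressed as JSON and loaded in Python. The key names, the description text, and the fault types and alarm modes assigned to each file are assumptions made for this sketch; the disclosure lists the kinds of content a configuration file may hold but does not define a schema.

```python
import json

# Hypothetical JSON layout for configuration file 1. All key names are
# illustrative assumptions, not a schema defined by the disclosure.
CONFIG_1_JSON = """
{
  "id": 1,
  "file_name": "Failure-finding_A",
  "description": "Fault discovery for the HDFS NameNode",
  "cluster_name": "crs",
  "service_name": "HDFS",
  "component_name": "NameNode",
  "fault_types": ["PORT"],
  "alarm_modes": ["WEB", "EMAIL"]
}
"""

config_1 = json.loads(CONFIG_1_JSON)

# Configuration file 2 shares the cluster and service but differs in its
# identifier, file name, component name, and (assumed) fault type.
config_2 = dict(
    config_1,
    id=2,
    file_name="Failure-finding_B",
    description="Fault discovery for the HDFS DataNode",
    component_name="DataNode",
    fault_types=["METRICS"],
)

print(json.dumps(config_2, indent=2))
```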

In one example, the fault discovery device can provide a RESTful API (Representational State Transfer Application Programming Interface) that allows a third party to create, modify, and delete configuration files.

Based on the foregoing application scenario, as shown in FIG. 2, the fault finding method of the embodiment of the present disclosure may include steps 201 to 203.

Step 201: The fault finding device acquires a service name and a component name of a service component deployed on a host in the big data cluster.

In an example, the host can obtain the service name and component name of the service component deployed on the host, and actively send the service name and the component name to the fault discovery device, so that the fault discovery device can obtain the service name and the component name. In another example, when the fault discovery device needs to perform fault discovery on the host, it can send a request message to the host, the request message being used to request the service name and the component name. After receiving the request message, the host can send the service name and the component name of the service component deployed on the host to the fault discovery device, so that the fault discovery device can obtain the service name and the component name.

Since the host 11 deploys the NameNode component of the HDFS service, the service name corresponding to the big data service handled by the host 11 is the HDFS service and the component name is the NameNode component. The host 11 can send its service name (the HDFS service) and component name (the NameNode component) to the fault discovery device, so the fault discovery device learns that the service name of the host 11 is the HDFS service and the component name is the NameNode component. Likewise, since the host 12 deploys the DataNode component of the HDFS service, the service name of the big data service handled by the host 12 is the HDFS service and the component name is the DataNode component. The host 12 can send its service name (the HDFS service) and component name (the DataNode component) to the fault discovery device, so the fault discovery device learns that the service name of the host 12 is the HDFS service and the component name is the DataNode component.

The step 201 may be set to be performed periodically, or may be set to be executed when a predetermined condition is met, or may be set to be executed in response to a user request, which is not limited in the present disclosure.

Step 202: The fault finding device determines a target configuration file including a service name and a component name from a plurality of configuration files stored in advance.

In an example, the fault discovery device may query the plurality of pre-stored configuration files by using the service name and the component name corresponding to the host, and determine, from the plurality of configuration files, a target configuration file including the service name and the component name.

For example, the fault finding device queries the plurality of configuration files by using the HDFS service and the NameNode component corresponding to the host 11, and can determine the configuration file 1 including the HDFS service and the NameNode component, that is, the configuration file 1 is the target configuration file. The fault finding device queries the plurality of configuration files by using the HDFS service and the DataNode component of the host 12, and can determine the configuration file 2 including the HDFS service and the DataNode component, that is, the configuration file 2 is the target configuration file.
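
The query in step 202 can be sketched as a lookup over the pre-stored configuration files. The dictionary layout and the fault types assigned to each file are assumptions for illustration; the function name is hypothetical.

```python
# Pre-stored configuration files, reduced to the fields used for matching.
# The dictionary layout is an assumption made for this example.
CONFIGS = [
    {"id": 1, "service_name": "HDFS", "component_name": "NameNode",
     "fault_types": ["PORT"]},
    {"id": 2, "service_name": "HDFS", "component_name": "DataNode",
     "fault_types": ["METRICS"]},
]


def find_target_config(configs, service_name, component_name):
    """Return the first configuration file that contains both the service
    name and the component name, or None if there is no such file."""
    for config in configs:
        if (config["service_name"] == service_name
                and config["component_name"] == component_name):
            return config
    return None


# Host 11 reports (HDFS, NameNode); host 12 reports (HDFS, DataNode).
target_for_host_11 = find_target_config(CONFIGS, "HDFS", "NameNode")
target_for_host_12 = find_target_config(CONFIGS, "HDFS", "DataNode")
print(target_for_host_11["id"], target_for_host_12["id"])
```

Matching on the pair rather than on the service name alone matters here, because both configuration files share the HDFS service name.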

Step 203: The fault discovery device sends the fault type included in the target configuration file to the host, so that the host performs fault discovery according to the fault discovery policy corresponding to the fault type.

After receiving the fault type sent by the fault finding device, the host may perform the following steps A to C.

Step A: The host receives the fault type included in the target configuration file sent by the fault finding device.

For example, the fault finding device may transmit the fault type included in the configuration file 1 to the host 11, and the host receives the fault type included in the configuration file 1.

For another example, the fault finding device may transmit the fault type included in the configuration file 2 to the host 12, and the host receives the fault type included in the configuration file 2.

In one example, the fault discovery device can generate a fault probing plan 1 that carries the fault type in configuration file 1. The fault discovery device sends the fault probing plan 1 to the host 11, and after receiving the fault probing plan 1, the host 11 can parse the fault type from it. In addition to the fault type, the fault probing plan 1 can carry other content of configuration file 1, such as the identifier, the file name, the description information, the cluster name, the service name, the component name, and the alarm mode; the content of the fault probing plan 1 is not limited. Similarly, the fault discovery device may generate a fault probing plan 2 that carries the fault type in configuration file 2 and send it to the host 12; after receiving the fault probing plan 2, the host 12 can parse the fault type from it.

In one example, the fault finding device may periodically send the fault probing plan 1 / the fault probing plan 2, such as sending the fault probing plan 1 / the fault probing plan 2 every 10 seconds, and there is no restriction on the sending period.
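
A fault probing plan could be built and parsed along the following lines. This is a sketch under assumptions: the disclosure does not define a wire format, so JSON is used here for illustration, and the function names and key names are hypothetical. The periodic sending (for example every 10 seconds) is left out to keep the example short.

```python
import json

# Optional fields that a plan may carry in addition to the fault type.
EXTRA_FIELDS = ("id", "file_name", "description", "cluster_name",
                "service_name", "component_name", "alarm_modes")


def build_probing_plan(config, include_extra=True):
    """Device side: serialize a plan that always carries the fault types
    and optionally the other content of the configuration file."""
    plan = {"fault_types": config["fault_types"]}
    if include_extra:
        for key in EXTRA_FIELDS:
            if key in config:
                plan[key] = config[key]
    return json.dumps(plan)


def parse_fault_types(plan_message):
    """Host side: parse the fault types out of a received plan."""
    return json.loads(plan_message)["fault_types"]


# Hypothetical content of configuration file 1.
config_1 = {"id": 1, "file_name": "Failure-finding_A", "service_name": "HDFS",
            "component_name": "NameNode", "fault_types": ["PORT"],
            "alarm_modes": ["WEB"]}

plan_1 = build_probing_plan(config_1)
print(parse_fault_types(plan_1))
```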

Step B: The host queries the fault discovery policy corresponding to the fault type.

Step C: The host performs fault discovery according to the fault discovery policy.

In an example, the correspondence between fault types and fault discovery policies can be configured on the host 11, such as the correspondence between the port type and fault discovery policy 1, and the correspondence between the performance indicator type and fault discovery policy 2. Assuming that the fault type obtained by the host 11 is the port type, the host 11 can query fault discovery policy 1 corresponding to the port type and perform fault discovery according to it, that is, detect whether a port of the host 11 is faulty, such as whether the port is down.

In another example, the correspondence between fault types and fault discovery policies can be configured on the host 12, such as the correspondence between the port type and fault discovery policy 1, and the correspondence between the performance indicator type and fault discovery policy 3 (which differs from fault discovery policy 2 described above). Assuming that the fault type obtained by the host 12 is the performance indicator type, the host 12 can query fault discovery policy 3 corresponding to the performance indicator type and perform fault discovery according to it, that is, detect whether a performance indicator of the host 12 is faulty, such as whether the CPU usage or the memory usage reaches a threshold.

In one example, the content of the fault discovery strategy 1 is not limited as long as the host 11 can perform fault discovery according to the fault discovery policy 1, and the host 12 can perform fault discovery according to the fault discovery policy 1. For example, the fault discovery policy 1 includes configuration information for detecting whether a host port has a fault, a detection flow, and the like, and based on the content, it is possible to detect whether the host port has a fault. In addition, the content of the fault finding policy 2 and the fault finding policy 3 are not limited, as long as the fault finding of the host can be performed according to the fault finding policy, and details are not described herein again.
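
Steps B and C on the host can be sketched as a table from fault type to policy function. The disclosure leaves the content of each policy open, so the two policies below are assumptions: the port policy treats a failed TCP connection as a down port, and the performance indicator policy compares supplied readings against assumed 90 percent thresholds. The function names and the shape of the context dictionary are hypothetical.

```python
import socket

CPU_THRESHOLD = 90.0     # percent; assumed value
MEMORY_THRESHOLD = 90.0  # percent; assumed value


def port_policy(context):
    """Assumed port-type policy: the port is faulty if a TCP connection
    to it cannot be opened within one second."""
    try:
        with socket.create_connection(
                (context["address"], context["port"]), timeout=1.0):
            return []
    except OSError:
        return [f"port {context['port']} is down"]


def metrics_policy(context):
    """Assumed performance-indicator policy: compare readings supplied
    by the caller against fixed thresholds."""
    faults = []
    if context["cpu_percent"] >= CPU_THRESHOLD:
        faults.append("CPU usage reached threshold")
    if context["memory_percent"] >= MEMORY_THRESHOLD:
        faults.append("memory usage reached threshold")
    return faults


# Correspondence between fault type and fault discovery policy.
POLICIES = {"PORT": port_policy, "METRICS": metrics_policy}


def discover_faults(fault_type, context):
    """Step B: look up the policy. Step C: run it and return the faults."""
    return POLICIES[fault_type](context)


print(discover_faults("METRICS", {"cpu_percent": 95.0, "memory_percent": 40.0}))
```

Because the table is held per host, host 11 and host 12 can map the same fault type to different policies simply by registering different functions.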

After the host performs fault discovery according to the fault discovery policy, the following fault recovery steps D to F may also be involved:

Step D: Processing performed by the host when it finds that a fault has occurred.

In an example, after performing fault discovery according to the fault discovery policy, if the host finds that it has failed, it determines the fault feature and the fault type corresponding to the fault. The host may then send a fault message to the fault discovery device. The fault message notifies the fault discovery device that the host has failed, and may carry the fault feature and the fault type.

The above fault features may include, but are not limited to, one or any combination of the following: hardware features, system features, service component features, and running log features. The hardware features may be a CPU feature of the host (such as CPU usage), a memory feature (such as memory usage), and a disk feature (such as disk usage); the hardware features are not limited. The system features may be the operating system type (such as Windows or Linux), the operating system version, and so on; the system features are not limited. The service component features may be features related to the service component, such as whether the port of the service component is enabled, whether the service component is in a running state, whether the network state of the service component is abnormal, and whether the service component can process requests; the service component features are not limited. The running log features may be features extracted from the running log, such as the running time of the host, the programs running on the host, and the network behavior of the host. Of course, the above are only a few examples of fault features; the fault features are not limited, and all fault-related features are within the scope of the present disclosure.

For example, if the host 11 performs fault discovery according to fault discovery policy 1 corresponding to the port type and finds that the host 11 has failed, it determines that the fault type corresponding to the fault is the port type, and obtains the fault features corresponding to the fault according to the current state of the host 11, such as the current CPU, memory, and disk features of the host 11, the operating system type and operating system version of the host 11, the features related to the service component, and the running log features in the running log of the host 11.

For another example, if the host 12 performs fault discovery according to fault discovery policy 3 corresponding to the performance indicator type and finds that the host 12 has failed, it determines that the fault type corresponding to the fault is the performance indicator type, and obtains the fault features corresponding to the fault according to the current state of the host 12.
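
The fault message sent in step D could be assembled as in the following sketch. The message layout, the key names, and the host identifier are assumptions for illustration. The operating system type and version are read with the standard platform module, while the hardware and service component readings are passed in by the caller so that the example does not depend on any monitoring library.

```python
import json
import platform


def build_fault_message(host_id, fault_type, cpu_percent, memory_percent,
                        component_running):
    """Host side: bundle the fault type with fault features grouped as
    hardware, system, and service component features."""
    fault_features = {
        "hardware": {"cpu_percent": cpu_percent,
                     "memory_percent": memory_percent},
        "system": {"os_type": platform.system(),
                   "os_version": platform.release()},
        "service_component": {"running": component_running},
    }
    return json.dumps({"host": host_id,
                       "fault_type": fault_type,
                       "fault_features": fault_features})


# Hypothetical readings taken on host 11 after a port fault was found.
message = build_fault_message("host-11", "PORT", 35.0, 60.0,
                              component_running=False)
print(message)
```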

Step E: Processing performed by the fault discovery device when it finds that the host has failed. The fault discovery device may handle the failure in one of the following three manners.

Manner 1: After receiving the fault message sent by the host, the fault discovery device sends an alarm message according to the alarm mode included in the target configuration file, where the alarm message may carry the service name and component name included in the target configuration file, and information about the host (such as the IP address of the host and the identifier of the host). Of course, the content of the alarm message is not limited to the foregoing. For example, the alarm message may also carry the identifier, the file name, the description information, the cluster name, and so on included in the target configuration file.

The alarm mode included in the configuration file may be one or more of WEB, EMAIL, and SNMP. Therefore, the fault discovery device may send an alarm message by using an alarm manner included in the target configuration file.

For example, after the fault discovery device sends the fault type included in configuration file 1 to the host 11, if it receives a fault message sent by the host 11, it sends an alarm message according to the alarm mode included in configuration file 1. The alarm message carries the service name and component name included in configuration file 1 and the information of the host 11.

In an example, after receiving the fault message sent by the host, the fault finding device may also display the content of the service name, the component name, and the host information included in the target configuration file on the WEB page.
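
Manner 1 can be sketched as a dispatch over the alarm modes listed in the target configuration file. The senders below are stand-ins that only record what would be sent; real WEB, EMAIL, or SNMP delivery is outside the scope of the sketch. The key names, the host information, and the IP address (taken from a documentation address range) are assumptions.

```python
def build_alarm_message(config, host_info):
    """Carry the service name, the component name, and the host information."""
    return {"service_name": config["service_name"],
            "component_name": config["component_name"],
            "host": host_info}


def send_alarm(config, host_info, senders):
    """Send the alarm once per alarm mode in the target configuration file
    and return the modes that were used."""
    message = build_alarm_message(config, host_info)
    used = []
    for mode in config["alarm_modes"]:
        senders[mode](message)
        used.append(mode)
    return used


outbox = []  # stand-in for real delivery channels


def make_sender(mode):
    def send(message):
        outbox.append((mode, message))
    return send


SENDERS = {mode: make_sender(mode) for mode in ("WEB", "EMAIL", "SNMP")}

# Hypothetical content of configuration file 1 and information of host 11.
config_1 = {"service_name": "HDFS", "component_name": "NameNode",
            "alarm_modes": ["WEB", "EMAIL"]}
host_11 = {"id": "host-11", "ip": "192.0.2.11"}

used_modes = send_alarm(config_1, host_11, SENDERS)
print(used_modes)
```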

Manner 2: After receiving the fault message sent by the host, the fault discovery device parses the fault feature and the fault type from the fault message. The fault discovery device then queries the feature library by using the fault feature and the fault type. If a fault recovery policy matching the fault feature and the fault type exists in the feature library, the fault discovery device sends the fault recovery policy to the host. If no fault recovery policy matching the fault feature and the fault type exists in the feature library, the user is prompted to recover the fault of the host. The feature library may be located locally on the fault discovery device or on any other device accessible by the fault discovery device.

In one example, the fault discovery device can establish a feature library for recording fault features, fault types, and fault recovery policies in association with each other. An entry can be understood as follows: when a fault of the given fault type has the given fault feature, the associated fault recovery policy can be used to recover the fault. For example, the feature library may record the fault feature A, the fault type A, and the fault recovery policy A in association with each other, record the fault feature B, the fault type B, and the fault recovery policy B in association with each other, and so on. Thus, when a fault of the fault type A has the fault feature A, the fault recovery policy A can be used to recover the fault.

For example, after the fault discovery device parses the fault feature A and the fault type A from the fault message, since the fault recovery policy A matching the fault feature A and the fault type A exists in the feature library, the fault discovery device sends the fault recovery policy A to the host. For another example, after the fault discovery device parses the fault feature C and the fault type C from the fault message, since no fault recovery policy matching the fault feature C and the fault type C exists in the feature library, the fault discovery device prompts the user to recover the fault of the host.

After the user recovers the fault of the host, the fault discovery device may also acquire, from the host, the fault recovery policy used by the user to recover the fault, and record the acquired fault recovery policy in the feature library in association with the fault feature and the fault type. In this way, the content of the feature library is continuously updated.

For example, because no fault recovery policy matching the fault feature C and the fault type C exists in the feature library, the fault discovery device prompts the user to recover the fault of the host. Assume that the user uses the fault recovery policy C to recover the fault of the host. After the recovery is completed, the host can send the fault recovery policy C to the fault discovery device. After the fault discovery device obtains, from the host, the fault recovery policy C used by the user to recover the fault, it records the fault recovery policy C in the feature library in association with the fault feature C and the fault type C.
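
The feature library behavior in Manner 2 can be sketched as a mapping keyed by the pair of fault feature and fault type. Representing a fault feature as a single string, the class and function names, and the returned action labels are all assumptions made for this example.

```python
class FeatureLibrary:
    """Records fault features, fault types, and fault recovery policies
    in association with each other."""

    def __init__(self):
        self._policies = {}

    def lookup(self, fault_feature, fault_type):
        """Return the matching fault recovery policy, or None."""
        return self._policies.get((fault_feature, fault_type))

    def record(self, fault_feature, fault_type, recovery_policy):
        """Store a policy, for example one the user applied manually."""
        self._policies[(fault_feature, fault_type)] = recovery_policy


def handle_fault(library, fault_feature, fault_type):
    """Decide between sending a policy to the host and prompting the user."""
    policy = library.lookup(fault_feature, fault_type)
    if policy is not None:
        return ("send_to_host", policy)
    return ("prompt_user", None)


library = FeatureLibrary()
library.record("feature A", "type A", "recovery policy A")

first = handle_fault(library, "feature A", "type A")   # known fault
second = handle_fault(library, "feature C", "type C")  # unknown fault

# After the user recovers the fault with policy C, the policy is recorded,
# so the same fault is handled automatically the next time.
library.record("feature C", "type C", "recovery policy C")
third = handle_fault(library, "feature C", "type C")
print(first, second, third)
```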

Manner 3: After receiving the fault message, the fault discovery device performs the processing of both Manner 1 and Manner 2.

For Manner 2 and Manner 3, if a fault recovery policy matching the fault feature and the fault type exists in the feature library, the fault discovery device sends the fault recovery policy to the host, and the host may perform the following step:

Step F: The fault recovery process when the host receives the fault recovery policy.

In one example, the host may receive the fault recovery policy sent by the fault discovery device and recover from its current fault according to that policy. Because the host sent the fault type and the fault feature corresponding to the fault to the fault discovery device, the fault recovery policy returned by the fault discovery device is the one recorded for that fault feature and fault type. This fault recovery policy can therefore recover a fault matching the fault feature and the fault type, that is, the current fault of the host.

In one example, the content of the fault recovery policy is not limited, as long as the host can recover from the fault according to it. For example, the fault recovery policy may include configuration information for fault recovery, a recovery procedure, recovery tools (such as deleting files, changing configurations, releasing resources, remounting, and restarting), and so on, based on which the fault can be recovered. Details are not repeated here.
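
Step F could be sketched as the host walking through the steps of a received fault recovery policy and invoking one recovery tool per step. The policy layout (a list of steps with an action name) is an assumption, as are the function names. Only file deletion is implemented for real; the restart tool is a stand-in that prints a message instead of restarting anything.

```python
import os
import tempfile


def delete_file(step):
    """Recovery tool: delete the file named in the step if it exists."""
    if os.path.exists(step["path"]):
        os.remove(step["path"])


def restart_service(step):
    """Stand-in recovery tool: a real host would restart the component."""
    print(f"restart requested for {step['service']}")


# Assumed mapping from action name to recovery tool.
RECOVERY_TOOLS = {"delete_file": delete_file, "restart": restart_service}


def apply_recovery_policy(policy):
    """Run each step of the policy in order and return the actions taken."""
    completed = []
    for step in policy["steps"]:
        RECOVERY_TOOLS[step["action"]](step)
        completed.append(step["action"])
    return completed


# Hypothetical policy: remove a stale file, then restart the DataNode.
fd, stale_path = tempfile.mkstemp()
os.close(fd)
policy = {"steps": [{"action": "delete_file", "path": stale_path},
                    {"action": "restart", "service": "DataNode"}]}

completed = apply_recovery_policy(policy)
print(completed)
```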

Based on the foregoing technical solution, in the embodiment of the present disclosure, faults of a host can be discovered automatically, efficiently, and conveniently. This realizes automatic discovery of host faults in a big data cluster and addresses problems such as the high operation and maintenance complexity of big data clusters and the difficulty of finding faults.

In addition, faults of a host can be recovered automatically, efficiently, and conveniently. This realizes automatic recovery of host faults in a big data cluster, addresses problems such as the high monitoring and operation complexity of big data clusters and the difficulty of fault recovery, and improves the recovery efficiency of the host.

Based on the same concept as the above method, a fault finding device is also proposed in the embodiment of the present disclosure. Referring to FIG. 3, the device includes:

an obtaining module 301, configured to obtain a service name and a component name of a service component deployed on a host in the big data cluster;

a determining module 302, configured to determine, from a plurality of pre-stored configuration files, a target configuration file including the service name and the component name, where each configuration file includes a service name, a component name, and a fault type stored in association with each other; and

a sending module 303, configured to send the fault type included in the target configuration file to the host, so that the host performs fault discovery according to a fault discovery policy corresponding to the fault type.

In one example, the fault finding apparatus further includes a receiving module (not shown in the figure) for receiving a fault message sent by the host, where the fault message is used to notify the host that a fault has occurred.

In an example, the sending module 303 is further configured to send, after the fault message sent by the host is received, an alarm message according to the alarm mode included in the target configuration file, where the alarm message includes the service name and component name included in the target configuration file, and the information of the host.

In one example, the fault message carries a fault feature and a fault type; the fault discovery device further includes a retrieval module (not shown in the figure) for retrieving a fault recovery strategy that matches the fault feature and the fault type. The sending module 303 is further configured to: if the fault recovery policy is retrieved, send the fault recovery policy to the host, so that the host performs fault recovery according to the fault recovery policy.

In one example, the fault discovery device further includes a prompting module (not shown in the figure) for prompting the user to recover the fault of the host if the fault discovery device does not retrieve the fault recovery policy.

In an example, the fault discovery device further includes a recording module (not shown in the figure), configured to acquire the fault recovery policy used by the user to recover the host, and to record the acquired fault recovery policy in association with the fault feature and the fault type.

The hardware architecture of the fault discovery device provided by the embodiment of the present disclosure may be as shown in FIG. 4. The fault discovery device can include a processor and a machine-readable storage medium storing machine-executable instructions. The processor can communicate with the machine-readable storage medium and, by reading and executing the machine-executable instructions in the machine-readable storage medium, can perform the fault discovery method described above.

Here, a machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and so forth. For example, the machine-readable storage medium may be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid state drive, any type of storage disk (such as an optical disc or a DVD), or a similar storage medium, or a combination thereof.

The system, device, module, or unit illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer, and the specific form of the computer may be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being divided by function into various units. Of course, when the present disclosure is practiced, the functions of the various units may be implemented in one or more pieces of software and/or hardware.

Those skilled in the art will appreciate that embodiments of the present disclosure can be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware aspects. Moreover, embodiments of the present disclosure may take the form of a computer program product embodied on one or more computer usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.

The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Moreover, these computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

The above description covers only embodiments of the present disclosure and is not intended to limit the disclosure. Those skilled in the art may make various changes and modifications to the present disclosure. Any modifications, equivalents, improvements, etc. made within the spirit and scope of the present disclosure are intended to be included within the scope of the appended claims.

Claims

A fault finding method, comprising: a fault finding device acquiring a service name and a component name of a service component deployed on a host in a big data cluster; the fault finding device determining, from a plurality of pre-stored configuration files, a target configuration file including the service name and the component name, wherein each configuration file includes a service name, a component name, and a fault type stored in association with each other; and the fault finding device sending the fault type included in the target configuration file to the host, so that the host performs fault discovery according to a fault discovery policy corresponding to the fault type.
The fault finding method according to claim 1, wherein the fault finding method further comprises: the fault finding device receiving a fault message sent by the host, the fault message being used to notify that a fault has occurred on the host.
The fault finding method according to claim 2, wherein the configuration file includes a service name, a component name, a fault type, and an alarm mode stored in association with each other, and the fault finding method further comprises: in the case of receiving the fault message sent by the host, the fault finding device sending an alarm message according to the alarm mode included in the target configuration file, wherein the alarm message includes the service name and the component name included in the target configuration file, and information about the host.
The fault finding method according to claim 2, wherein the fault message includes a fault feature and a fault type, and the fault finding method further comprises: the fault finding device retrieving a fault recovery policy that matches the fault feature and the fault type; and in the case that the fault finding device retrieves the fault recovery policy, the fault finding device sending the fault recovery policy to the host, so that the host performs fault recovery according to the fault recovery policy.
The fault finding method according to claim 4, wherein the fault finding method further comprises: in the case that the fault finding device does not retrieve the fault recovery policy, prompting a user to perform fault recovery on the host.
The fault finding method according to claim 5, wherein the fault finding method further comprises: the fault finding device acquiring the fault recovery policy used by the user to perform fault recovery on the host, and recording the acquired fault recovery policy in association with the fault feature and the fault type.
A fault finding device, comprising:

a processor; and

a machine readable storage medium storing machine executable instructions,

wherein, by reading and executing the machine executable instructions, the processor is caused to:

acquire a service name and a component name of a service component deployed on a host in a big data cluster;

determine, from a plurality of pre-stored configuration files, a target configuration file including the service name and the component name, wherein each configuration file includes a service name, a component name, and a fault type stored in association with each other; and

send the fault type included in the target configuration file to the host, so that the host performs fault discovery according to a fault discovery policy corresponding to the fault type.
The fault finding device according to claim 7, wherein the machine executable instructions further cause the processor to:

receive a fault message sent by the host, wherein the fault message is used to notify that a fault has occurred on the host.
The fault finding device according to claim 8, wherein the configuration file includes a service name, a component name, a fault type, and an alarm mode stored in association with each other, and

the machine executable instructions further cause the processor to:

in the case of receiving the fault message sent by the host, send an alarm message according to the alarm mode included in the target configuration file, wherein the alarm message includes the service name and the component name included in the target configuration file, and information about the host.
The fault finding device according to claim 8, wherein the fault message includes a fault feature and a fault type, and

the machine executable instructions further cause the processor to:

retrieve a fault recovery policy that matches the fault feature and the fault type; and

in the case that the fault recovery policy is retrieved, send the fault recovery policy to the host, so that the host performs fault recovery according to the fault recovery policy.
The fault finding device according to claim 10, wherein the machine executable instructions further cause the processor to:

in the case that the fault recovery policy is not retrieved, prompt a user to perform fault recovery on the host.
The fault finding device according to claim 11, wherein the machine executable instructions further cause the processor to:

acquire the fault recovery policy used by the user to perform fault recovery on the host, and record the acquired fault recovery policy in association with the fault feature and the fault type.
A machine readable storage medium comprising instructions that, when executed by a machine, cause the machine to:

acquire a service name and a component name of a service component deployed on a host in a big data cluster;

determine, from a plurality of pre-stored configuration files, a target configuration file including the service name and the component name, wherein each configuration file includes a service name, a component name, and a fault type stored in association with each other; and

send the fault type included in the target configuration file to the host, so that the host performs fault discovery according to a fault discovery policy corresponding to the fault type.
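
By way of non-limiting illustration only, the matching and sending steps recited above can be sketched in Python as follows. The sketch assumes that each configuration file has already been parsed into a record; the names `ConfigFile`, `find_target_config`, `dispatch_fault_type`, and `send_to_host` are hypothetical and do not appear in the disclosure, and the on-disk format of a configuration file is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class ConfigFile:
    """Hypothetical parsed form of one pre-stored configuration file."""
    service_name: str
    component_name: str
    fault_type: str
    alarm_mode: Optional[str] = None  # optional field, per the dependent claims


def find_target_config(
    configs: Iterable[ConfigFile], service_name: str, component_name: str
) -> Optional[ConfigFile]:
    # The target configuration file is the one whose service name and
    # component name both match those of the deployed service component.
    for cfg in configs:
        if cfg.service_name == service_name and cfg.component_name == component_name:
            return cfg
    return None


def dispatch_fault_type(
    configs: Iterable[ConfigFile],
    service_name: str,
    component_name: str,
    send_to_host: Callable[[str], None],
) -> Optional[str]:
    """Send the matching fault type to the host; return it, or None if no match."""
    target = find_target_config(configs, service_name, component_name)
    if target is None:
        return None
    # The host is then expected to run the fault discovery policy
    # that corresponds to this fault type.
    send_to_host(target.fault_type)
    return target.fault_type
```

The sketch returns `None` and sends nothing when no configuration file matches; how an unmatched service component should be handled is an assumption, since the claims do not address that case.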