CN110677419A - Cluster detection method and device

Cluster detection method and device

Info

Publication number
CN110677419A
Authority
CN
China
Prior art keywords
unit
resource information
fault
service
information
Prior art date
Legal status
Pending
Application number
CN201910940756.7A
Other languages
Chinese (zh)
Inventor
范亚平
马申跃
Current Assignee
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co Ltd filed Critical New H3C Big Data Technologies Co Ltd
Priority to CN201910940756.7A
Publication of CN110677419A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1433: Vulnerability analysis
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/069: Management of faults, events, alarms or notifications using logs of notifications; post-processing of notifications
    • H04L41/14: Network analysis or design
    • H04L41/147: Network analysis or design for predicting network behaviour

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a cluster detection method and device. A cluster detection method comprises the following steps: acquiring component parameters of a service component from a cluster according to acquired service component installation and deployment information, the component parameters including target resource information required by the service component; acquiring unit resource information of a unit in the cluster, where the unit comprises at least one server and the service component is deployed on the servers in the unit; checking whether the target resource information and the unit resource information are matched with a set resource condition when the service component has a vulnerability, where the resource condition is used for limiting the target resource information required by the service component and the unit resource information of the unit to which the service component belongs; and if so, determining that the service component has a vulnerability. The method can detect vulnerabilities of the service components in a cluster and provide a maintenance basis for maintenance personnel.

Description

Cluster detection method and device
Technical Field
The present application relates to big data processing technologies, and in particular, to a cluster detection method and apparatus.
Background
With the large increase in data volume, big data is more and more widely applied. By "big data", it is meant, first, that the amount of data is large (volume); second, that there are many data types (variety); and third, that the data is processed quickly (velocity).
Currently, big data services are typically implemented by clusters. The cluster here may also be referred to as a big data cluster and mainly includes a plurality of servers. The servers can be divided into at least one group, each group including at least one server, and at least one service component for providing big data services, such as HDFS (a distributed file system) or Spark (a distributed in-memory computing framework), is deployed on the servers of each group.
An existing cluster can manage big data but cannot identify vulnerabilities of the service components in the cluster; an alarm is usually raised only when an abnormality of a service component caused by such a vulnerability has already affected the operation of the whole cluster.
Disclosure of Invention
In view of the above technical problems, the present application provides a cluster detection method and apparatus to detect a vulnerability existing in a service component in a cluster. The technical scheme provided by the application is as follows:
in a first aspect, the present application provides a cluster detection method, including:
acquiring component parameters of a service component from the cluster according to acquired service component installation and deployment information; the component parameters include: target resource information required by the service component;
acquiring unit resource information of a unit in the cluster, wherein the unit comprises at least one server, and the server in the unit deploys the service component; the unit resource information includes: the sum of the idle resource information of each server in the unit;
checking whether the target resource information and the unit resource information are matched with a set resource condition when the service component has a vulnerability, wherein the resource condition is used for limiting the target resource information required by the service component and the unit resource information of the unit to which the service component belongs; and if so, determining that the service component has a vulnerability.
In a second aspect, the present application provides a cluster detection apparatus, including:
the component parameter acquiring unit is used for acquiring the component parameters of the service components from the cluster according to the acquired service component installation and deployment information; the above component parameters include: target resource information required by the service component;
a unit resource information obtaining unit, configured to obtain unit resource information of a unit in the cluster, where the unit includes at least one server, and the server in the unit deploys the service component; the unit resource information includes: the sum of the idle resource information of each server in the unit;
a detecting unit, configured to check whether the target resource information and the unit resource information match a set resource condition when the service component has a vulnerability, where the resource condition is used to limit the target resource information required by the service component and the unit resource information of the unit to which the service component belongs; and if so, determining that the service component has a vulnerability.
According to the above technical solution, by obtaining the component parameters of the service component and the unit resource information of the unit in the cluster and analyzing the component parameters and the unit resource information, whether the service component has a vulnerability, and which vulnerability it has, can be determined. A maintenance basis can thereby be provided for maintenance personnel.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of a cluster detection method provided in the present application;
FIG. 2 is a schematic diagram of the components of a cluster provided herein;
FIG. 3 is a flowchart of an implementation of step 103 provided by an embodiment of the present application;
FIG. 4 is a flow chart of another cluster detection method provided herein;
fig. 5 is a flowchart of another cluster detection method provided in the present application;
fig. 6 is a flow chart of unit resource prediction provided in the embodiment of the present application;
FIG. 7 is a schematic diagram of a storage resource available to a unit according to an embodiment of the present disclosure;
FIG. 8 is a flow chart of another unit resource prediction provided in the present application;
FIG. 9 is a schematic diagram of computing resources available to a unit according to an embodiment of the present application;
FIG. 10 is a flow diagram illustrating a health analysis of a service component provided by an embodiment of the present application;
fig. 11 is a structural diagram of a cluster detection apparatus provided in the present application;
fig. 12 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "upon" or "when" or "in response to determining", depending on the context.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in detail below with reference to the accompanying drawings and specific embodiments.
Referring to fig. 1, fig. 1 is a flowchart of a cluster detection method provided in the present application. In one embodiment, the flow may be applied to one of the servers in the cluster. As an embodiment, the server may be a server that is pre-designated according to actual needs (referred to as a designated server for short).
As shown in fig. 1, the process may include the following steps:
Step 101, the designated server acquires component parameters of a service component from the cluster according to acquired service component installation and deployment information; the component parameters include: target resource information required by the service component.
In one example, the service component installation and deployment information may be configured in advance on the above-mentioned designated server. The service component installation and deployment information here includes which servers the service component is deployed on. Based on this, as described in step 101, the designated server may determine, from the cluster, the servers on which the service component is deployed according to the configured service component installation and deployment information, and obtain the component parameters of the service component from the determined servers.
Referring to fig. 2, fig. 2 is a schematic diagram of a cluster; the cluster in step 101 includes a plurality of servers for providing big data services. Taking the service component as HDFS (a distributed file system) as an example, the designated server may determine the servers on which the HDFS is deployed according to the configured HDFS installation and deployment information, and then access those servers to obtain the component parameters of the HDFS.
In one example, as described in step 101, the component parameters may include: target resource information required by the service component. In one embodiment, the target resource information required by the service component may be the resource information required for installing and running the service component.
In one example, the target resource information here may include: storage resource information, such as disk capacity and memory capacity, and/or computing resource information, such as CPU utilization.
Still taking the service component as the HDFS as an example, if the HDFS requires at least 1G of disk storage space during operation, the target resource information required by the HDFS may be at least 1G of disk storage space.
Step 102, the designated server obtains unit resource information of a unit in the cluster, where the unit includes at least one server and the service component is deployed on the servers in the unit.
As an embodiment, the servers in the cluster may be divided into at least one group, each group including at least one server. Fig. 2 shows, by way of example, the units in a cluster.
The service component installation and deployment information, as described above, may also include an identifier of the unit in which the service component is located. Based on this, in step 102, the designated server may determine the unit by using the unit identifier in the service component installation and deployment information, and then obtain the unit resource information of that unit. Referring to fig. 2, in one example, if the unit identifier is unit 1 and the service component is component 1, the designated server obtains the unit resource information of unit 1.
In one example, the unit resource information here may be: the sum of the idle resource information of each server in the unit. The sum of the idle resource information may be the sum of the free storage resources, such as free disk capacity and free memory capacity, of each server in the unit, and/or the sum of the idle computing resources, such as the unused CPU rate, of each server in the unit.
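For illustration only, the following Python sketch shows one way such unit resource information could be aggregated; the server list, the metric field names, and the collect_server_free_resources helper are hypothetical placeholders rather than part of this application.

```python
# Hypothetical sketch: aggregate the free resources of every server in a unit.
# The field names and the way per-server metrics are collected are assumptions.

def collect_server_free_resources(server):
    """Placeholder: in practice this would query the server's monitoring agent."""
    return {
        "free_disk_gb": server["free_disk_gb"],
        "free_memory_gb": server["free_memory_gb"],
        "idle_cpu_ratio": server["idle_cpu_ratio"],
    }

def unit_resource_info(servers_in_unit):
    """Sum the free resource information of each server in the unit."""
    total = {"free_disk_gb": 0.0, "free_memory_gb": 0.0, "idle_cpu_ratio": 0.0}
    for server in servers_in_unit:
        free = collect_server_free_resources(server)
        for key in total:
            total[key] += free[key]
    return total

if __name__ == "__main__":
    unit_1 = [
        {"free_disk_gb": 120.0, "free_memory_gb": 16.0, "idle_cpu_ratio": 0.4},
        {"free_disk_gb": 80.0, "free_memory_gb": 8.0, "idle_cpu_ratio": 0.6},
    ]
    print(unit_resource_info(unit_1))
```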
Step 103, the designated server checks whether the target resource information and the unit resource information are matched with a set resource condition when the service component has a vulnerability, where the resource condition is used for limiting the target resource information required by the service component and the unit resource information of the unit to which the service component belongs; and if so, determines that the service component has a vulnerability.
As an embodiment, the service component having a vulnerability at least includes: a vulnerability that affects cluster security, and/or a situation in which improper configuration of the service component makes the cluster unstable (for example, performance risks to cluster robustness and reliability).
To make improper service component configuration easier to understand, the following examples describe cases of improper service component configuration:
Take the number of damaged disks tolerated by a data node (DataNode) of the HDFS service component (the dfs. configuration item) as an example. If the number of damaged disks tolerated by the DataNode is configured to be 0, then even if the cluster has no abnormality at this time, this does not conform to the big data usage concept, and the service component is considered to be improperly configured (specifically, the DataNode's tolerated number of damaged disks). The reason is that the big data concept is to use a collection of servers for data storage and computation, so that a single abnormality does not affect the use of the whole cluster; if the number of damaged disks tolerated by the DataNode is configured to be 0, a single damaged disk affects the use of the whole cluster, and the configuration is therefore considered improper.
Take the directory configuration of the data node (DataNode) as another example of service component configuration: if the current disk array is a Redundant Array of Independent Disks (RAID) composed of a plurality of disks (for example, 5 disks), and the DataNode directory configuration is set to only one directory, the service component configuration is considered improper. The reason is that a plurality of directories, one per disk, can be configured for the DataNode; under the multi-copy mechanism of the DataNode, an abnormality on any single disk would then not affect the use of the DataNode. If the DataNode directory configuration is set to only one directory, an abnormality on the configured directory brings a risk to the whole cluster, so the service component configuration (specifically, the DataNode directory configuration) is considered improper. In a specific implementation, there are many ways to implement step 103; one of them is illustrated in fig. 3 below and is not described here for the moment.
Thus, the flow shown in fig. 1 is completed.
As can be seen from the flow shown in fig. 1, in the embodiment of the present application, the target resource information required by a service component and the unit resource information of the unit in which the service component is deployed are respectively obtained, and whether the service component has a vulnerability is checked according to the target resource information and the unit resource information, so as to avoid security risks to the cluster and/or performance problems, such as reduced robustness and reliability of the cluster, caused by improper configuration of the service component.
The following describes, by way of example, the above step 103:
In one example, step 103 can be implemented by means of a component parameter vulnerability feature library. The component parameter vulnerability feature library may be preset by maintenance personnel according to experience, or may be dynamically set by the designated server according to previously executed vulnerability detection operations; the present application is not particularly limited in this respect.
Referring to fig. 3, fig. 3 is a flowchart of an implementation of step 103 provided by the present application. As shown in fig. 3, the process may include the following steps:
Step 301, the designated server uses the component identifier of the service component as a keyword to search a preset component parameter vulnerability feature library for a vulnerability feature entry containing the keyword, where the vulnerability feature entry includes the service component identifier and the resource condition when the service component has a vulnerability.
Step 302, the designated server checks whether the target resource information and the unit resource information are matched with the resource condition in the vulnerability feature entry; if so, the designated server determines that the target resource information and the unit resource information are matched with the resource condition, and if not, it determines that they are not matched with the resource condition.
Here, if the target resource information and the unit resource information are matched with the resource condition, it is determined that the service component has a vulnerability.
In one example, the resource condition in the vulnerability feature entry is the basis for determining a vulnerability. For example, the resource condition in the vulnerability feature entry may require the target resource information to meet a certain condition and the unit resource information to meet a certain condition; once the target resource information and the unit resource information both meet their corresponding conditions, they are considered to be matched with the resource condition, and the service component is therefore considered to have a vulnerability.
As a simple example, suppose the target resource information is a required disk capacity of 10G and the unit resource information is a free disk capacity of 1G; the resource condition in the vulnerability feature entry may be: the target disk capacity is larger than the unit disk capacity. Since the target disk capacity (10G) is far greater than the unit disk capacity (1G), the resource condition in the vulnerability feature entry is met, i.e., the service component is considered to have a vulnerability.
So far, how the designated server checks whether the target resource information and the unit resource information are matched with the set resource condition when the service component has a vulnerability has been described through the flow shown in fig. 3. It should be noted that fig. 3 is only one example of implementing step 103 and is not intended to be limiting.
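For illustration only, a minimal Python sketch of the lookup-and-match logic of steps 301 and 302 is given below; the entry structure, the condition format, and the resource field names are assumptions made for this sketch, not the actual format of the component parameter vulnerability feature library.

```python
# Hypothetical sketch of steps 301 and 302: look up a vulnerability feature entry by
# component identifier and check the resource condition it defines. The entry layout
# and field names below are assumptions, not the actual feature library format.

VULN_FEATURE_LIBRARY = {
    "HDFS": {
        "description": "required disk capacity exceeds the unit's free disk capacity",
        "condition": lambda target, unit: target["disk_gb"] > unit["free_disk_gb"],
    },
}

def check_component(component_id, target_resource, unit_resource):
    entry = VULN_FEATURE_LIBRARY.get(component_id)          # step 301: search by key
    if entry is None:
        return False, "no feature entry for this component"
    if entry["condition"](target_resource, unit_resource):  # step 302: condition matched?
        return True, entry["description"]
    return False, "resource condition not matched"

if __name__ == "__main__":
    # The component needs 10G of disk while the unit only has 1G free: vulnerability.
    print(check_component("HDFS", {"disk_gb": 10}, {"free_disk_gb": 1}))
```

The lambda-based condition is only one possible encoding; a real feature library could equally store thresholds or rule expressions.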
The above description is made by taking as an example that the component parameter includes target resource information required by the service component. In another example, the component parameters may also include other parameters related to the service component, such as the number of directories available under the service component.
When the component parameters include the number of available directories, how to detect the vulnerability of the service component is described with reference to fig. 4, and other situations are similar to fig. 1 or fig. 4, and are not described in detail here.
Referring to fig. 4, fig. 4 is a flowchart of another cluster detection method provided in the present application. As shown in fig. 4, the process may include the following steps:
Step 401, the designated server searches the component parameter vulnerability feature library for the vulnerability feature entry corresponding to the service component.
This step 401 is similar to step 301 and is not described here again.
Step 402, the designated server checks whether the number of available directories meets the available-directory-number requirement in the vulnerability feature entry; if so, it determines that the service component has a vulnerability, and if not, it determines that the service component does not have the vulnerability.
For example, the available-directory-number requirement in the vulnerability feature entry may be: fewer than 2 available directories. If the number of available directories in the component parameters is 1, which is fewer than 2, the check finds that the number of available directories meets the requirement in the vulnerability feature entry, and the service component has a vulnerability.
The flow shown in fig. 4 is completed.
Through the flow shown in fig. 4, how to detect the vulnerability of the service component based on the number of the available directories when the component parameters of the service component include the number of the available directories is realized.
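For illustration only, a small sketch of the check in steps 401 and 402; the entry format and the threshold of 2 directories are assumptions taken from the example above.

```python
# Hypothetical sketch of steps 401 and 402: compare the configured number of available
# directories against the requirement stored in a vulnerability feature entry.

DIRECTORY_FEATURE_ENTRY = {"component": "HDFS DataNode", "vulnerable_if_fewer_than": 2}

def has_directory_vulnerability(available_directories, entry=DIRECTORY_FEATURE_ENTRY):
    return available_directories < entry["vulnerable_if_fewer_than"]

if __name__ == "__main__":
    print(has_directory_vulnerability(1))  # True: a single directory is flagged
    print(has_directory_vulnerability(5))  # False: multiple directories configured
```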
In a specific implementation, the service component generates a corresponding running log at runtime. As an embodiment, the present application may further detect whether the service component has a vulnerability according to the running log of the service component, as described below with the flow shown in fig. 5:
referring to fig. 5, fig. 5 is a flowchart of another cluster detection method provided in the present application. As shown in fig. 5, the process may include the following steps:
step 501, the appointed server obtains the running log of the service component according to the obtained service component log path information.
In one example, the service component log path information may be previously configured in the above-described specified server. Based on this, as described in step 501, the designated server may obtain the operation log of the service component according to the service component log path information. Here, the operation log of the service component records the entire operation condition of the service component.
Step 502, the designated server determines whether the service component has a bug according to a fault identifier for indicating a fault in the operation log.
As an embodiment, the fault identifier in this step is an "error" identifier in the running log, and when the "error" identifier is found to exist in the log, it is determined that the service component has a vulnerability. The corresponding error information under the error identifier can be recorded as vulnerability information.
The flow shown in fig. 5 is completed.
Through the flow shown in fig. 5, how to detect whether a vulnerability exists in a service component according to the running log of the service component is realized.
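For illustration only, the sketch below shows one way such a log scan could look; the log path and the plain substring match on "error" are assumptions made for this sketch.

```python
# Hypothetical sketch of steps 501 and 502: read the component's running log from a
# configured path and collect the lines carrying the "error" fault identifier.

def scan_running_log(log_path):
    vulnerability_info = []
    with open(log_path, "r", encoding="utf-8", errors="replace") as log_file:
        for line in log_file:
            if "error" in line.lower():                  # fault identifier found
                vulnerability_info.append(line.strip())  # record as vulnerability info
    return vulnerability_info

if __name__ == "__main__":
    findings = scan_running_log("/var/log/hdfs/datanode.log")  # placeholder path
    print(f"{len(findings)} suspected vulnerability entries found")
```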
It should be noted that, as an embodiment, after a vulnerability of a service component is detected by any one of the flows in fig. 3 to fig. 5, a vulnerability report of the service component may also be output, so that maintenance personnel can repair the vulnerability of the service component according to the vulnerability report.
The above describes how to detect vulnerabilities of the service component. The following describes how to predict the time at which unit resources will be exhausted based on the unit resource information:
In one example, the designated server performs the above step 102 periodically or at set times to obtain the unit resource information of the units in the cluster at different time points.
As an embodiment, the unit resource information includes: unit available storage resource information. Here, the unit available storage resource information is the sum of the available storage space of the storage devices on all servers in the unit.
Based on this, the designated server can predict the time at which the storage device resources in the unit will be exhausted according to the unit available storage resource information acquired at the different time points. A storage device here refers to a device used for storage, such as a disk, whose storage space can be exhausted.
In the following, taking the storage device as a disk as an example, how to predict the time of resource exhaustion of the storage device based on the unit resource information is illustrated by fig. 6:
referring to fig. 6, fig. 6 is a flowchart of unit resource prediction provided in the embodiment of the present application. As shown in fig. 6, the process may include the following steps:
step 601, calculating the service cycle data volume stored in the unit in each service cycle in the time period from the first time point to the second time point.
In an example, the first time point is the earliest of the different time points, and the second time point is the latest of the different time points. Fig. 7 illustrates, by way of example, the unit available storage resource information obtained in the time period from the first time point to the second time point. In fig. 7, the first time point is t1 and the second time point is tn.
In one example, the service period may be set according to actual service requirements, which may include N adjacent time points. For example, in fig. 7, t1 to t3 are one traffic period, t4 to t6 are another traffic period, and so on.
In this embodiment, the traffic cycle data amount is the sum of data amounts per unit time in the traffic cycle. Here, the data amount per unit time is a difference between the available storage resource information of the unit acquired at every two adjacent time points of the N adjacent time points in the service cycle.
Based on this, through step 601, the service period data volume of each service period can be calculated; the service period data volumes of the different service periods are recorded in sequence as $PV_1, PV_2, PV_3, \ldots, PV_j$.
Step 602, predicting the time of disk resource exhaustion in the unit according to the service period data volume of each service period and the unit available storage resource information acquired at the second time point.
In actual work, the change of the data volume of the same unit in different service periods is usually smooth. Based on this, as an embodiment, before step 602, the service period data volume of each service period may be further preprocessed to eliminate the service period data volumes that do not meet the requirement of smooth change.
As an embodiment, preprocessing the service period data volume of each service period may include:
Step a1, calculating the variation amplitude of each service period data volume. As an embodiment, the variation amplitude of a service period data volume can be represented by the sine of the tangent vector of the curve between that service period data volume and the service period data volume of the next service period.
If the service period data volume of the jth service period is $PV_j$ and that of the (j+1)th service period is $PV_{j+1}$, the variation amplitude of the jth service period data volume (denoted $\sin_j$) is calculated from these two values. By analogy, the variation amplitude of each service period data volume can be obtained, namely $\sin_1, \sin_2, \sin_3, \ldots, \sin_j$.
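The concrete formula for $\sin_j$ is only referenced above; a minimal reconstruction, assuming adjacent service periods are treated as unit-spaced points on the data-volume curve (so $\sin_j$ is the sine of the angle whose tangent is the per-period change), would be:
$$\sin_j = \frac{PV_{j+1} - PV_j}{\sqrt{1 + \left(PV_{j+1} - PV_j\right)^{2}}}$$
Under this reading, $\sin_j$ stays near 0 when consecutive service periods store similar data volumes and approaches ±1 when the change is abrupt.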
Step a2, removing the service period data quantity which does not satisfy the condition from the service period data quantity of all service periods according to the variation amplitude of each service period data quantity.
Assuming the variation amplitudes of the service period data volumes are still $\sin_1, \sin_2, \sin_3, \ldots, \sin_j$, step a2 may include:
Step b1, calculating the average value P of the variation amplitudes $\sin_1, \sin_2, \sin_3, \ldots, \sin_j$ of the currently remaining service period data volumes;
Step b2, checking whether the difference between the currently calculated average value P and the previously calculated average value is less than a preset value; if so, ending the current process, and if not, executing step b3.
Step b3, selecting, from $\sin_1, \sin_2, \sin_3, \ldots, \sin_j$, the target value whose difference from P has the largest absolute value; deleting, from $PV_1, PV_2, PV_3, \ldots, PV_j$, the service period data volume corresponding to the target value; deleting the target value from $\sin_1, \sin_2, \sin_3, \ldots, \sin_j$; and returning to step b1.
So far, steps b1, b2 and b3 describe how to remove, according to the variation amplitude of each service period data volume, the service period data volumes that do not meet the condition from the service period data volumes of all service periods.
After the service period data volumes that do not meet the condition are removed from the service period data volumes of all service periods according to the variation amplitude of each service period data volume in step a2, step a3 is executed:
Step a3, predicting the time of disk resource exhaustion in the unit according to the remaining service period data volumes and the unit available storage resource information acquired at the second time point. Here, suppose the remaining service period data volumes are $PV_1, PV_2, PV_3, \ldots, PV_n$.
As an embodiment, predicting the time of disk resource exhaustion in the unit according to the remaining service period data volumes and the unit available storage resource information acquired at the second time point may include:
Step c1, calculating the average value (denoted PVA) of the service period data volumes. Recording the total number of the remaining service period data volumes as n, the average value PVA of the service period data volumes can be calculated by the following formula:
$$PVA = \frac{PV_1 + PV_2 + \cdots + PV_n}{n}$$
Step c2, calculating the time of disk resource exhaustion in the unit by using the average value PVA of the service period data volumes and the unit available storage resource information (denoted DC) acquired at the second time point.
Assuming that each service period comprises x unit times, the time T until the disk resources in the unit are exhausted can be calculated by the following formula:
$$T = \frac{DC}{PVA} \times x$$
so far, the steps c1 and c2 realize that the time of the disk resource exhaustion in the unit is predicted according to the existing service period data volume and the available storage resource information of the unit acquired at the second time point.
Thus, the description of the flow shown in fig. 6 is completed.
Through the flow shown in fig. 6, the time when the storage device resources will be exhausted is predicted. Based on the predicted time, the capacity of the storage devices can be expanded in time, which improves the reliability of the unit.
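For illustration only, the Python sketch below strings the fig. 6 flow together: per-period data volumes (step 601), outlier removal by variation amplitude (steps a1 to a2, with the b1 to b3 loop), and exhaustion-time estimation (steps c1 and c2). The sample data, the unit-spacing assumption behind the sine formula, and the convergence threshold epsilon are all assumptions made for this sketch.

```python
import math

# Hypothetical end-to-end sketch of the fig. 6 prediction flow; all data and
# parameters are assumptions made for illustration.

def period_data_volumes(samples, points_per_period):
    """Step 601: sum of the per-unit-time drops in available storage within each period."""
    volumes = []
    for start in range(0, len(samples) - points_per_period + 1, points_per_period):
        period = samples[start:start + points_per_period]
        volumes.append(sum(period[i] - period[i + 1] for i in range(len(period) - 1)))
    return volumes

def variation_amplitudes(volumes):
    """Step a1: sine of the tangent angle between consecutive period volumes (assumed unit spacing)."""
    return [
        (volumes[j + 1] - volumes[j]) / math.sqrt(1 + (volumes[j + 1] - volumes[j]) ** 2)
        for j in range(len(volumes) - 1)
    ]

def remove_outliers(volumes, epsilon=0.01):
    """Steps b1 to b3: iteratively drop the period whose amplitude deviates most from the mean."""
    previous_mean = None
    while len(volumes) > 2:
        sines = variation_amplitudes(volumes)
        mean = sum(sines) / len(sines)
        if previous_mean is not None and abs(mean - previous_mean) < epsilon:
            break
        previous_mean = mean
        worst = max(range(len(sines)), key=lambda j: abs(sines[j] - mean))
        del volumes[worst]                     # drop the corresponding period volume
    return volumes

def time_to_exhaustion(volumes, free_storage_now, units_per_period):
    """Steps c1 and c2: T = (DC / PVA) * x, expressed in unit times."""
    pva = sum(volumes) / len(volumes)
    return (free_storage_now / pva) * units_per_period

if __name__ == "__main__":
    # Assumed samples of the unit's available storage (GB) taken at equal intervals.
    samples = [1000, 990, 981, 970, 961, 950, 941, 930, 880, 921]
    volumes = remove_outliers(period_data_volumes(samples, points_per_period=3))
    print("predicted unit times until disk exhaustion:",
          round(time_to_exhaustion(volumes, free_storage_now=900, units_per_period=3), 1))
```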
The above description takes the case where the unit resource information includes the unit available storage resource information as an example. In another embodiment, the unit resource information may further include: unit available computing resource information and/or unit available memory resource information. Here, the unit available computing resource information is the sum of the available CPU utilization of all servers in the unit, and the unit available memory resource information is the sum of the available memory resources of all servers in the unit.
In the following, the description takes the case where the unit resource information includes the unit available computing resource information as an example; the case of the unit available memory resource information is similar and is not described again:
referring to fig. 8, fig. 8 is a flowchart of another unit resource prediction provided in the embodiment of the present application. As shown in fig. 8, the process may include the following steps:
step 801, in the unit-available computing resource information acquired by the designated server at each different time point, counting the target unit-available computing resource information of which the unit-available computing resource information exceeds the set threshold.
In one example, the first time point is an earliest time point among the different time points, and the second time point is a latest time point among the different time points. Fig. 9 illustrates, by way of example, a time period from a first time point to a second time point, where t1 is the first time point and tn is the second time point, and each time point corresponds to the information about the available computing resources of the crew group acquired at the time point.
In one example, the information of the computing resources available to the statistical target group may include: and counting the time points (marked as target time points) of the corresponding target unit which can use the computing resource information.
Step 802, if the proportion of the computing resource information that can be used by the target unit in the obtained computing resource information that can be used by all units exceeds the set proportion, it is determined that capacity expansion is needed.
In one example, all of the crew available computing resource information may be represented by a total number of different time points between the first time and the second time. Therefore, the proportion of the number of the target time points to the total number of different time points can be calculated, so that the proportion of the computing resource information which can be used by the target unit and is occupied in the obtained computing resource information which can be used by all the units can be calculated.
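For illustration only, a small sketch of this threshold-and-proportion check; the sample values, the threshold of 0.8, and the proportion of 0.3 are placeholders chosen for the sketch.

```python
# Hypothetical sketch of steps 801 and 802: count the time points at which the unit's
# computing-resource value exceeds a set threshold and decide on capacity expansion
# from their proportion. The samples, threshold and proportion are placeholders.

def needs_expansion(samples, threshold=0.8, max_ratio=0.3):
    target_points = [value for value in samples if value > threshold]   # step 801
    return len(target_points) / len(samples) > max_ratio                # step 802

if __name__ == "__main__":
    cpu_samples = [0.55, 0.83, 0.91, 0.62, 0.87, 0.79, 0.94, 0.88]  # assumed per-time-point values
    print("capacity expansion needed:", needs_expansion(cpu_samples))
```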
The flow shown in fig. 8 is completed.
Through the flow shown in fig. 8, how to expand the available computing resources of the unit in time when the unit resource information includes the unit available computing resource information is realized.
The above describes how to predict the unit resource exhaustion time based on the unit resource information. The following describes how to determine the health index of the service component based on the running log:
referring to fig. 10, fig. 10 is a flowchart illustrating a health analysis of a service component according to an embodiment of the present disclosure.
The process may include the steps of:
step 1001, the designated server identifies the fault information associated with each fault identifier in the operation log.
In one example, the fault identifier is an "error" identifier in the operation log, and the designated server can identify fault information from corresponding error information under the "error" identifier. The identified fault information includes a fault type.
Step 1002, the designated server determines the fault level matched with each fault information.
In one example, for each piece of fault information, the designated server searches a preset service component fault library for a service component fault entry containing the fault information, using the fault information as a keyword.
And if the service assembly fault table item matched with the fault information is found, determining the fault level in the service assembly fault table item as the fault level matched with the fault information.
If the service component fault entry matched with the fault information is not found, setting a matched fault level for the fault information according to the fault type in the fault information and the indication information related to the fault information recovery in the operation log.
As an embodiment, here, setting a matching failure level for the failure information according to the failure type in the failure information and the indication information on the failure information recovery in the operation log may include:
identifying a fault type in the fault information;
If the fault type is the first type, time information indicating that the fault corresponding to the fault type has recovered to normal is searched for in the running log. If it is determined, according to the time information, that the fault corresponding to the fault type recovered to normal within a specified time, the level is determined to be the first level; if it is determined, according to the time information, that the fault corresponding to the fault type did not recover to normal within the specified time, the level is the second level.
Here, the first type is used to indicate that a service component is abnormal but the abnormality does not affect other service components, such as for HDFS components, the first type of failure may be a failure of a DataNode.
Continuing with the example of the HDFS component, assuming that its failure type is a first type, the failure information may be determined to be of a first level if the failure is recovered within 10 minutes, otherwise, the failure information may be determined to be of a second level.
If the fault type is the second type, time information indicating that the fault corresponding to the fault type has recovered to normal is searched for in the running log. If it is determined, according to the time information, that the fault corresponding to the fault type recovered to normal within the specified time, the level is determined to be the second level; if it is determined, according to the time information, that the fault corresponding to the fault type did not recover to normal within the specified time, the level is the third level.
Here, the second type is used to indicate that a service component is abnormal but the abnormality affects other service components, for example, for an HDFS component, the failure of the second type may be a failure of a NameNode (name node) or a second NameNode (second name node).
Continuing with the example of the HDFS component, assuming that its failure type is of a second type, the failure information may be determined to be of a second level if the failure is recovered within 10 minutes, otherwise, the failure information may be determined to be of a third level.
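For illustration only, the sketch below shows the level-matching fallback of step 1002; the fault library contents, the 10-minute recovery window taken from the examples above, and the fault-type labels are assumptions made for this sketch.

```python
# Hypothetical sketch of step 1002: determine the fault level matched with a piece of
# fault information, falling back to type and recovery-time rules when the preset fault
# library has no matching entry. Library contents and the 10-minute window are assumptions.

FAULT_LIBRARY = {
    "DataNode heartbeat lost": 1,   # assumed pre-configured entry: fault info -> level
}

RECOVERY_WINDOW_MINUTES = 10        # the "specified time" used in the examples above

def fault_level(fault_info, fault_type, recovered_within_minutes):
    if fault_info in FAULT_LIBRARY:                 # library hit: use the stored level
        return FAULT_LIBRARY[fault_info]
    recovered_in_time = (recovered_within_minutes is not None
                         and recovered_within_minutes <= RECOVERY_WINDOW_MINUTES)
    if fault_type == "first":                       # abnormality does not affect other components
        return 1 if recovered_in_time else 2
    if fault_type == "second":                      # abnormality affects other components
        return 2 if recovered_in_time else 3
    raise ValueError("unknown fault type")

if __name__ == "__main__":
    print(fault_level("DataNode heartbeat lost", "first", recovered_within_minutes=3))    # 1
    print(fault_level("NameNode not responding", "second", recovered_within_minutes=25))  # 3
```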
Step 1003, the designated server determines the health index of the service component according to the fault level matched with each fault message.
As an example, the method may calculate a health score of the service component as the health index of the service component according to the fault level matched with each piece of fault information, through the following steps:
step d1, calculate the score for each fault message.
Record the fault level matched with the ith piece of fault information as $F_i$. Assuming there are n pieces of fault information in total, the fault levels corresponding to the pieces of fault information are $F_1, F_2, F_3, \ldots, F_n$. The score of the ith piece of fault information (denoted $S_i$) is calculated from $F_i$; by analogy, the score of each piece of fault information is obtained: $S_1, S_2, S_3, \ldots, S_n$.
Step d2, calculating the average value (denoted Score) of the scores obtained in step d1 as the health score of the service component; the average value of the scores can be calculated by the following formula:
$$Score = \frac{S_1 + S_2 + \cdots + S_n}{n}$$
this results in a health Score for the service component.
So far, through the steps d1 and d2, the health index of the service component is determined according to the fault level matched with each fault information.
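For illustration only, the sketch below averages per-fault scores into a health score as in steps d1 and d2; since the per-fault score formula is only referenced above, the level-to-score mapping used here (100/60/30 for levels 1/2/3) is purely an assumed placeholder and not the formula of the application.

```python
# Hypothetical sketch of steps d1 and d2. The level-to-score mapping below is an
# assumed placeholder, not the score formula of the application.

ASSUMED_LEVEL_SCORES = {1: 100.0, 2: 60.0, 3: 30.0}

def health_score(fault_levels):
    scores = [ASSUMED_LEVEL_SCORES[level] for level in fault_levels]  # step d1
    return sum(scores) / len(scores)                                  # step d2

if __name__ == "__main__":
    print(health_score([1, 1, 2, 3]))  # 72.5
```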
The flow shown in fig. 10 is completed.
Through the flow shown in fig. 10, the determination of the health index of the service component based on the running log of the service component is realized, providing a reference for maintenance personnel when maintaining the service component.
The method provided by the present application is described above, and the device provided by the present application is described below:
referring to fig. 11, fig. 11 is a structural diagram of a cluster detection apparatus provided in the present application. As shown in fig. 11, the cluster detection apparatus includes a component parameter obtaining unit, a unit resource information obtaining unit, and a detection unit.
In an example, the component parameter obtaining unit 1101 is configured to obtain a component parameter of a service component from a cluster according to the obtained service component installation and deployment information; the above component parameters include: target resource information required by the service component.
A unit resource information obtaining unit 1102, configured to obtain unit resource information of a unit in the cluster, where the unit includes at least one server, and the server in the unit deploys the service component; the unit resource information includes: the sum of the idle resource information of each server in the unit.
A detecting unit 1103, configured to check whether the target resource information and the unit resource information match a set resource condition when the service component has a vulnerability, where the resource condition is used to limit the target resource information required by the service component and the unit resource information of the unit to which the service component belongs. And if so, determining that the service component has a vulnerability.
As an embodiment, the above apparatus further comprises: a component log obtaining unit.
The component log obtaining unit can be used for acquiring the running log of the service component according to the acquired service component log path information.
In this embodiment, the detecting unit 1103 is further configured to determine whether the service component has a vulnerability according to a fault identifier indicating a fault in the running log.
As an embodiment, the above apparatus further comprises: a health indicator determination unit.
The health index determining unit can be used for identifying fault information associated with each fault identifier in the running log;
determining the fault grade matched with each fault information;
and determining the health index of the service component according to the fault grade matched with each fault message.
As an embodiment, the unit resource information acquired by the unit resource information acquiring unit 1102 includes: unit available storage resource information. Here, the unit available storage resource information may be the sum of the available storage space of the storage devices on all servers in the unit.
In this embodiment, the cluster detection apparatus further includes: and a prediction unit.
The prediction unit may be configured to predict a time for which the storage device resource in the unit is exhausted according to the unit available storage resource information of the unit acquired at different time points.
Thus, the description of the apparatus shown in fig. 11 is completed.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
Correspondingly, the embodiment of the application also provides the electronic equipment. Referring to fig. 12, fig. 12 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. As shown in fig. 12, the electronic device may include a processor 1201 and a memory 1202. Wherein the memory 1202 has stored thereon a computer program; the processor 1201 may perform the cluster detection method described above by executing the program stored on the memory 1202.
The memory 1202, as referred to herein, may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and the like. For example, memory 1202 may be: RAM (random access memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, dvd, etc.), or similar storage medium, or a combination thereof.
Based on the same application concept as the method described above, the embodiment of the present application further provides a machine-readable storage medium, such as the memory 1202 in fig. 12, where the computer program is executable by the processor 1201 in the electronic device shown in fig. 12 to implement the cluster detection method described above.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Furthermore, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (12)

1. A method for cluster detection, the method comprising:
acquiring component parameters of the service components from the cluster according to the acquired service component installation and deployment information; the component parameters include: target resource information required by the service component;
acquiring unit resource information of a unit in the cluster, wherein the unit comprises at least one server, and the server in the unit deploys the service component; the unit resource information comprises: the sum of the idle resource information of each server in the unit;
checking whether the target resource information and the unit resource information are matched with the set resource condition when the service component has a vulnerability, wherein the resource condition is used for limiting the target resource information required by the service component and the unit resource information of the unit to which the service component belongs; and if so, determining that the service component has a vulnerability.
2. The method according to claim 1, wherein the checking whether the target resource information and the unit resource information match the set resource condition when the service component has a vulnerability includes:
searching vulnerability characteristic table items containing the keywords in a preset component parameter vulnerability characteristic library by taking the component identification of the service component as the keyword; the vulnerability characteristic table entry comprises a service component identifier and the resource condition when the service component has a vulnerability;
checking whether the target resource information and the unit resource information are matched with the resource conditions in the vulnerability characteristic table entry, if so, determining that the target resource information and the unit resource information are matched with the resource conditions, and if not, determining that the target resource information and the unit resource information are not matched with the resource conditions.
3. The method of claim 1, wherein the component parameters further comprise: the number of available directories configured under the service component;
the method further comprises the following steps:
searching a vulnerability characteristic table entry corresponding to the service component in a preset component parameter vulnerability characteristic library;
checking whether the number of the available directories meets the requirement of the number of the available directories in the vulnerability characteristic table entry, if so, determining that the vulnerability exists in the service component, and if not, determining that the vulnerability does not exist in the service component.
4. The method of claim 1, further comprising:
acquiring a running log of the service component according to the acquired log path information of the service component;
and determining whether the service component has a vulnerability according to the fault identification used for indicating the fault in the running log.
5. The method according to any one of claims 1 to 4, wherein the unit resource information comprises: unit available storage resource information; the unit available storage resource information is the sum of the available storage spaces of the storage devices on all servers in the unit;
the method further comprises the following steps: and predicting the time for the resource exhaustion of the storage equipment in the unit according to the available storage resource information of the unit, which is acquired at different time points.
6. The method of claim 5, wherein the predicting the time for the storage device resource exhaustion in the unit according to the unit available storage resource information acquired at different time points comprises:
calculating the data volume of a service cycle stored in the unit in each service cycle in a time period from a first time point to a second time point, wherein the first time point is the earliest time point of the different time points, the second time point is the latest time point of the different time points, the service cycle comprises N adjacent time points, and the data volume of the service cycle is the sum of the data volume of each unit time in the service cycle; the data volume of each unit time is the difference between the usable storage resource information of the unit acquired at every two adjacent time points in the N adjacent time points in the service period;
and predicting the time for the resource exhaustion of the storage equipment in the unit according to the service period data volume of each service period and the unit available storage resource information acquired at the second time point.
7. The method of claim 4, further comprising:
identifying the fault information associated with each fault identifier in the running log;
determining the fault level matched with each piece of fault information;
and determining the health index of the service component according to the fault level matched with each piece of fault information.
8. The method of claim 7, wherein determining the fault level matched with each piece of fault information comprises:
for each piece of fault information, searching, in a preset service component fault library, for a service component fault table entry containing the fault information, using the fault information as the keyword;
if the service component fault table entry is found, determining the fault level in the service component fault table entry as the fault level matched with the fault information;
and if it is not found, setting a matched fault level for the fault information according to the fault type in the fault information and the indication information related to recovery of the fault in the running log.
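An illustrative sketch of the fault-level matching and health index of claims 7 and 8; the fault library contents, the level values, and the 100-minus-penalty index below are assumed numbers, not taken from the patent:

```python
# The library entries, level values and penalty weights are illustrative assumptions only.
service_component_fault_library = {
    "Disk full on data directory": 3,   # fault information -> fault level
    "Connection refused by peer": 2,
}

def fault_level(fault_info, recovered_in_log):
    level = service_component_fault_library.get(fault_info)  # keyword lookup by fault information
    if level is not None:
        return level
    # Not in the library: assign a level from the fault itself and from whether the
    # running log indicates the fault has recovered (one reading of claim 8).
    return 1 if recovered_in_log else 2

def health_index(faults):
    """'faults' is a list of (fault_information, recovered_in_log) pairs taken from the log."""
    score = 100
    for info, recovered in faults:
        score -= 10 * fault_level(info, recovered)
    return max(score, 0)
```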
9. A cluster detection apparatus, comprising:
a component parameter acquiring unit, configured to acquire component parameters of a service component from the cluster according to acquired service component installation and deployment information; the component parameters comprise: target resource information required by the service component;
a unit resource information acquiring unit, configured to acquire unit resource information of a unit in the cluster, wherein the unit comprises at least one server, and a server in the unit deploys the service component; the unit resource information comprises: the sum of the idle resource information of each server in the unit;
and a detection unit, configured to check whether the target resource information and the unit resource information match a set resource condition under which the service component has a vulnerability, wherein the resource condition is used for limiting the target resource information required by the service component and the unit resource information of the unit to which the service component belongs; and if so, determine that the service component has a vulnerability.
10. The apparatus of claim 9,
the apparatus further comprises: a component log acquiring unit, configured to acquire the running log of the service component according to acquired service component log path information;
and the detection unit is further configured to determine whether the service component has a vulnerability according to a fault identifier used for indicating a fault in the running log.
11. The apparatus of claim 9, wherein the unit resource information comprises: unit available storage resource information; the unit available storage resource information is the sum of the available storage space of the storage devices on all servers in the unit;
the apparatus further comprises:
and a prediction unit, configured to predict the time at which the storage device resources in the unit will be exhausted according to the unit available storage resource information acquired at different time points.
12. The apparatus of claim 10, further comprising:
a health index determining unit, configured to identify the fault information associated with each fault identifier in the running log;
determine the fault level matched with each piece of fault information;
and determine the health index of the service component according to the fault level matched with each piece of fault information.
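Finally, a minimal skeleton of the unit decomposition in apparatus claims 9 to 12; the class and attribute names below are illustrative only and are not taken from the patent text:

```python
# Minimal skeleton mirroring the unit decomposition of apparatus claims 9-12.
class ClusterDetectionApparatus:
    def __init__(self, component_parameter_unit, unit_resource_unit, detection_unit,
                 component_log_unit=None, prediction_unit=None, health_index_unit=None):
        self.component_parameter_unit = component_parameter_unit  # acquires component parameters (claim 9)
        self.unit_resource_unit = unit_resource_unit              # acquires unit resource information (claim 9)
        self.detection_unit = detection_unit                      # matches resource conditions (claims 9-10)
        self.component_log_unit = component_log_unit              # acquires running logs (claim 10)
        self.prediction_unit = prediction_unit                    # predicts storage exhaustion (claim 11)
        self.health_index_unit = health_index_unit                # computes the health index (claim 12)
```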
CN201910940756.7A 2019-09-30 2019-09-30 Cluster detection method and device Pending CN110677419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910940756.7A CN110677419A (en) 2019-09-30 2019-09-30 Cluster detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910940756.7A CN110677419A (en) 2019-09-30 2019-09-30 Cluster detection method and device

Publications (1)

Publication Number Publication Date
CN110677419A true CN110677419A (en) 2020-01-10

Family

ID=69080570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910940756.7A Pending CN110677419A (en) 2019-09-30 2019-09-30 Cluster detection method and device

Country Status (1)

Country Link
CN (1) CN110677419A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408302A (en) * 2017-08-16 2019-03-01 阿里巴巴集团控股有限公司 A kind of fault detection method, device and electronic equipment
CN108874640A (en) * 2018-05-07 2018-11-23 北京京东尚科信息技术有限公司 A kind of appraisal procedure and device of clustering performance
CN109308245A (en) * 2018-09-07 2019-02-05 郑州市景安网络科技股份有限公司 A kind of server resource method for early warning, device, equipment and readable storage medium storing program for executing

Similar Documents

Publication Publication Date Title
JP4318643B2 (en) Operation management method, operation management apparatus, and operation management program
CN112579327B (en) Fault detection method, device and equipment
CN107544832B (en) Method, device and system for monitoring process of virtual machine
JP2019523952A (en) Streaming data distributed processing method and apparatus
CN105573859A (en) Data recovery method and device of database
JP6528669B2 (en) Predictive detection program, apparatus, and method
CN102446217A (en) Complex event processing apparatus and complex event processing method
US9658908B2 (en) Failure symptom report device and method for detecting failure symptom
JP6411696B1 (en) Version control system and version control method
US9870314B1 (en) Update testing by build introspection
CN106570091B (en) Method for enhancing high availability of distributed cluster file system
CN115756955A (en) Data backup and data recovery method and device and computer equipment
CN110018932B (en) Method and device for monitoring container magnetic disk
JP6252309B2 (en) Monitoring omission identification processing program, monitoring omission identification processing method, and monitoring omission identification processing device
CN114721594A (en) Distributed storage method, device, equipment and machine readable storage medium
CN113326064A (en) Method for dividing business logic module, electronic equipment and storage medium
CN112436962A (en) Block chain consensus network dynamic expansion method, electronic device, system and medium
CN109039695B (en) Service fault processing method, device and equipment
CN110677419A (en) Cluster detection method and device
CN111427871A (en) Data processing method, device and equipment
CN111324518A (en) Application association method and device
CN115543918A (en) File snapshot method, system, electronic equipment and storage medium
CN112685390B (en) Database instance management method and device and computing equipment
CN109254880A (en) A kind of method and device handling database delay machine
CN114265813A (en) Snapshot query method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200110)