CN112486771B - Distributed system management method, system, device and medium - Google Patents

Publication number
CN112486771B
Authority
CN
China
Prior art keywords
confidence
index
confidence type
node
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011365317.7A
Other languages
Chinese (zh)
Other versions
CN112486771A (en)
Inventor
李晓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202011365317.7A
Publication of CN112486771A
Application granted
Publication of CN112486771B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/4401 Bootstrapping
    • G06F9/4411 Configuring for operating with peripheral devices; Loading of device drivers

Abstract

The invention discloses a distributed system management method, which comprises the following steps: configuring a confidence type table and an index table; establishing a mapping between each confidence type in the confidence type table and one or more indexes in the index table, and establishing a mapping between each index in the index table and one or more nodes; collecting and recording corresponding indexes meeting preset conditions from each node to obtain an index record table; and determining the nodes needing to be processed according to the index record table, the index table and the confidence type table, and processing the nodes needing to be processed according to the processing method configured in the confidence type table. The invention also discloses a system, a computer device and a readable storage medium. The scheme provided by the invention opens the configuration of indexes, confidence types and the corresponding processing schemes to the user, who can configure them independently according to actual requirements or different scenarios, which greatly improves the flexibility of node management.

Description

Distributed system management method, system, device and medium
Technical Field
The present invention relates to the field of distributed systems, and in particular, to a method, system, device, and storage medium for managing a distributed system.
Background
In a distributed system, a fault on a single node may affect the reliability of the whole system. Especially in a large cluster, locating the faulty node is difficult and can consume a large amount of time. For the sake of system reliability, performance monitoring and an alarm mechanism can therefore be designed for the system nodes. However, an alarm mechanism only generates an alarm to remind the user that a node or the system has a problem; the user must then locate the faulty node from the alarm and operate on it manually according to the alarm information, for example by restarting, removing or isolating it, in order to solve the problem and restore the cluster. This process of receiving an alarm, troubleshooting the problem and operating manually is time-consuming, and more so as the cluster grows. During this process the availability of the system, and even the services running on it, may be affected, which limits the fault tolerance of the system. In addition, system alarm mechanisms generally target a single index, so problems that only show up in a combination of indexes are difficult to find. Alarm thresholds are also generally set high, so if several indexes rise but none of them reaches its alarm threshold, no alarm or processing is triggered.
Disclosure of Invention
In view of the above, in order to overcome at least one aspect of the above problems, an embodiment of the present invention provides a distributed system management method, including:
configuring a confidence type table and an index table;
establishing a mapping between each confidence type in the confidence type table and one or more indexes in the index table, and establishing a mapping between each index in the index table and one or more nodes;
collecting and recording corresponding indexes meeting preset conditions from each node to obtain an index record table;
and determining the nodes needing to be processed according to the index record table, the index table and the confidence type table, and processing the nodes needing to be processed according to a processing method configured in the confidence type table.
In some embodiments, configuring the confidence type table further comprises:
creating a plurality of confidence types in the confidence type table;
configuring, for each of the confidence types, attributes including a level, a period, a data point, and a processing method for a node.
In some embodiments, configuring the index table further comprises:
creating a plurality of indexes in the index table.
In some embodiments, each entry in the index record table includes an index satisfying a preset condition, an occurrence time, and a corresponding node.
In some embodiments, determining the nodes needing to be processed according to the index record table, the index table, and the confidence type table further comprises:
determining confidence types respectively corresponding to a plurality of entries belonging to the same node in the index record table according to the mapping relation between each confidence type in the confidence type table and one or more indexes in the index table;
and acquiring a plurality of attributes of the corresponding confidence types according to the confidence type table so as to judge whether the same node needs to be processed or not according to the attributes.
In some embodiments, determining whether the same node needs to be processed according to the attribute further includes:
judging whether, within the period corresponding to a given confidence type, the number of entries corresponding to that confidence type reaches the data point configured for that confidence type;
in response to the data point being reached, determining that the same node needs to be processed.
In some embodiments, the processing the node to be processed according to the processing method configured in the confidence type table further includes:
and in response to the judgment results of a plurality of confidence types all indicating that the same node needs to be processed, selecting the processing method corresponding to the highest-level confidence type among them to process the same node.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a distributed system management system, including:
a configuration module configured to configure a confidence type table and an index table;
a mapping module configured to establish a mapping between each confidence type in the confidence type table and one or more indexes in the index table, and to establish a mapping between each index in the index table and one or more nodes;
an acquisition module configured to collect and record corresponding indexes meeting preset conditions from each node to obtain an index record table;
a determining module configured to determine the nodes needing to be processed according to the index record table, the index table and the confidence type table, and to process the nodes needing to be processed according to the processing method configured in the confidence type table.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer apparatus, including:
at least one processor; and
a memory storing a computer program operable on the processor, wherein the processor executes the program to perform any of the steps of the distributed system management method as described above.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of any of the distributed system management methods described above.
The invention has one of the following beneficial technical effects: the scheme provided by the invention opens the configuration of indexes, confidence types and the corresponding processing schemes to the user, who can configure them independently according to actual requirements or different scenarios, which greatly improves the flexibility of node management.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other embodiments from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a distributed system management method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a distributed system management system according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a computer device provided in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two different entities or two different parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.
According to an aspect of the present invention, an embodiment of the present invention provides a distributed system management method, as shown in fig. 1, which may include the steps of:
S1, configuring a confidence type table and an index table;
S2, establishing a mapping between each confidence type in the confidence type table and one or more indexes in the index table, and establishing a mapping between each index in the index table and one or more nodes;
S3, collecting and recording corresponding indexes meeting preset conditions from each node to obtain an index record table;
S4, determining the nodes needing to be processed according to the index record table, the index table and the confidence type table, and processing the nodes needing to be processed according to the processing method configured in the confidence type table.
The scheme provided by the invention opens the configuration of indexes, confidence types and the corresponding processing schemes to the user, who can configure them independently according to actual requirements or different scenarios, which greatly improves the flexibility of node management.
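Steps S1 to S4 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary layout, the field names other than preset_metric, time and node_name, the threshold values, and the helper names collect and decide are all hypothetical, and the period is assumed to be a window ending at the evaluation time.

```python
from datetime import datetime, timedelta

# S1: confidence type table and index (metric) table. All values are hypothetical.
confidence_type_table = {
    "high": {"level": 5, "period": timedelta(hours=3), "data_points": 2, "action": "restart"},
}
# S2: each index maps to one confidence type and to the nodes it is collected from.
index_table = {
    "memory_usage": {
        "confidence_type": "high",
        "nodes": {"node-1", "node-2"},
        "condition": lambda value: value >= 0.85,  # assumed preset condition
    },
}


def collect(samples):
    """S3: keep only samples whose index meets its preset condition."""
    records = []
    for node, metric, value, time in samples:
        cfg = index_table.get(metric)
        if cfg and node in cfg["nodes"] and cfg["condition"](value):
            records.append({"preset_metric": metric, "time": time, "node_name": node})
    return records


def decide(records, now):
    """S4: return {node: action} for nodes that reach a data point within the period."""
    counts = {}
    for rec in records:
        ctype = index_table[rec["preset_metric"]]["confidence_type"]
        period = confidence_type_table[ctype]["period"]
        if now - period <= rec["time"] <= now:
            key = (rec["node_name"], ctype)
            counts[key] = counts.get(key, 0) + 1
    chosen = {}
    for (node, ctype), count in counts.items():
        cfg = confidence_type_table[ctype]
        if count >= cfg["data_points"]:
            best = chosen.get(node)
            # Keep the highest-level confidence type that fired for this node.
            if best is None or cfg["level"] > confidence_type_table[best]["level"]:
                chosen[node] = ctype
    return {node: confidence_type_table[ctype]["action"] for node, ctype in chosen.items()}
```

In this sketch the action is returned rather than executed, so that the decision logic can be inspected separately from whatever mechanism actually restarts, isolates or removes a node.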
In some embodiments, step S1, configuring a confidence type table, further comprises:
creating a plurality of confidence types in the confidence type table;
configuring, for each of the confidence types, attributes including a level, a period, a data point, and a processing method for a node.
Specifically, the user can define custom confidence types through the confidence type configuration module. The higher the level of a confidence type, the higher the corresponding probability that an abnormality has occurred. For each confidence type, the user also configures attributes including the period, the data point, and the operation or combination of operations to perform on the node, such as restarting, isolating or removing it. The attributes of each confidence type may be recorded in a confidence type table (confidence_type_table).
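One possible in-memory shape for such a confidence type table is sketched below. The class name ConfidenceType, the field names and the two sample rows are assumptions made for illustration; the source only states that each confidence type carries a level, a period, a data point and a processing method.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, Tuple


@dataclass(frozen=True)
class ConfidenceType:
    """One row of the confidence type table (field names are hypothetical)."""
    name: str
    level: int                 # higher level = higher probability of an abnormality
    period: timedelta          # time window in which entries are counted
    data_points: int           # number of entries that triggers processing
    actions: Tuple[str, ...]   # operation or combination of operations on the node


# Hypothetical user configuration, keyed by confidence type name.
confidence_type_table: Dict[str, ConfidenceType] = {
    ct.name: ct
    for ct in (
        ConfidenceType("low", 1, timedelta(hours=24), 20, ("isolate",)),
        ConfidenceType("high", 5, timedelta(hours=3), 10, ("restart",)),
    )
}
```

Freezing the dataclass keeps a configured row from being changed accidentally while records are being evaluated; a real system would more likely persist the table in a database.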
In some embodiments, step S1, configuring an index table, further comprises:
creating a plurality of indexes in the index table.
Specifically, the user can configure the indexes of interest in the index configuration module as required, such as the memory occupancy rate, the hard disk read-write rate, and the number of serious errors reported.
In some embodiments, in step S2, a mapping is established between each confidence type in the confidence type table and one or more indexes in the index table, and between each index in the index table and one or more nodes. Specifically, the user may set a corresponding confidence type for each index, and the same confidence type may be assigned to multiple prediction indexes; the corresponding mapping relationship may be stored in the index table (prediction_metrics_table) or in the confidence type table. The user may also set the corresponding indexes for each node, and the same index may be configured for multiple nodes; the corresponding mapping relationship may be stored in the index table.
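The two mappings can be pictured as plain lookup tables. The sketch below is illustrative only: the index names, node names and the helper functions confidence_type_of and metrics_watched_on are hypothetical, and it assumes each index belongs to at most one confidence type.

```python
from typing import Dict, List, Optional, Set

# Confidence type -> indexes (one confidence type may cover several indexes).
confidence_type_to_metrics: Dict[str, Set[str]] = {
    "high": {"memory_usage", "disk_rw_rate"},
    "low": {"critical_error_count"},
}

# Index -> nodes (the same index may be configured for several nodes).
metric_to_nodes: Dict[str, Set[str]] = {
    "memory_usage": {"node-1", "node-2"},
    "disk_rw_rate": {"node-2"},
    "critical_error_count": {"node-1", "node-2", "node-3"},
}


def confidence_type_of(metric: str) -> Optional[str]:
    """Return the confidence type an index is mapped to, or None if unmapped."""
    for ctype, metrics in confidence_type_to_metrics.items():
        if metric in metrics:
            return ctype
    return None


def metrics_watched_on(node: str) -> List[str]:
    """Return the indexes configured for a node, sorted for stable output."""
    return sorted(m for m, nodes in metric_to_nodes.items() if node in nodes)
```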
In some embodiments, each entry in the index record table includes an index satisfying a preset condition, an occurrence time, and a corresponding node.
Specifically, the index record table (metrics_record_table) may be used to track the received index information and the time at which it occurred. After system performance monitoring collects a record of a user-configured index that meets its preset condition, an entry for that index is created in the index record table; the entry includes the index (preset_metric), the time of occurrence (time) and the node on which it occurred (node_name).
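A minimal sketch of how such an entry might be created is given below. The entry fields preset_metric, time and node_name come from the description; the MetricRecord class, the preset_conditions table, its thresholds and the collect helper are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class MetricRecord:
    """One entry of the index record table."""
    preset_metric: str
    time: datetime
    node_name: str


# Hypothetical preset conditions: an entry is recorded only when one holds.
preset_conditions: Dict[str, Callable[[float], bool]] = {
    "memory_usage": lambda value: value >= 0.85,
    "critical_error_count": lambda value: value >= 1,
}

metrics_record_table: List[MetricRecord] = []


def collect(node_name: str, metric: str, value: float, now: datetime) -> Optional[MetricRecord]:
    """Append an entry if the sampled value meets the index's preset condition."""
    condition = preset_conditions.get(metric)
    if condition is None or not condition(value):
        return None  # unconfigured index or condition not met: nothing is recorded
    record = MetricRecord(preset_metric=metric, time=now, node_name=node_name)
    metrics_record_table.append(record)
    return record
```

Note that the raw sampled value is not stored in this sketch, since the description lists only the index, the time and the node as entry fields.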
In some embodiments, step S4, determining a node to be processed according to the index record table, the index table, and the confidence type table, further includes:
determining confidence types respectively corresponding to a plurality of entries belonging to the same node in the index record table according to the mapping relation between each confidence type in the confidence type table and one or more indexes in the index table;
and acquiring a plurality of attributes of the corresponding confidence types according to the confidence type table so as to judge whether the same node needs to be processed or not according to the attributes.
Specifically, the number of entries of each confidence type on each node is obtained by scanning the index record table (metrics_record_table) in combination with the prediction index table (prediction_metrics_table). The nodes whose confidence type counts meet the configured conditions are then filtered out in combination with the confidence type table (confidence_type_table). If only one node meets the conditions, that node is predicted to be the node most likely to have a problem in the system, and it is processed according to the attributes of the confidence type. For example, if a confidence type is configured with a level of 5, a period of 3 hours, a data point of 10 and the operation "restart", then when records of the prediction indexes corresponding to that confidence type occur 10 times on a node within a 3-hour period, that node is restarted. If multiple nodes meet the conditions, they can be compared according to their counts for the different confidence types, and the node with the largest count for the highest-level confidence type is predicted to be the node most likely to have a problem. In this way, the node in the worst condition in the system can be predicted on the basis of the user's configuration and processed automatically according to the node processing method of the corresponding confidence type.
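The comparison between several qualifying nodes can be sketched as a ranking over per-level counts. The function name most_suspect_node and the input layout are hypothetical. The description only says that the node with the largest count at the highest level is chosen; falling back to lower levels and then to node name when counts tie is an added assumption.

```python
from collections import Counter
from typing import Dict, Optional


def most_suspect_node(triggered: Dict[str, Counter]) -> Optional[str]:
    """Pick the node most likely to be faulty.

    `triggered` maps a node name to a Counter of {confidence level: number of
    qualifying entries at that level}. Nodes are compared by their count at the
    highest level first; lower levels only break ties (an assumption).
    """
    if not triggered:
        return None
    levels = sorted({lvl for counts in triggered.values() for lvl in counts}, reverse=True)

    def rank(node: str):
        counts = triggered[node]
        return tuple(counts.get(lvl, 0) for lvl in levels)

    # Iterating over sorted names makes remaining ties resolve to the first name.
    return max(sorted(triggered), key=rank)


# Hypothetical counts for three nodes that all met some configured condition.
example = {
    "node-a": Counter({5: 10, 1: 3}),
    "node-b": Counter({5: 12}),
    "node-c": Counter({1: 40}),
}
```

With the example data, node-b ranks first because it has the most entries at level 5, even though node-c has far more entries overall at level 1.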
In some embodiments, determining whether the same node needs to be processed according to the attribute further includes:
judging whether, within the period corresponding to a given confidence type, the number of entries corresponding to that confidence type reaches the data point configured for that confidence type;
in response to the data point being reached, determining that the same node needs to be processed.
Specifically, for example, suppose the index record table includes 10 entries belonging to the same node. These 10 entries may correspond to one or more confidence types. If the data point of a confidence type is 5, and 5 of the 10 entries correspond to that confidence type and fall within its period, the node is judged to need processing. The 5 entries may correspond to different indexes.
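The check described here amounts to counting entries inside a time window. The sketch below assumes the period is a window that ends at the evaluation time, which the source does not specify; the function name needs_processing and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable


def needs_processing(
    entry_times: Iterable[datetime],
    period: timedelta,
    data_points: int,
    now: datetime,
) -> bool:
    """Return True if enough entries of one confidence type fall within the period.

    `entry_times` holds the occurrence times of the entries on one node that map
    to the confidence type being evaluated; entries may come from different indexes.
    """
    window_start = now - period
    hits = sum(1 for t in entry_times if window_start <= t <= now)
    return hits >= data_points
```

For instance, with a data point of 5 and a 3-hour period, five entries recorded during the last three hours make the function return True, while four such entries plus older ones do not.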
In some embodiments, processing the node to be processed according to the processing method configured in the confidence type table further includes:
and in response to the judgment results of a plurality of confidence types all indicating that the same node needs to be processed, selecting the processing method corresponding to the highest-level confidence type among them to process the same node.
Specifically, when multiple confidence types each determine that a node needs to be processed, the processing method corresponding to the highest-level confidence type among them may be selected, or the processing methods corresponding to the multiple confidence types may be combined.
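Both options, taking the highest-level method or combining methods, can be sketched in one small function. The name choose_actions, the combine flag and the input layout are assumptions; the source does not say how combined methods are ordered, so ordering by descending level with duplicates removed is an illustrative choice.

```python
from typing import Dict, List, Sequence, Tuple


def choose_actions(
    triggered: Dict[str, Tuple[int, Sequence[str]]],
    combine: bool = False,
) -> List[str]:
    """Select the processing method for a node flagged by several confidence types.

    `triggered` maps each confidence type that fired to (level, actions).
    By default only the highest-level type's actions are returned; with
    combine=True the actions of all types are merged, highest level first.
    """
    if not triggered:
        return []
    by_level = sorted(triggered.values(), key=lambda item: item[0], reverse=True)
    if not combine:
        return list(by_level[0][1])
    merged: List[str] = []
    for _level, actions in by_level:
        for action in actions:
            if action not in merged:  # drop duplicates, keep first occurrence
                merged.append(action)
    return merged


# Hypothetical case: two confidence types fired for the same node.
fired = {"high": (5, ["restart"]), "low": (1, ["isolate", "restart"])}
```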
The scheme provided by the invention opens the configuration of indexes, confidence types and the corresponding processing schemes to the user, who can configure them independently according to actual requirements or different scenarios, which greatly improves the flexibility of node management.
Based on the same inventive concept, according to another aspect of the present invention, an embodiment of the present invention further provides a distributed system management system 400, as shown in fig. 2, including:
a configuration module 401 configured to configure a confidence type table and an index table;
a mapping module 402 configured to establish a mapping between each confidence type in the confidence type table and one or more indexes in the index table, and to establish a mapping between each index in the index table and one or more nodes;
an acquisition module 403 configured to collect and record corresponding indexes meeting preset conditions from each node to obtain an index record table;
a determining module 404 configured to determine the nodes needing to be processed according to the index record table, the index table and the confidence type table, and to process the nodes needing to be processed according to the processing method configured in the confidence type table.
Based on the same inventive concept, according to another aspect of the present invention, as shown in fig. 3, an embodiment of the present invention further provides a computer apparatus 501, comprising:
at least one processor 520; and
a memory 510, the memory 510 storing a computer program 511 executable on the processor, the processor 520 executing the program to perform the steps of any of the distributed system management methods as described above.
Based on the same inventive concept, according to another aspect of the present invention, as shown in fig. 4, an embodiment of the present invention further provides a computer-readable storage medium 601, where the computer-readable storage medium 601 stores computer program instructions 610, and the computer program instructions 610, when executed by a processor, perform the steps of any one of the above distributed system management methods.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes of the methods of the above embodiments may be implemented by a computer program to instruct related hardware to implement the methods.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the above embodiments of the present invention are merely for description, and do not represent the advantages or disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, where the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit or scope of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (9)

1. A distributed system management method, comprising the steps of:
configuring a confidence type table and an index table;
establishing a mapping between each confidence type in the confidence type table and one or more indexes in the index table, and establishing a mapping between each index in the index table and one or more nodes;
collecting and recording corresponding indexes meeting preset conditions from each node to obtain an index record table;
determining nodes needing to be processed according to the index record table, the index table and the confidence type table, and processing the nodes needing to be processed according to a processing method configured in the confidence type table;
wherein, the node needing to be processed is determined according to the index record table, the index table and the confidence type table, and the method further comprises the following steps:
determining confidence types respectively corresponding to a plurality of entries belonging to the same node in the index record table according to the mapping relation between each confidence type in the confidence type table and one or more indexes in the index table;
and acquiring a plurality of attributes of the corresponding confidence types according to the confidence type table so as to judge whether the same node needs to be processed or not according to the attributes.
2. The method of claim 1, wherein configuring a confidence type table further comprises:
creating a plurality of confidence types in the confidence type table;
configuring, for each of the confidence types, attributes including a level, a period, a data point, and a processing method for a node.
3. The method of claim 1, wherein configuring the index table further comprises:
creating a plurality of indexes in the index table.
4. The method of claim 2, wherein each entry in the index record table includes an index satisfying a preset condition, an occurrence time, and a corresponding node.
5. The method of claim 4, wherein determining whether the same node needs processing according to the attribute further comprises:
judging whether, within the period corresponding to a given confidence type, the number of entries corresponding to that confidence type reaches the data point configured for that confidence type;
in response to the data point being reached, determining that the same node needs to be processed.
6. The method of claim 4, wherein the node requiring processing is processed according to a processing method configured in the confidence type table, further comprising:
and in response to the judgment results of a plurality of corresponding confidence types all indicating that the same node needs to be processed, selecting the processing method corresponding to the highest-level confidence type among the plurality of corresponding confidence types to process the same node.
7. A distributed system management system, comprising:
a configuration module configured to configure a confidence type table and an index table;
a mapping module configured to establish a mapping between each confidence type in the confidence type table and one or more indexes in the index table, and to establish a mapping between each index in the index table and one or more nodes;
an acquisition module configured to collect and record corresponding indexes meeting preset conditions from each node to obtain an index record table;
a determining module configured to determine the nodes needing to be processed according to the index record table, the index table and the confidence type table, and to process the nodes needing to be processed according to the processing method configured in the confidence type table;
wherein, the node needing to be processed is determined according to the index record table, the index table and the confidence type table, and the method further comprises the following steps:
determining confidence types respectively corresponding to a plurality of entries belonging to the same node in the index record table according to the mapping relation between each confidence type in the confidence type table and one or more indexes in the index table;
and acquiring a plurality of attributes of the corresponding confidence types according to the confidence type table so as to judge whether the same node needs to be processed or not according to the attributes.
8. A computer device, comprising:
at least one processor; and
memory storing a computer program operable on the processor, characterized in that the processor executes the program to perform the steps of the method according to any of claims 1-6.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1-6.
CN202011365317.7A 2020-11-28 2020-11-28 Distributed system management method, system, device and medium Active CN112486771B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011365317.7A CN112486771B (en) 2020-11-28 2020-11-28 Distributed system management method, system, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011365317.7A CN112486771B (en) 2020-11-28 2020-11-28 Distributed system management method, system, device and medium

Publications (2)

Publication Number Publication Date
CN112486771A (en) 2021-03-12
CN112486771B (en) 2023-01-06

Family

ID=74936877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011365317.7A Active CN112486771B (en) 2020-11-28 2020-11-28 Distributed system management method, system, device and medium

Country Status (1)

Country Link
CN (1) CN112486771B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563550A (en) * 2018-04-23 2018-09-21 上海达梦数据库有限公司 A kind of monitoring method of distributed system, device, server and storage medium
CN111241545A (en) * 2020-01-10 2020-06-05 苏州浪潮智能科技有限公司 Software processing method, system, device and medium

Also Published As

Publication number Publication date
CN112486771A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN110661659B (en) Alarm method, device and system and electronic equipment
US20170185468A1 (en) Creating A Correlation Rule Defining A Relationship Between Event Types
CN110213068B (en) Message middleware monitoring method and related equipment
CN111600746B (en) Network fault positioning method, device and equipment
CN106933843B (en) Database heartbeat detection method and device
JP4652090B2 (en) Event notification management program, event notification management apparatus, and event notification management method
CN111796959A (en) Host machine container self-healing method, device and system
CN110659147B (en) Self-repairing method and system based on module self-checking behavior
CN112463441B (en) Abnormal task processing method, system, equipment and medium
CN110795264A (en) Monitoring management method and system and intelligent management terminal
CN112486771B (en) Distributed system management method, system, device and medium
CN113806045A (en) Task allocation method, system, device and medium
US9443196B1 (en) Method and apparatus for problem analysis using a causal map
CN109587218B (en) Cluster election method and device
CN108173711B (en) Data exchange monitoring method for internal system of enterprise
CN111309515A (en) Disaster recovery control method, device and system
CN113835916A (en) Ambari big data platform-based alarm method, system and equipment
CN112596986A (en) Monitoring method and device
CN115705259A (en) Fault processing method, related device and storage medium
CN113568781A (en) Database error processing method and device and database cluster access system
CN108897645B (en) Database cluster disaster tolerance method and system based on standby heartbeat disk
CN112925686A (en) Data acquisition method, server, system and storage medium
CN107707402B (en) Management system and management method for service arbitration in distributed system
CN115150253B (en) Fault root cause determining method and device and electronic equipment
CN115134213B (en) Disaster recovery method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant