CN114168071B - Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium - Google Patents

Publication number
CN114168071B
Authority
CN
China
Prior art keywords
node
cluster
time length
state
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111275738.5A
Other languages
Chinese (zh)
Other versions
CN114168071A (en)
Inventor
赵晓青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Inspur Data Technology Co Ltd
Original Assignee
Jinan Inspur Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Inspur Data Technology Co Ltd
Priority to CN202111275738.5A
Publication of CN114168071A
Application granted
Publication of CN114168071B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)
  • Retry When Errors Occur (AREA)

Abstract

The invention provides a distributed cluster capacity expansion method, a distributed cluster capacity expansion device and a medium. The method is applied to a main node cluster, and a standby node cluster is also deployed in the distributed storage system in which the main node cluster is deployed. The method comprises the following steps: judging whether a first node in the main node cluster is successfully connected; when the first node cannot be connected, judging whether the first time length for which the first node has been in the unconnected state is greater than or equal to a first time length threshold; when the first time length is greater than the first time length threshold, removing the first node; monitoring the performance state of the main node cluster after the first node is removed; and when the performance state is greater than a first state threshold and the duration of that performance state exceeds a second duration threshold, selecting a node from the standby node cluster and adding it to the main node cluster. The method can automatically perform capacity reduction or capacity expansion on the main node cluster based on the connection state of the first node, so that the main node cluster can meet the load requirement.

Description

Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium
Technical Field
The invention relates to the technical field of internet, in particular to a distributed cluster capacity expansion method, a distributed cluster capacity expansion device and a medium.
Background
With the rapid development of cloud computing and big data technology, the data accumulated in production and daily life is growing exponentially, so mass storage technology has become an indispensable part of the development of the Internet.
A distributed storage system stores data dispersed across a plurality of independent devices, which then process the data together. Each device may be understood as a node in the distributed storage system, and a set of nodes forms a node cluster. In practical applications, however, if some nodes in the node cluster become abnormal, a person has to identify them and decide whether to remove them. When the cluster load is heavy and an abnormal node is not handled in time, the overall pressure on the cluster easily increases, which can cause the whole cluster to crash.
In view of the above, a capacity expansion operation needs to be performed on the distributed storage system to ensure that it can operate normally. In the related art, however, the nodes in an abnormal state have to be identified manually and the capacity expansion operation has to be performed manually, so a great deal of time and effort is spent checking the usage of the node cluster.
Disclosure of Invention
Therefore, the invention aims to overcome the defect in the prior art that, when a capacity expansion operation is performed on a distributed storage system, the usage of the node cluster has to be identified and checked manually, which makes the capacity expansion process time-consuming and labor-intensive. To this end, the invention provides a distributed cluster capacity expansion method, a distributed cluster capacity expansion device and a medium.
According to a first aspect, the invention provides a distributed cluster capacity expansion method, which is applied to a main node cluster, wherein a standby node cluster is also deployed in a distributed storage system deployed by the main node cluster; the method comprises the following steps:
judging whether a first node in the main node cluster is successfully connected or not;
when the first node cannot be connected, judging whether the first time length of the first node in an unconnected state is greater than or equal to a first time length threshold value;
removing the first node when the first time length is greater than a first time length threshold;
monitoring the performance state of the master node cluster after the first node is removed;
and when the performance state is greater than a first state threshold and the performance state duration exceeds a second duration threshold, selecting a node in the standby node cluster and adding the node to the main node cluster.
In this manner, abnormal nodes in the main node cluster can be identified automatically based on the connection state of the first node and then judged in a targeted way, so that the main node cluster can be reduced or expanded in capacity in time. This ensures the safety of the main node cluster, improves the stability of data storage and the overall performance of the main node cluster, avoids serious losses caused by insufficient memory resources in the main node cluster, and saves cost.
With reference to the first aspect, in a first implementation manner of the first aspect, before monitoring a performance state of the master node cluster after removing the first node, the method further includes:
and when the first time length is smaller than a first time length threshold value, reserving the first node.
In this manner, a first time length smaller than the first time length threshold indicates that the first node has failed only temporarily and can recover its connection within a short time. The first node is therefore retained, so that the remaining nodes in the main node cluster do not come under excessive load pressure, which would affect the stability of data storage, after the number of nodes is reduced.
With reference to the first aspect or the first implementation manner of the first aspect, in a second implementation manner of the first aspect, after monitoring a performance state of the master node cluster after removing the first node, the method further includes:
when the performance state is larger than a second performance threshold and smaller than a first performance threshold and the duration of the performance state is larger than a third duration threshold, generating a first alarm prompt for performance use early warning;
and reporting the first alarm prompt to the distributed storage system.
In this manner, reporting the first alarm prompt informs the distributed storage system that, after the first node was removed, the performance state of the main node cluster has become abnormal during data processing: the load pressure is too high and data processing requests need to be reduced. This achieves the purpose of early warning.
With reference to the second implementation manner of the first aspect, in a third implementation manner of the first aspect, the method further includes:
and when the performance state is greater than a first performance threshold value but the performance state duration does not exceed the second duration threshold, not selecting a node in the standby node cluster.
In this manner, a performance state that is greater than the first performance threshold but whose duration does not exceed the second duration threshold indicates that the main node cluster is only temporarily abnormal. Although the number of nodes in the main node cluster has been reduced, the overall stability of the main node cluster can still be ensured. To avoid wasting resources, no node needs to be selected from the standby node cluster and added to the main node cluster, which helps reduce cost.
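The three outcomes described in the first aspect and its second and third implementation manners (expand, warn, or do nothing) can be sketched as a single decision function. This is only an illustrative sketch: the function name, the `Action` enum, and all default threshold values are hypothetical and are not given in the text, and the performance state is assumed to be a usage ratio between 0 and 1.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"      # leave the main node cluster as it is
    WARN = "warn"      # generate and report the first alarm prompt
    EXPAND = "expand"  # add a node from the standby node cluster


def decide_action(perf, duration_s, *,
                  first_perf=0.90,           # hypothetical first performance threshold
                  second_perf=0.75,          # hypothetical second performance threshold
                  second_duration_s=300.0,   # hypothetical second duration threshold
                  third_duration_s=120.0):   # hypothetical third duration threshold
    """Map a performance state and how long it has lasted to an action."""
    if perf > first_perf:
        # Sustained overload triggers expansion; a short spike is treated
        # as a temporary abnormality and no standby node is selected.
        return Action.EXPAND if duration_s > second_duration_s else Action.NONE
    if second_perf < perf < first_perf and duration_s > third_duration_s:
        # Between the two performance thresholds for long enough: early warning.
        return Action.WARN
    # The text does not define behaviour when perf equals a threshold exactly;
    # this sketch takes no action in that case.
    return Action.NONE
```

For instance, under these assumed defaults `decide_action(0.95, 400)` yields `Action.EXPAND`, while `decide_action(0.95, 100)` yields `Action.NONE`.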
With reference to the first aspect or the first implementation manner of the first aspect, in a fourth implementation manner of the first aspect, before determining whether the first time length of the first node in the unconnected state is greater than or equal to the first time length threshold, the method further includes:
reconnecting the first node in an unconnected state, and counting the connection times;
and when the connection times are greater than the appointed times, sending a second alarm prompt which is not connectable with the first node to the distributed storage system.
In this manner, to prevent factors such as an unstable network environment from affecting the connection result, the first node is connected repeatedly while it is unconnected and within the first sub-threshold, which improves the likelihood that the first node is connected successfully. When the number of connection attempts is greater than the specified number, the first node is considered unable to connect and cannot be called while the main node cluster is running. A second alarm prompt is therefore generated and sent to the distributed storage system, so that the distributed storage system can determine that the first node cannot be used.
With reference to the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, reconnecting the first node in a state that the first node is not connected, and counting a connection number includes:
after the current connection of the first node fails, adding 1 to the connection times of the first node;
and reconnecting the first node after a specified duration interval.
In this manner, the connection count is incremented after each failed connection of the first node, so that each failure is recorded explicitly and the count is accurate. Reconnecting the first node after the specified interval helps achieve automatic reconnection.
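The reconnection logic of the fourth and fifth implementation manners can be sketched as follows. This is a minimal sketch under stated assumptions: `try_connect` is a hypothetical callable that returns whether one connection attempt succeeded, and the function name and parameters are illustrative rather than taken from the text.

```python
import time
from typing import Callable, Tuple


def reconnect_with_retries(try_connect: Callable[[], bool],
                           max_attempts: int,
                           interval_s: float,
                           sleep: Callable[[float], None] = time.sleep
                           ) -> Tuple[bool, int]:
    """Retry a connection, counting failures and waiting between attempts.

    Returns (connected, failure_count). When connected is False, the failure
    count has exceeded max_attempts and the caller would send the second
    alarm prompt to the distributed storage system.
    """
    failures = 0
    while True:
        if try_connect():
            return True, failures
        failures += 1                 # add 1 after the current connection fails
        if failures > max_attempts:   # greater than the specified number
            return False, failures
        sleep(interval_s)             # reconnect after the specified interval
```

Passing `sleep` as a parameter is a design choice of this sketch so that the wait can be replaced in tests; it is not part of the described method.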
With reference to the first aspect, in a sixth implementation manner of the first aspect, the performance state includes at least any one or more of the following performance states: CPU usage, memory usage and hard disk usage.
According to a second aspect, the invention provides a distributed cluster capacity expansion device, which is applied to a main node cluster, wherein a standby node cluster is also deployed in a distributed storage system deployed by the main node cluster; the device comprises:
a connection judging unit, configured to judge whether a first node in the master node cluster is successfully connected;
a first time length judging unit, configured to judge whether a first time length of the first node in an unconnected state is greater than or equal to a first time length threshold value when the first node cannot be connected;
the screening unit is used for removing the first node when the first time length is larger than a first time length threshold value;
the monitoring unit is used for monitoring the performance state of the master node cluster after the first node is removed;
and the capacity expansion unit is used for selecting one node from the standby node cluster and adding the node to the main node cluster when the performance state is larger than a first state threshold and the performance state duration exceeds a second duration threshold.
With reference to the second aspect, in a first implementation manner of the second aspect, the screening unit includes:
and the node reservation unit is used for reserving the first node when the first time length is smaller than a first time length threshold value.
With reference to the second aspect or the first implementation of the second aspect, in a second implementation of the second aspect, the apparatus further includes:
the generating unit is used for generating a first alarm prompt for performance use early warning when the performance state is larger than a second performance threshold and smaller than a first performance threshold and the duration of the performance state is larger than a third duration threshold;
and the reporting unit is used for reporting the first alarm prompt to the distributed storage system.
With reference to the second implementation manner of the second aspect, in a third implementation manner of the second aspect, the apparatus further includes:
and the selecting unit is used for not selecting nodes in the standby node cluster when the performance state is larger than a first performance threshold value but the performance state duration does not exceed the second duration threshold.
With reference to the second aspect or the first implementation manner of the second aspect, in a fourth implementation manner of the second aspect, the apparatus further includes:
the statistics unit is used for reconnecting the first node when the first node is in an unconnected state, and counting the connection times;
and the sending unit is used for sending a second alarm prompt which is unconnectable to the first node to the distributed storage system when the connection times are larger than the appointed times.
With reference to the second aspect, in a fifth implementation manner of the second aspect, the statistics unit includes:
the accumulating unit is used for adding 1 to the connection times of the first node after the current connection failure of the first node;
and the control unit is used for reconnecting the first node after the specified duration interval.
With reference to the second aspect, in a sixth implementation manner of the second aspect, the performance status includes at least any one or more of the following performance states: CPU usage, memory usage and hard disk usage.
According to a third aspect, the embodiment of the present invention further provides a computer device, including a memory and a processor, where the memory and the processor are communicatively connected to each other, and the memory stores computer instructions, and the processor executes the computer instructions, thereby executing the distributed cluster expansion method according to any one of the first aspect and the optional embodiments thereof.
According to a fourth aspect, embodiments of the present invention further provide a computer readable storage medium storing computer instructions for causing the computer to perform the distributed cluster expansion method of any of the first aspect and optional embodiments thereof.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flow chart of a distributed cluster expansion method according to an exemplary embodiment.
Fig. 2 is a flow chart of another distributed cluster expansion method in accordance with an exemplary embodiment.
Fig. 3 is a flow chart of yet another distributed cluster expansion method in accordance with an exemplary embodiment.
Fig. 4 is a flow chart of yet another distributed cluster expansion method in accordance with an exemplary embodiment.
Fig. 5 is a flow chart of yet another distributed cluster expansion method in accordance with an exemplary embodiment.
Fig. 6 is a flow chart of yet another distributed cluster expansion method in accordance with an exemplary embodiment.
Fig. 7 is a flow chart of yet another distributed cluster expansion method in accordance with an exemplary embodiment.
Fig. 8 is a block diagram of a distributed cluster expansion device according to an exemplary embodiment.
Fig. 9 is a schematic diagram of a hardware structure of a computer device according to an exemplary embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to be within the scope of the invention.
A distributed storage system stores data dispersed across a plurality of independent devices, which then process the data together. Each device may be understood as a node in the distributed storage system, and a set of nodes forms a node cluster.
In practical applications, if some nodes in the node cluster become abnormal, a person has to identify them and decide whether to remove them. When the cluster load is heavy and an abnormal node is not handled in time, the overall pressure on the cluster easily increases, which can cause the whole cluster to crash. Therefore, a capacity expansion operation needs to be performed on the distributed storage system to ensure that it can operate normally.
In the related art, however, the nodes in an abnormal state have to be identified manually and the capacity expansion operation has to be performed manually, so a great deal of time and effort is spent checking the usage of the node cluster.
To solve the above problems, an embodiment of the present invention provides a distributed cluster capacity expansion method for a distributed storage system. It should be noted that the execution body of the method may be a main node cluster deployed in the distributed storage system, and that a virtual node may be implemented in software, hardware, or a combination of the two as part or all of a computer device. In the following method embodiments, a virtual node is taken as the execution body by way of example.
In the distributed storage system of the embodiment of the invention, a main node cluster and a standby node cluster are deployed. The nodes in the main node cluster are the nodes used for data processing, and the nodes in the standby node cluster are standby nodes for the main node cluster. The distributed cluster capacity expansion method provided by the invention determines the connection condition of the main node cluster from the connection state of the first node, and removes the first node when it cannot be connected, so as to ensure the safety of the main node cluster. After the first node is removed, the performance state of the main node cluster is monitored automatically, which helps detect load-balancing abnormalities in the distributed storage system in time and perform capacity reduction or capacity expansion on the main node cluster promptly. During operation the main node cluster can then meet the load requirement while processing data normally, which guarantees the stability of data storage.
Fig. 1 is a flow chart of a distributed cluster expansion method according to an exemplary embodiment. As shown in fig. 1, the distributed cluster expansion method includes the following steps S101 to S105.
In step S101, it is determined whether the connection of the first node in the master node cluster is successful.
In the embodiment of the invention, the first node may be any node in the master node cluster. In one example, to facilitate automatic determination of the connection state of the first node, the connection state of the first node may be determined periodically by means of timing detection.
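The text does not specify how the connection state is tested. As one possible illustration, the check in step S101 could be a TCP connection attempt with a timeout, which a timer would call periodically. The function name, the use of TCP, and the timeout value are all assumptions of this sketch.

```python
import socket


def is_connectable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds in time.

    Using TCP reachability as the "connection state" is an assumption;
    the source only says that the state is detected on a timer.
    """
    try:
        # create_connection raises OSError (including timeouts and
        # connection refusals) when the node cannot be reached.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```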
In step S102, when the first node cannot connect, it is determined whether the first time length of the first node in the unconnected state is greater than or equal to a first time length threshold.
In the embodiment of the invention, the fact that the first node cannot be connected indicates that the connection between the first node and the distributed storage system is abnormal, which affects communication between the distributed storage system and the first node. To determine why the first node cannot be connected, it is judged whether the first time length of the first node in the unconnected state is greater than or equal to a first time length threshold. Setting the first time length threshold bounds how long this judgment may take, so that an overly long judgment does not delay subsequent processing. Possible reasons for the failure to connect include: the first node is temporarily unable to connect because the network is unstable; or the network environment of the first node has changed, and the first node is offline or no longer belongs to the main node cluster.
In step S103, when the first time period is greater than the first time period threshold, the first node is removed.
In the embodiment of the invention, a first time length greater than the first time length threshold indicates that the distributed storage system cannot call the first node when performing data processing. To prevent the main node cluster from crashing because the first node cannot be connected, the first node is removed, which ensures the safety of the main node cluster and improves the stability of data storage.
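Steps S102 and S103 amount to tracking how long each node has been unconnected and removing it once that time exceeds the first time length threshold. The following is a minimal sketch of that bookkeeping; the class name, the string return values, and the use of caller-supplied timestamps in seconds are hypothetical choices made for illustration.

```python
from typing import Dict


class UnreachableTracker:
    """Track how long each node has been unconnected (the first time length)."""

    def __init__(self, first_duration_threshold_s: float):
        self.threshold_s = first_duration_threshold_s
        self._down_since: Dict[str, float] = {}  # node id -> time of first failure

    def observe(self, node_id: str, connected: bool, now: float) -> str:
        """Record one connection check and return "keep" or "remove"."""
        if connected:
            # The node recovered, so it was only a temporary failure.
            self._down_since.pop(node_id, None)
            return "keep"
        started = self._down_since.setdefault(node_id, now)
        if now - started > self.threshold_s:
            # Unconnected for longer than the threshold: remove the node.
            del self._down_since[node_id]
            return "remove"
        # Still within the threshold: retain the node for now.
        return "keep"
```

The sketch removes a node only when the elapsed time is strictly greater than the threshold, matching step S103; the text does not say what should happen when the two are exactly equal.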
In step S104, the performance state of the master node cluster after the first node is removed is monitored.
In the embodiment of the invention, after the first node is removed, the remaining nodes in the main node cluster may not have enough memory resources when they perform data processing, which would affect the performance state of the main node cluster. The performance state of the main node cluster is therefore monitored after the first node is removed, so that any abnormality in the performance state can be detected in time.
In one example, the performance state may include any one or more of the following: CPU usage, memory usage, and hard disk usage. The invention is not limited in this respect.
In step S105, when the performance state is greater than the first state threshold and the performance state duration exceeds the second duration threshold, then a node is selected from the standby node cluster and added to the primary node cluster.
In the embodiment of the present invention, the first state threshold may be understood as the maximum performance state at which the stability of the main node cluster can still be ensured when the main node cluster performs data processing and the callable memory resources are severely insufficient. The second duration threshold may be understood as the maximum duration for which the main node cluster can remain unstable without crashing. If the performance state exceeds the first state threshold and the duration of that performance state exceeds the second duration threshold, the callable memory resources are severely insufficient for data processing, the load pressure on the main node cluster is too high, and the main node cluster may crash. Therefore, to avoid the losses caused by a crash of the main node cluster and to improve the stability of data storage, a node is selected from the standby node cluster and added to the main node cluster. Expanding the main node cluster by adding a node enlarges its storage resources, enhances the stability of data storage, and saves cost. For example, take as the first state threshold a CPU usage of 90% of the total CPU capacity of the main node cluster. When CPU usage stays above 90% for longer than the second duration threshold, a node is selected from the standby node cluster and added to the main node cluster, which achieves the purpose of capacity expansion.
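The condition in step S105 requires measuring how long the performance state has stayed above the first state threshold. One way to sketch this is a small tracker fed with periodic usage samples. The class name and the 60-second duration used in the usage note are hypothetical; only the 90% CPU figure comes from the example in the text.

```python
from typing import Optional


class SustainedOverload:
    """Detect a performance state that stays above a threshold for too long."""

    def __init__(self, state_threshold: float, duration_threshold_s: float):
        self.state_threshold = state_threshold
        self.duration_threshold_s = duration_threshold_s
        self._above_since: Optional[float] = None  # when the overload began

    def update(self, usage: float, now: float) -> bool:
        """Feed one sample; return True when expansion should be triggered."""
        if usage <= self.state_threshold:
            # Back at or under the threshold: the overload episode has ended.
            self._above_since = None
            return False
        if self._above_since is None:
            self._above_since = now
        return now - self._above_since > self.duration_threshold_s
```

With `SustainedOverload(0.90, 60.0)`, samples of 0.95 at 0 s and 0.96 at 30 s do not trigger expansion, while a sample of 0.97 at 61 s does.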
In one example, after a node is selected from the standby node cluster, relevant information about the node, such as its network card information and IP address, is acquired. When the node is added to the main node cluster and a connection is established with it, the specific node to connect to is then known unambiguously, which avoids connecting to the wrong node and improves connection accuracy.
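Moving a node from the standby node cluster to the main node cluster together with its identifying information might look like the sketch below. The `NodeInfo` fields, the function name, and the choice to take the first standby node are assumptions; the text does not state how the standby node is chosen.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NodeInfo:
    node_id: str
    ip_address: str  # used to establish the connection to this exact node
    nic: str         # network card information


def promote_standby(main: List[NodeInfo],
                    standby: List[NodeInfo]) -> Optional[NodeInfo]:
    """Move one node from the standby cluster into the main cluster.

    Returns the promoted node, or None when no standby node is available.
    Taking the first standby node is an arbitrary choice for this sketch.
    """
    if not standby:
        return None
    node = standby.pop(0)
    main.append(node)
    return node
```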
Through this embodiment, abnormal nodes in the main node cluster can be identified automatically based on the connection state of the first node and then judged in a targeted way, so that the main node cluster can be reduced or expanded in capacity in time. This ensures the safety of the main node cluster, improves the stability of data storage and the overall performance of the main node cluster, avoids serious losses caused by insufficient memory resources in the main node cluster, and saves cost.
In one embodiment, the distributed storage system has two mechanisms: a capacity expansion mechanism and a capacity reduction mechanism. The capacity reduction mechanism judges whether the current main node cluster needs to perform a capacity reduction operation, which ensures the safety of the main node cluster; for example, it performs steps S101-S103. The capacity expansion mechanism judges whether the current main node cluster needs to perform a capacity expansion operation, which ensures the overall stability of the main node cluster; for example, it performs steps S104-S105.
In another embodiment, the distributed storage system further includes a cluster performance monitoring module. The module runs as a daemon and periodically collects performance information for each first node in the main node cluster, including CPU, memory, and disk usage, in order to determine whether to automatically execute the capacity reduction mechanism or the capacity expansion mechanism.
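A periodic monitoring loop of this kind can be sketched with a background thread. This is an illustration only: the text describes a daemon process, whereas the sketch uses a Python daemon thread as a stand-in, and `sample` (how metrics are read from a node) and `on_report` (what is done with them) are hypothetical callables supplied by the caller.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Metrics:
    cpu: float     # CPU usage ratio
    memory: float  # memory usage ratio
    disk: float    # disk usage ratio


def poll_once(node_ids: Iterable[str],
              sample: Callable[[str], Metrics]) -> Dict[str, Metrics]:
    """Collect one round of metrics for every node."""
    return {node_id: sample(node_id) for node_id in node_ids}


def start_monitor(node_ids: Iterable[str],
                  sample: Callable[[str], Metrics],
                  on_report: Callable[[Dict[str, Metrics]], None],
                  interval_s: float,
                  stop: threading.Event) -> threading.Thread:
    """Poll all nodes every interval_s seconds until stop is set."""
    ids = list(node_ids)

    def loop() -> None:
        while not stop.is_set():
            on_report(poll_once(ids, sample))
            stop.wait(interval_s)  # sleeps, but wakes immediately on stop

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

The report passed to `on_report` is where a caller would decide whether to run the capacity reduction or capacity expansion mechanism.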
Fig. 2 is a flow chart of another distributed cluster expansion method in accordance with an exemplary embodiment. As shown in fig. 2, the distributed cluster expansion method includes the following steps.
In step S201, it is determined whether the connection of the first node in the master node cluster is successful.
In step S202, when the first node cannot connect, it is determined whether the first time length of the first node in the unconnected state is greater than or equal to a first time length threshold.
In step S2031, when the first time length is greater than the first time length threshold, the first node is removed.
In step S2032, when the first time length is less than the first time length threshold, the first node is retained.
In the embodiment of the invention, a first time length smaller than the first time length threshold indicates that the first node has failed only temporarily and can recover its connection within a short time. The first node is therefore retained, which avoids reducing the number of first nodes and thereby placing excessive load on the other first nodes in the master node cluster and affecting the stability of data storage.
In step S204, the performance state of the master node cluster after the first node is removed is monitored.
In step S205, when the performance state is greater than the first state threshold and the performance state duration exceeds the second duration threshold, a node is selected from the standby node cluster and added to the master node cluster.
Through this embodiment, whether to retain a first node that cannot be connected can be determined quickly from how long it has been unconnected, which improves identification efficiency.
In one embodiment, steps S201-S2032 may be performed based on a capacity reduction mechanism. Based on the capacity expansion mechanism, steps S204 to S205 are performed.
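The two decisions in this flow can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function and enum names are hypothetical, the numeric values are examples only, and the case where the first time length exactly equals the threshold is treated as removal, following the "greater than or equal to" test in step S202.

```python
from enum import Enum


class ScaleInDecision(Enum):
    REMOVE = "remove"
    RETAIN = "retain"


def decide_scale_in(unconnected_s: float, first_threshold_s: float) -> ScaleInDecision:
    """Steps S202 to S2032: remove a node that has been unconnected too long.

    Assumption: equality counts as removal, as in the test of step S202.
    """
    if unconnected_s >= first_threshold_s:
        return ScaleInDecision.REMOVE
    return ScaleInDecision.RETAIN


def should_expand(performance: float, duration_s: float,
                  first_state_threshold: float, second_threshold_s: float) -> bool:
    """Step S205: expand only when the overload is both high and sustained."""
    return performance > first_state_threshold and duration_s > second_threshold_s


HOUR = 3600.0
long_outage = decide_scale_in(6 * HOUR, first_threshold_s=5 * HOUR)
short_outage = decide_scale_in(0.5 * HOUR, first_threshold_s=5 * HOUR)
expand = should_expand(0.95, 2 * HOUR, first_state_threshold=0.90,
                       second_threshold_s=1 * HOUR)
print(long_outage, short_outage, expand)
```

Separating the two functions mirrors the split between the capacity reduction mechanism and the capacity expansion mechanism.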
Fig. 3 is a flow chart of yet another distributed cluster expansion method in accordance with an exemplary embodiment. As shown in fig. 3, the distributed cluster expansion method includes the following steps.
In step S301, it is determined whether the connection of the first node in the master node cluster is successful.
In step S302, when the first node cannot connect, it is determined whether the first time length of the first node in the unconnected state is greater than or equal to a first time length threshold.
In step S3031, when the first time length is greater than the first time length threshold, the first node is removed.
In step S3032, when the first time length is less than the first time length threshold, the first node is retained.
In step S304, the performance state of the master node cluster after the first node is removed is monitored.
In step S305, when the performance state is greater than the second performance threshold and less than the first performance threshold, and the performance state duration is greater than the third duration threshold, a first alert prompt giving early warning of performance usage is generated.
In the embodiment of the invention, the second performance threshold may be understood as the minimum performance state at which the load pressure of the master node cluster exceeds the normal pressure range without yet affecting the stability of the master node cluster during data processing. That is, a performance state above the second performance threshold indicates that, while handling data processing requests, the cluster has exceeded its own load balance, which slows subsequent processing but does not affect the stability of the master node cluster. For example, if the first state threshold is a CPU utilization of 90% of the total CPUs in the master node cluster, the second performance threshold may be a CPU utilization of 70% of the total CPUs in the master node cluster. The second performance threshold may be customized according to actual requirements, which is not limited by the present invention.
When the performance state is greater than the second performance threshold and less than the first performance threshold, and its duration is greater than the third duration threshold, the load pressure of the master node cluster has exceeded the normal pressure range and cannot return to it within a short time. A first alert prompt is therefore generated and reported to the distributed storage system. The prompt informs the distributed storage system that, after the first node was removed, the performance state of the master node cluster is abnormal during data processing and the load pressure is excessive, so data processing requests need to be reduced. This achieves the purpose of early warning.
In one example, if the performance state covers several performance metrics, each metric may be monitored according to its priority, which may be user-defined.
In step S306, the first alert prompt is reported to the distributed storage system.
In step S307, when the performance state is greater than the first state threshold and the performance state duration exceeds the second duration threshold, a node is selected from the standby node cluster and added to the master node cluster.
In one embodiment, steps S301-S3032 may be performed based on a capacity reduction mechanism. Based on the capacity expansion mechanism, steps S304-S307 are performed.
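The relationship between the warning band and the expansion condition in steps S305 to S307 can be illustrated with a small Python sketch. The function name `classify`, the returned labels, and the duration values are assumptions; only the 70% and 90% figures come from the example in the text.

```python
def classify(performance: float, duration_s: float, *,
             second_perf: float, first_perf: float,
             third_duration_s: float, second_duration_s: float) -> str:
    """Map a sustained performance state to an action (steps S305 to S307).

    Returns "expand", "warn", or "none". A value exactly equal to a threshold
    falls through to "none" because the text uses strict comparisons here.
    """
    if performance > first_perf and duration_s > second_duration_s:
        return "expand"
    if second_perf < performance < first_perf and duration_s > third_duration_s:
        return "warn"
    return "none"


# Duration thresholds below are assumed values for illustration.
limits = dict(second_perf=0.70, first_perf=0.90,
              third_duration_s=600.0, second_duration_s=300.0)

results = [
    classify(0.80, 900.0, **limits),  # inside the warning band, sustained
    classify(0.95, 400.0, **limits),  # above the first threshold, sustained
    classify(0.95, 100.0, **limits),  # above the first threshold, only briefly
    classify(0.50, 900.0, **limits),  # normal load
]
print(results)
```

The warning band and the expansion condition never overlap, so at most one action results from a given sample.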
Fig. 4 is a flow chart of yet another distributed cluster expansion method in accordance with an exemplary embodiment. As shown in fig. 4, the distributed cluster expansion method includes the following steps.
In step S401, it is determined whether the connection of the first node in the master node cluster is successful.
In step S402, when the first node cannot connect, it is determined whether the first time length of the first node in the unconnected state is greater than or equal to a first time length threshold.
In step S4031, when the first time length is greater than the first time length threshold, the first node is removed.
In step S4032, when the first time length is less than the first time length threshold, the first node is retained.
In step S404, the performance state of the master node cluster after the first node is removed is monitored.
In step S405, when the performance state is greater than the second performance threshold and less than the first performance threshold, and the performance state duration is greater than the third duration threshold, a first alert prompt giving early warning of performance usage is generated.
In step S406, the first alert prompt is reported to the distributed storage system.
In step S4071, when the performance state is greater than the first state threshold and the performance state duration exceeds the second duration threshold, a node is selected from the standby node cluster and added to the master node cluster.
In step S4072, when the performance state is greater than the first performance threshold but the performance state duration does not exceed the second duration threshold, no node is selected from the standby node cluster.
In the embodiment of the invention, a performance state greater than the first performance threshold whose duration does not exceed the second duration threshold indicates that the master node cluster is only temporarily abnormal. Although the number of first nodes in the master node cluster has been reduced, the overall stability of the master node cluster can still be ensured. To avoid wasting resources, no node needs to be selected from the standby node cluster and added to the master node cluster, which further reduces cost.
In one embodiment, steps S401-S4032 may be performed based on a capacity reduction mechanism. Based on the capacity expansion mechanism, steps S404 to S4072 are performed.
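The distinction between a sustained overload (step S4071) and a temporary one (step S4072) depends on measuring how long the performance state has stayed above the threshold. One way to track this is sketched below in Python; the class `SustainedTracker`, the sample timestamps, and the 300-second second duration threshold are assumptions for illustration, and the text does not say how the duration is measured.

```python
from typing import Optional


class SustainedTracker:
    """Track how long a metric has continuously stayed above a threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self._since: Optional[float] = None  # time the value first exceeded the threshold

    def update(self, value: float, now_s: float) -> float:
        """Record a sample and return the current duration above the threshold."""
        if value > self.threshold:
            if self._since is None:
                self._since = now_s
            return now_s - self._since
        self._since = None  # the overload ended, so the duration restarts
        return 0.0


tracker = SustainedTracker(threshold=0.90)
second_duration_threshold_s = 300.0  # assumed value

# (timestamp in seconds, CPU usage) pairs; the dip at t=240 resets the duration.
samples = [(0.0, 0.95), (120.0, 0.97), (240.0, 0.60), (360.0, 0.93), (720.0, 0.96)]
decisions = []
for now_s, cpu in samples:
    duration = tracker.update(cpu, now_s)
    decisions.append(duration > second_duration_threshold_s)
print(decisions)
```

Only the final sample triggers expansion, because the earlier overload was interrupted before it reached the assumed duration threshold.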
Fig. 5 is a flow chart of yet another distributed cluster expansion method in accordance with an exemplary embodiment. As shown in fig. 5, the distributed cluster expansion method includes the following steps.
In step S501, it is determined whether the connection of the first node in the master node cluster is successful.
In step S502, when the first node cannot be connected, reconnection to the first node is attempted while it remains in the unconnected state, and the number of connections is counted.
In the embodiment of the invention, reconnecting to the first node prevents factors such as an unstable network environment during the connection process from distorting the connection result, and so increases the likelihood that the first node can be connected.
Counting the reconnections to the first node makes clear how many attempts have been made to establish a connection with the first node within the first sub-threshold.
In step S503, when the number of connections is greater than the specified number of times, a second alert prompt indicating that the first node cannot be connected is sent to the distributed storage system.
In the embodiment of the invention, a number of connections greater than the specified number of times shows that the first node cannot be connected and cannot be called while the master node cluster is running. A second alert prompt is therefore sent to the distributed storage system so that it can determine that the first node is unavailable.
In step S504, it is determined whether the first time length of the first node in the unconnected state is greater than or equal to a first time length threshold.
In step S5051, when the first time length is greater than the first time length threshold, the first node is removed.
In step S5052, when the first time length is less than the first time length threshold, the first node is retained.
In step S506, the performance state of the master node cluster after the first node is removed is monitored.
In step S507, when the performance state is greater than the first state threshold and the performance state duration exceeds the second duration threshold, a node is selected from the standby node cluster and added to the master node cluster.
Through this embodiment, setting a specified number of connection attempts for the first node prevents an excessive number of reconnection attempts from slowing the response of the whole master node cluster to received data processing requests.
In one embodiment, steps S501-S5052 may be performed based on a capacity reduction mechanism. Based on the capacity expansion mechanism, steps S506 to S507 are performed.
In another embodiment, after the current connection to the first node fails, the number of connections to the first node is increased by 1, and the first node is reconnected after a specified duration interval. That is, each time a connection attempt fails, 1 is added to the existing number of connections, which keeps the count accurate. Reconnecting to the first node at the specified interval also helps achieve automatic connection.
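The counting and interval behaviour described above can be sketched in Python as follows. The function `reconnect` and its parameters are hypothetical, and the connection attempt is passed in as a callable so that the sketch does not assume any particular network API.

```python
import time
from typing import Callable, Tuple


def reconnect(connect: Callable[[], bool], specified_times: int,
              interval_s: float) -> Tuple[bool, int]:
    """Retry a failed connection, counting the failed attempts.

    Returns (connected, connection_count). The loop stops once the count
    exceeds specified_times, at which point the caller would send the
    second alert prompt.
    """
    count = 0
    while count <= specified_times:
        if connect():
            return True, count
        count += 1  # add 1 after each failed attempt
        if count <= specified_times:
            time.sleep(interval_s)  # wait the specified interval before retrying
    return False, count


# A node that recovers on the third attempt.
attempts = iter([False, False, True])
recovered, failures = reconnect(lambda: next(attempts), specified_times=10, interval_s=0.0)

# A node that never recovers: the count ends above the specified number.
gave_up, count = reconnect(lambda: False, specified_times=3, interval_s=0.0)
print(recovered, failures, gave_up, count)
```

Bounding the loop by the specified number of times is what keeps repeated reconnection from delaying the rest of the cluster.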
In an implementation scenario, the distributed cluster expansion method may be as shown in fig. 6. Fig. 6 is a flow chart of yet another distributed cluster expansion method according to an exemplary embodiment.
In step S601, a standby node cluster is configured.
In step S602, the related information of the first node in the master node cluster is detected periodically.
In step S603, it is determined whether the number of connections is greater than a specified number of times (e.g., 10 times).
In step S604, when the number of connections is greater than the specified number of times, a second alert prompt is sent to the distributed storage system that the first node is not connectable.
In step S605, it is determined whether the first time length of the first node in the unconnected state is greater than or equal to a first time length threshold (e.g., 5 hours).
In step S606, the master node cluster is triggered to automatically execute the capacity reduction mechanism.
In step S607, it is determined whether the performance state of the master node cluster after the first node is removed is greater than a first state threshold (e.g., 90%) and the performance state duration exceeds a second duration threshold.
In step S608, when it is determined that the performance state is greater than the first state threshold and the performance state duration exceeds the second duration threshold, the master node cluster is triggered to execute the automatic capacity expansion mechanism.
In step S609, a node is selected from the standby node cluster and added to the master node cluster.
In an embodiment, when the first node is connected normally, whether to select a node from the standby node cluster and add it to the master node cluster may be determined directly from the periodically detected performance state of the master node cluster, so that the overall performance of the master node cluster can be improved.
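The sequence of steps S603 to S609 can be followed end to end in a short Python sketch. All names here (`Cluster`, `handle_unconnected_node`, `measure`) are hypothetical. The values of 10 times, 5 hours, and 90% are the examples given in the steps, while the second duration threshold of 1 hour is an assumption because the text gives no value for it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Cluster:
    master: List[str]
    standby: List[str]
    events: List[str] = field(default_factory=list)


def handle_unconnected_node(
    cluster: Cluster,
    node: str,
    connection_count: int,
    unconnected_h: float,
    measure: Callable[[List[str]], Tuple[float, float]],
    specified_times: int = 10,            # example value from step S603
    first_threshold_h: float = 5.0,       # example value from step S605
    first_state_threshold: float = 0.90,  # example value from step S607
    second_threshold_h: float = 1.0,      # assumed; no value is given in the text
) -> None:
    # S603-S604: alert once the connection count exceeds the specified number.
    if connection_count > specified_times:
        cluster.events.append(f"alert: {node} not connectable")
    # S605-S606: reduce capacity only after a long enough outage.
    if unconnected_h < first_threshold_h:
        return
    cluster.master.remove(node)
    cluster.events.append(f"removed {node}")
    # S607: measure the reduced cluster as (performance, overload duration in hours).
    performance, overload_h = measure(cluster.master)
    # S608-S609: expand when the overload is both high and sustained.
    if (performance > first_state_threshold and overload_h > second_threshold_h
            and cluster.standby):
        added = cluster.standby.pop(0)
        cluster.master.append(added)
        cluster.events.append(f"added {added}")


cluster = Cluster(master=["n1", "n2", "n3"], standby=["s1"])
handle_unconnected_node(cluster, "n2", connection_count=11, unconnected_h=6.0,
                        measure=lambda nodes: (0.95, 2.0))
print(cluster.master, cluster.events)
```

The measurement is taken after the node has been removed, which matches the order of steps S606 and S607.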
In an implementation scenario, the distributed cluster expansion method may be as shown in fig. 7. Fig. 7 is a flowchart of yet another distributed cluster expansion method according to an exemplary embodiment.
In step S701, a standby node cluster is configured.
In step S702, the related information of the first node in the master node cluster is detected periodically.
In step S703, a performance state of the first node is acquired, including a CPU usage rate, a memory usage rate, and a hard disk usage amount.
In step S704, it is determined whether the performance state is greater than the second performance threshold (e.g., 70%) and less than the first performance threshold, and whether the performance state duration is greater than the third duration threshold.
In step S705, when the performance state is greater than the second performance threshold (e.g., 70%) and less than the first performance threshold, and the performance state duration is greater than the third duration threshold, a first alert prompt giving early warning of performance usage is generated and reported.
In step S706, it is determined whether the performance state of the master node cluster is greater than a first state threshold (e.g., 90%) and the performance state duration exceeds a second duration threshold.
In step S707, when the performance state of the master node cluster is greater than the first state threshold (e.g., 90%) and the performance state duration exceeds the second duration threshold, a node is selected from the standby node cluster and added to the master node cluster.
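Because the performance state acquired in step S703 covers several metrics, the threshold checks in steps S704 and S706 may be applied metric by metric. The Python sketch below shows one possible arrangement. The priority order, the function names, and the treatment of hard disk usage as a ratio between 0 and 1 are assumptions, and the duration checks are left out to keep the example short.

```python
from typing import Dict, List

# Hypothetical priority order: a lower number is reported first.
PRIORITY = {"cpu": 0, "memory": 1, "disk": 2}


def _by_priority(names: List[str]) -> List[str]:
    return sorted(names, key=lambda name: PRIORITY.get(name, len(PRIORITY)))


def metrics_in_band(state: Dict[str, float], low: float, high: float) -> List[str]:
    """Metrics strictly between the second and first thresholds (warning band)."""
    return _by_priority([name for name, usage in state.items() if low < usage < high])


def metrics_above(state: Dict[str, float], high: float) -> List[str]:
    """Metrics strictly above the first threshold (expansion candidates)."""
    return _by_priority([name for name, usage in state.items() if usage > high])


state = {"disk": 0.75, "cpu": 0.93, "memory": 0.80}
warn = metrics_in_band(state, low=0.70, high=0.90)
overload = metrics_above(state, high=0.90)
print(warn, overload)
```

Sorting by priority means the most important metric is examined first, whatever order the samples arrive in.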
Based on the same inventive concept, the invention also provides a distributed cluster capacity expansion device applied to a master node cluster. A standby node cluster is also deployed in the distributed storage system in which the master node cluster is deployed.
Fig. 8 is a block diagram of a distributed cluster expansion device according to an exemplary embodiment. As shown in fig. 8, the distributed cluster capacity expansion device includes a connection judging unit 801, a first time length judging unit 802, a screening unit 803, a monitoring unit 804, and a capacity expansion unit 805.
A connection judging unit 801, configured to judge whether a first node in the master node cluster is successfully connected.
A first time length judging unit 802, configured to judge, when the first node cannot be connected, whether a first time length of the first node in the unconnected state is greater than or equal to a first time length threshold.
The screening unit 803 is configured to, when the first time length is greater than the first time length threshold, remove the first node.
A monitoring unit 804, configured to monitor a performance state of the master node cluster after the first node is removed.
A capacity expansion unit 805, configured to select a node from the standby node cluster and add the node to the master node cluster when the performance state is greater than the first state threshold and the performance state duration exceeds the second duration threshold.
In one embodiment, the screening unit 803 includes a node retaining unit, configured to retain the first node when the first time length is less than the first time length threshold.
In another embodiment, the apparatus further comprises: a generating unit, configured to generate a first alert prompt giving early warning of performance usage when the performance state is greater than the second performance threshold and less than the first performance threshold and the performance state duration is greater than the third duration threshold; and a reporting unit, configured to report the first alert prompt to the distributed storage system.
In yet another embodiment, the apparatus further comprises a selecting unit, configured not to select a node from the standby node cluster when the performance state is greater than the first performance threshold but the performance state duration does not exceed the second duration threshold.
In yet another embodiment, the apparatus further comprises: a statistics unit, configured to reconnect to the first node while it is in the unconnected state and to count the number of connections; and a sending unit, configured to send, to the distributed storage system, a second alert prompt indicating that the first node cannot be connected when the number of connections is greater than the specified number of times.
In a further embodiment, the statistics unit comprises: an accumulating unit, configured to add 1 to the number of connections to the first node after the current connection to the first node fails; and a control unit, configured to reconnect to the first node after the specified duration interval.
In yet another embodiment, the performance state includes at least any one or more of the following: CPU usage, memory usage and hard disk usage.
For the specific limitations and beneficial effects of the distributed cluster capacity expansion device, refer to the limitations of the distributed cluster capacity expansion method above; they are not repeated here. The modules described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and execute the corresponding operations.
Fig. 9 is a schematic diagram of a hardware structure of a computer device according to an exemplary embodiment. As shown in fig. 9, the device includes one or more processors 910 and a memory 920, where the memory 920 includes persistent memory, volatile memory, and a hard disk, one processor 910 being illustrated in fig. 9. The apparatus may further include: an input device 930, and an output device 940.
The processor 910, memory 920, input device 930, and output device 940 may be connected by a bus or by other means; connection by a bus is taken as the example in fig. 9.
The processor 910 may be a central processing unit (Central Processing Unit, CPU). The processor 910 may also be any other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 920, as a non-transitory computer readable storage medium that includes persistent memory, volatile memory, and a hard disk, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the distributed cluster expansion method in the embodiment of the present application. By running the non-transitory software programs, instructions, and modules stored in the memory 920, the processor 910 executes the various functional applications and data processing of the server, that is, implements any of the distributed cluster expansion methods described above.
Memory 920 may include a program storage area and a data storage area. The program storage area may store an operating system and an application required for at least one function; the data storage area may store data used as needed. In addition, memory 920 may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 920 may optionally include memory located remotely from processor 910, which may be connected to the data processing apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 930 may receive input numeric or character information and generate key signal inputs related to user settings and function control. The output device 940 may include a display device such as a display screen.
One or more modules are stored in the memory 920 that, when executed by the one or more processors 910, perform the methods illustrated in fig. 1-7.
The product can execute the method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. Technical details which are not described in detail in the present embodiment can be found in particular in the relevant description of the embodiments shown in fig. 1 to 7.
The embodiment of the invention also provides a non-transitory computer storage medium storing computer executable instructions that can execute the distributed cluster expansion method in any of the method embodiments above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD), or a Solid State Drive (SSD); the storage medium may also comprise a combination of memories of the kinds described above.
It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations or modifications in different forms can be made by those of ordinary skill in the art on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or modifications derived from the above remain within the scope of protection of the invention.

Claims (10)

1. A distributed cluster capacity expansion method, characterized in that the method is applied to a master node cluster, and a standby node cluster is also deployed in the distributed storage system in which the master node cluster is deployed; the method comprises the following steps:
judging whether a first node in the master node cluster is successfully connected;
when the first node cannot be connected, judging whether a first time length of the first node in an unconnected state is greater than or equal to a first time length threshold;
removing the first node when the first time length is greater than the first time length threshold;
monitoring the performance state of the master node cluster after the first node is removed;
and when the performance state is greater than a first state threshold and the performance state duration exceeds a second duration threshold, selecting a node in the standby node cluster and adding the node to the master node cluster.
2. The method of claim 1, wherein prior to monitoring the performance state of the master node cluster after removing the first node, the method further comprises:
retaining the first node when the first time length is smaller than the first time length threshold.
3. The method according to claim 1 or 2, wherein after monitoring the performance state of the master node cluster after removing the first node, the method further comprises:
when the performance state is greater than a second performance threshold and less than a first performance threshold and the performance state duration is greater than a third duration threshold, generating a first alarm prompt giving early warning of performance usage;
and reporting the first alarm prompt to the distributed storage system.
4. A method according to claim 3, characterized in that the method further comprises:
and when the performance state is greater than the first performance threshold but the performance state duration does not exceed the second duration threshold, not selecting a node in the standby node cluster.
5. The method according to claim 1 or 2, wherein before determining whether the first time length of the first node in the unconnected state is greater than or equal to a first time length threshold, the method further comprises:
reconnecting to the first node while it is in the unconnected state, and counting the number of connections;
and when the number of connections is greater than a specified number of times, sending, to the distributed storage system, a second alarm prompt indicating that the first node cannot be connected.
6. The method of claim 5, wherein reconnecting to the first node while it is in the unconnected state and counting the number of connections comprises:
after the current connection to the first node fails, adding 1 to the number of connections to the first node;
and reconnecting to the first node after a specified duration interval.
7. The method of claim 1, wherein the performance state comprises at least any one or more of the following: CPU usage, memory usage and hard disk usage.
8. A distributed cluster capacity expansion device, characterized in that the device is applied to a master node cluster, and a standby node cluster is also deployed in the distributed storage system in which the master node cluster is deployed; the device comprises:
a connection judging unit, configured to judge whether a first node in the master node cluster is successfully connected;
a first time length judging unit, configured to judge whether a first time length of the first node in an unconnected state is greater than or equal to a first time length threshold when the first node cannot be connected;
a screening unit, configured to remove the first node when the first time length is greater than the first time length threshold;
a monitoring unit, configured to monitor the performance state of the master node cluster after the first node is removed;
and a capacity expansion unit, configured to select a node from the standby node cluster and add the node to the master node cluster when the performance state is greater than a first state threshold and the performance state duration exceeds a second duration threshold.
9. A computer device comprising a memory and a processor, the memory and the processor being communicatively coupled to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the distributed cluster expansion method of any of claims 1-7.
10. A computer readable storage medium having stored thereon computer instructions for causing the computer to perform the distributed cluster expansion method of any of claims 1-7.
CN202111275738.5A 2021-10-29 2021-10-29 Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium Active CN114168071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111275738.5A CN114168071B (en) 2021-10-29 2021-10-29 Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium


Publications (2)

Publication Number Publication Date
CN114168071A CN114168071A (en) 2022-03-11
CN114168071B true CN114168071B (en) 2023-11-03

Family

ID=80477825


Country Status (1)

Country Link
CN (1) CN114168071B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115361281B (en) * 2022-08-19 2023-09-22 浙江极氪智能科技有限公司 Processing method, device, equipment and medium for expanding capacity of multiple cloud cluster nodes



Also Published As

Publication number Publication date
CN114168071A (en) 2022-03-11

Similar Documents

Publication Publication Date Title
CN113014634B (en) Cluster election processing method, device, equipment and storage medium
EP3335120B1 (en) Method and system for resource scheduling
CN110830283B (en) Fault detection method, device, equipment and system
CN110740072B (en) Fault detection method, device and related equipment
CN106533805B (en) Micro-service request processing method, micro-service controller and micro-service architecture
US10846186B2 (en) Central processing unit CPU hot-remove method and apparatus, and central processing unit CPU hot-add method and apparatus
CN109274544B (en) Fault detection method and device for distributed storage system
CN107508694B (en) Node management method and node equipment in cluster
CN112671928B (en) Equipment centralized management architecture, load balancing method, electronic equipment and storage medium
CN110677480B (en) Node health management method and device and computer readable storage medium
CN114168071B (en) Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium
CN114531373A (en) Node state detection method, node state detection device, equipment and medium
CN109510730B (en) Distributed system, monitoring method and device thereof, electronic equipment and storage medium
CN113489149A (en) Power grid monitoring system service master node selection method based on real-time state perception
CN113765690A (en) Cluster switching method, system, device, terminal, server and storage medium
CN111614701B (en) Distributed cluster and container state switching method and device
CN111309515B (en) Disaster recovery control method, device and system
JP2007280155A (en) Reliability improving method in distributed system
CN113778763B (en) Intelligent switching method and system for three-way interface service faults
CN113596195B (en) Public IP address management method, device, main node and storage medium
CN113568781B (en) Database error processing method and device and database cluster access system
CN111934909B (en) Main-standby machine IP resource switching method, device, computer equipment and storage medium
CN113157493A (en) Backup method, device and system based on ticket checking system and computer equipment
CN112612652A (en) Distributed storage system abnormal node restarting method and system
CN112068935A (en) Method, device and equipment for monitoring deployment of Kubernetes program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant