CN114168071A - Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium - Google Patents

Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium

Info

Publication number
CN114168071A
Authority
CN
China
Prior art keywords
node
cluster
duration
performance
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111275738.5A
Other languages
Chinese (zh)
Other versions
CN114168071B (en
Inventor
赵晓青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Inspur Data Technology Co Ltd
Original Assignee
Jinan Inspur Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Inspur Data Technology Co Ltd filed Critical Jinan Inspur Data Technology Co Ltd
Priority to CN202111275738.5A priority Critical patent/CN114168071B/en
Publication of CN114168071A publication Critical patent/CN114168071A/en
Application granted granted Critical
Publication of CN114168071B publication Critical patent/CN114168071B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Retry When Errors Occur (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention provides a distributed cluster capacity expansion method, a distributed cluster capacity expansion device, and a medium. The method is applied to a master node cluster deployed in a distributed storage system in which a standby node cluster is also deployed. The method comprises the following steps: judging whether a first node in the master node cluster is successfully connected; when the first node cannot be connected, judging whether a first duration for which the first node has been in the unconnected state is greater than or equal to a first duration threshold; when the first duration is greater than the first duration threshold, removing the first node; monitoring the performance state of the master node cluster after the first node is removed; and when the performance state is greater than a first state threshold and the duration of the performance state exceeds a second duration threshold, selecting one node from the standby node cluster and adding it to the master node cluster. Based on the connection state of the first node, capacity reduction or capacity expansion can be performed on the master node cluster automatically, so that the master node cluster can meet load requirements.

Description

Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium
Technical Field
The invention relates to the technical field of internet, in particular to a distributed cluster capacity expansion method, a distributed cluster capacity expansion device and a medium.
Background
With the rapid development of cloud computing and big data technology, the data accumulated in production and daily life is growing exponentially, and mass storage technology is becoming an indispensable part of the development of the Internet.
In a distributed storage system, data is stored in a distributed manner across a plurality of independent devices, and the data is processed by each device. Each device may be understood as a node in the distributed storage system, and the collection of nodes is a node cluster. In practical applications, however, if some nodes in the node cluster become abnormal, a person has to identify them and decide whether to remove them. When the cluster load is high and an abnormal node is not handled in time, the overall pressure on the cluster tends to increase, and the cluster may crash.
In view of this, capacity expansion must be performed on the distributed storage system to ensure that it can operate normally. In the related art, however, the abnormal nodes have to be identified manually and the capacity expansion operation has to be carried out manually, so a great deal of manpower and effort is spent on checking the usage of the node cluster.
Disclosure of Invention
Therefore, the technical problem to be solved by the present invention is to overcome the defect in the prior art that capacity expansion of a distributed storage system is time-consuming and laborious because the usage of the node cluster has to be identified and checked manually. To that end, the invention provides a distributed cluster capacity expansion method, a distributed cluster capacity expansion device and a medium.
According to a first aspect, the present invention provides a distributed cluster capacity expansion method, which is applied to a master node cluster, wherein a standby node cluster is also deployed in the distributed storage system in which the master node cluster is deployed; the method comprises the following steps:
judging whether the first node in the master node cluster is successfully connected;
when the first node cannot be connected, judging whether the first duration of the first node in the unconnected state is greater than or equal to a first duration threshold;
when the first duration is greater than the first duration threshold, removing the first node;
monitoring the performance state of the master node cluster after the first node is removed;
and when the performance state is greater than a first state threshold and the duration of the performance state exceeds a second duration threshold, selecting a node from the standby node cluster and adding the node to the master node cluster.
In this way, abnormal nodes in the master node cluster can be identified automatically based on the connection state of the first node and handled accordingly, so that capacity reduction and capacity expansion can be performed on the master node cluster in time. This protects the master node cluster, improves the stability of data storage and the overall performance of the cluster, avoids serious losses caused by insufficient memory resources in the master node cluster, and helps reduce cost.
With reference to the first aspect, in a first implementation manner of the first aspect, before monitoring a performance state of the master node cluster after removing the first node, the method further includes:
and when the first duration is smaller than the first duration threshold, retaining the first node.
In this way, a first duration smaller than the first duration threshold indicates that the first node has only failed temporarily and can recover its connection within a short time. The first node is therefore retained, which avoids reducing the number of nodes and thereby placing excessive load on the other nodes in the master node cluster, which would affect the stability of data storage.
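The remove-or-retain decision described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `NodeAction` and `decide_node_action` are hypothetical, and because the text defines only the "greater than" (remove) and "smaller than" (retain) cases, treating a duration exactly equal to the threshold as "retain" is an assumption.

```python
from enum import Enum


class NodeAction(Enum):
    REMOVE = "remove"
    RETAIN = "retain"


def decide_node_action(unconnected_seconds: float,
                       first_duration_threshold: float) -> NodeAction:
    """Decide what to do with a node that is currently unconnected.

    Assumption: a duration exactly equal to the threshold is treated as
    'retain', since the source text does not define that case.
    """
    if unconnected_seconds > first_duration_threshold:
        return NodeAction.REMOVE  # long outage: drop the node from the cluster
    return NodeAction.RETAIN      # short outage: treated as a temporary failure
```

For instance, with a hypothetical threshold of 300 seconds, a node unreachable for 301 seconds would be removed, while one unreachable for 120 seconds would be kept.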
With reference to the first aspect or the first implementation manner of the first aspect, in a second implementation manner of the first aspect, after monitoring a performance state of the master node cluster after removing the first node, the method further includes:
when the performance state is greater than a second performance threshold and smaller than a first performance threshold, and the duration of the performance state is greater than a third duration threshold, generating a first alarm prompt as a performance usage warning;
and reporting the first alarm prompt to the distributed storage system.
In this way, reporting the first alarm prompt informs the distributed storage system that, after the first node was removed, the performance state of the master node cluster has become abnormal while processing data, that the load pressure is too high, and that the volume of data processing requests needs to be reduced, thereby providing early warning.
With reference to the second embodiment of the first aspect, in a third embodiment of the first aspect, the method further comprises:
and when the performance state is greater than the first performance threshold but the duration of the performance state does not exceed the second duration threshold, not selecting a node from the standby node cluster.
In this way, a performance state that is greater than the first performance threshold for no longer than the second duration threshold indicates that the master node cluster is only temporarily abnormal: although the number of nodes in the master node cluster has been reduced, the overall stability of the cluster can still be guaranteed.
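The threshold bands described in the second and third implementation manners can be summarized in one small classifier. This is a sketch under stated assumptions: `LoadPolicy`, `classify_load`, and the string labels are hypothetical names, the performance state is assumed to be a single utilization figure between 0 and 1, and a state exactly equal to the first performance threshold yields no action because the text does not define that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadPolicy:
    first_performance_threshold: float   # upper bound, e.g. 0.90 utilization
    second_performance_threshold: float  # lower warning bound, e.g. 0.75
    second_duration_threshold: float     # seconds above the upper bound before expanding
    third_duration_threshold: float      # seconds inside the warning band before alarming


def classify_load(performance_state: float, duration: float,
                  policy: LoadPolicy) -> str:
    """Return 'expand', 'warn', or 'no_action' for one observation."""
    if performance_state > policy.first_performance_threshold:
        # Overloaded: expand only if the overload has lasted long enough.
        if duration > policy.second_duration_threshold:
            return "expand"
        return "no_action"  # temporary spike, no standby node is selected
    in_warning_band = (policy.second_performance_threshold
                       < performance_state
                       < policy.first_performance_threshold)
    if in_warning_band and duration > policy.third_duration_threshold:
        return "warn"  # corresponds to the first alarm prompt
    return "no_action"
```

Keeping the thresholds in one immutable policy object makes it easier to see that the warning band sits strictly below the expansion threshold.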
With reference to the first aspect or the first implementation manner of the first aspect, in a fourth implementation manner of the first aspect, before determining whether the first duration that the first node is in the unconnected state is greater than or equal to a first duration threshold, the method further includes:
reconnecting the first node when the first node is in an unconnected state, and counting the number of connection attempts;
and when the number of connection attempts is greater than a specified number, sending a second alarm prompt to the distributed storage system indicating that the first node is not connectable.
In this way, to prevent factors such as an unstable network environment from affecting the connection result, connection to the first node is attempted multiple times while the first node remains unconnected within the first duration threshold, which increases the likelihood that the first node is connected successfully. When the number of connection attempts is greater than the specified number, the first node is shown to be unconnectable and cannot be called while the master node cluster is running, so a second alarm prompt is generated and sent to the distributed storage system to make clear that the first node cannot be used.
With reference to the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the reconnecting the first node and counting the number of connections when the first node is in an unconnected state includes:
after the current connection attempt to the first node fails, adding 1 to the connection count of the first node;
and after a specified time interval, reconnecting the first node.
In this way, the connection count is incremented only after a connection attempt to the first node has failed, so each counted attempt is a confirmed failure and the count is accurate. Reconnecting the first node at the specified time interval makes the reconnection automatic.
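The retry behaviour of the fourth and fifth implementation manners can be sketched as a small loop. This is an illustrative sketch only: `reconnect_with_count`, `try_connect`, and `send_alarm` are hypothetical names, the alarm text is made up, and the `sleep` parameter is injected purely so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Tuple


def reconnect_with_count(
    try_connect: Callable[[], bool],
    specified_times: int,
    interval_seconds: float,
    send_alarm: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, int]:
    """Retry a connection until it succeeds or the count exceeds the limit.

    Returns (connected, connection_count), where connection_count is the
    number of failed attempts observed.
    """
    connection_count = 0
    while True:
        if try_connect():
            return True, connection_count
        connection_count += 1  # incremented only after a confirmed failure
        if connection_count > specified_times:
            send_alarm("second alarm: first node is not connectable")
            return False, connection_count
        sleep(interval_seconds)  # wait the specified interval, then retry
```

With `specified_times=3`, a node that never answers triggers the alarm on the fourth failed attempt, since the text requires the count to be greater than the specified number.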
With reference to the first aspect, in a sixth embodiment of the first aspect, the performance state includes at least any one or more of the following: CPU utilization rate, memory utilization rate and hard disk utilization rate.
According to a second aspect, the present invention provides a distributed cluster capacity expansion apparatus, which is applied to a master node cluster, wherein a standby node cluster is also deployed in the distributed storage system in which the master node cluster is deployed; the apparatus comprises:
a connection judging unit, configured to judge whether a first node in the master node cluster is successfully connected;
a first duration judging unit, configured to judge whether a first duration of an unconnected state of the first node is greater than or equal to a first duration threshold when the first node cannot be connected;
a screening unit, configured to remove the first node when the first duration is greater than a first duration threshold;
a monitoring unit, configured to monitor the performance state of the master node cluster after the first node is removed;
and a capacity expansion unit, configured to select one node from the standby node cluster and add the node to the master node cluster when the performance state is greater than a first state threshold and the duration of the performance state exceeds a second duration threshold.
With reference to the second aspect, in a first embodiment of the second aspect, the screening unit includes:
a node reserving unit, configured to reserve the first node when the first duration is smaller than a first duration threshold.
With reference to the second aspect or the first embodiment of the second aspect, in a second embodiment of the second aspect, the apparatus further comprises:
a generating unit, configured to generate a first alarm prompt as a performance usage warning when the performance state is greater than a second performance threshold and smaller than a first performance threshold, and the duration of the performance state is greater than a third duration threshold;
and a reporting unit, configured to report the first alarm prompt to the distributed storage system.
With reference to the second embodiment of the second aspect, in a third embodiment of the second aspect, the apparatus further comprises:
a selecting unit, configured to not select a node from the standby node cluster when the performance state is greater than the first performance threshold but the duration of the performance state does not exceed the second duration threshold.
With reference to the second aspect or the first embodiment of the second aspect, in a fourth embodiment of the second aspect, the apparatus further comprises:
a counting unit, configured to reconnect the first node when the first node is in an unconnected state, and count connection times;
and the sending unit is used for sending a second alarm prompt that the first node is not connectable to the distributed storage system when the connection times are greater than the specified times.
With reference to the second aspect, in a fifth embodiment of the second aspect, the counting unit includes:
an accumulation unit, configured to add 1 to the connection count of the first node after the current connection attempt to the first node fails;
and a control unit, configured to reconnect the first node after a specified time interval.
With reference to the second aspect, in a sixth embodiment of the second aspect, the performance state includes at least any one or more of the following: CPU utilization rate, memory utilization rate and hard disk utilization rate.
According to a third aspect, the present invention further provides a computer device, which includes a memory and a processor that are communicatively connected to each other, where the memory stores computer instructions and the processor executes the computer instructions to perform the distributed cluster capacity expansion method according to the first aspect or any of its optional embodiments.
According to a fourth aspect, the present invention further provides a computer-readable storage medium, which stores computer instructions for causing a computer to execute the distributed cluster capacity expansion method according to the first aspect or any of its optional embodiments.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a distributed cluster capacity expansion method according to an exemplary embodiment.
Fig. 2 is a flowchart of another distributed cluster capacity expansion method according to an exemplary embodiment.
Fig. 3 is a flowchart of a further distributed cluster capacity expansion method according to an exemplary embodiment.
Fig. 4 is a flowchart of a further distributed cluster capacity expansion method according to an exemplary embodiment.
Fig. 5 is a flowchart of a further distributed cluster capacity expansion method according to an exemplary embodiment.
Fig. 6 is a flowchart of a further distributed cluster capacity expansion method according to an exemplary embodiment.
Fig. 7 is a flowchart of a further distributed cluster capacity expansion method according to an exemplary embodiment.
Fig. 8 is a block diagram of a distributed cluster capacity expansion apparatus according to an exemplary embodiment.
Fig. 9 is a hardware configuration diagram of a computer device according to an exemplary embodiment.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In a distributed storage system, data is stored in a distributed manner in a plurality of independent devices, and the data is processed by each device. Each device may be understood as a node in the distributed storage system, and the collection of nodes is a node cluster.
In practical applications, if some nodes in a node cluster become abnormal, a person has to identify them and decide whether to remove them. When the cluster load is high and an abnormal node is not handled in time, the overall pressure on the cluster tends to increase, and the cluster may crash. Capacity expansion must therefore be performed on the distributed storage system to ensure that it can operate normally.
In the related art, however, the abnormal nodes have to be identified manually and the capacity expansion operation has to be carried out manually, so a great deal of manpower and effort is spent on checking the usage of the node cluster.
In order to solve the foregoing problems, an embodiment of the present invention provides a distributed cluster capacity expansion method for use in a distributed storage system. The execution subject of the method may be the master node cluster deployed in the distributed storage system, and may be implemented in software, hardware, or a combination of the two as part or all of a computer device. In the following method embodiments, a virtual node is taken as the execution subject by way of example.
In the distributed storage system of the embodiment of the present invention, a master node cluster and a standby node cluster are deployed. The nodes in the master node cluster are the nodes used for data processing, and the nodes in the standby node cluster are standby nodes for the master node cluster. With the distributed cluster capacity expansion method provided by the invention, the connection condition of the master node cluster can be determined from the connection state of the first node, and the first node is removed when it is not connectable, which protects the master node cluster. After the first node is removed, the performance state of the master node cluster is monitored automatically, which helps to detect abnormal load balancing in the distributed storage system in time. Capacity reduction or capacity expansion is then performed on the master node cluster in time, so that the master node cluster can meet load requirements while running, data can be processed normally, and the stability of data storage is guaranteed.
Fig. 1 is a flowchart of a distributed cluster capacity expansion method according to an exemplary embodiment. As shown in fig. 1, the distributed cluster capacity expansion method includes the following steps S101 to S105.
In step S101, it is determined whether the first node in the master node cluster is successfully connected.
In this embodiment of the present invention, the first node may be any node in the master node cluster. In one example, to facilitate automatic determination of the connection state of the first node, the connection state of the first node may be determined periodically by means of timing detection.
In step S102, when the first node cannot be connected, it is determined whether the first duration of the unconnected state of the first node is greater than or equal to a first duration threshold.
In the embodiment of the present invention, the first node being unable to connect indicates that the connection between the first node and the distributed storage system is abnormal, which affects communication between them. Possible reasons include: the first node is temporarily unreachable because of network instability; or the network environment of the first node has changed, so that the node is offline or no longer belongs to the master node cluster. To distinguish these cases, when the first node cannot be connected, it is judged whether the first duration for which the first node has been in the unconnected state is greater than or equal to the first duration threshold. Setting the first duration threshold bounds how long this judgment can take, so that it does not run too long and delay subsequent processing.
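One way to obtain the first duration is to remember when each node was first seen unconnected and to reset that record as soon as it connects again. The sketch below assumes this bookkeeping approach; the class name `UnconnectedTracker`, the node identifiers, and the caller-supplied timestamps are all hypothetical and are not taken from the patent.

```python
from typing import Dict


class UnconnectedTracker:
    """Tracks, per node, how long the node has been continuously unconnected."""

    def __init__(self) -> None:
        # node_id -> timestamp (seconds) of the first failed probe in the current outage
        self._down_since: Dict[str, float] = {}

    def record(self, node_id: str, connected: bool, now: float) -> float:
        """Record one probe result and return the current unconnected duration."""
        if connected:
            # Connection restored: forget the outage so a later one starts from zero.
            self._down_since.pop(node_id, None)
            return 0.0
        started = self._down_since.setdefault(node_id, now)
        return now - started
```

In practice `now` would typically come from a monotonic clock such as `time.monotonic()`, so that wall-clock adjustments do not distort the measured duration.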
In step S103, when the first duration is greater than the first duration threshold, the first node is removed.
In the embodiment of the present invention, when the first duration is greater than the first duration threshold, the distributed storage system cannot call the first node when performing data processing. To prevent an unconnectable first node from causing the master node cluster to crash, the first node is removed, which protects the master node cluster and improves the stability of data storage.
In step S104, the performance status of the master node cluster after the first node is removed is monitored.
In the embodiment of the present invention, after the first node is removed, the remaining nodes in the master node cluster may have insufficient memory resources when executing data processing, which would degrade the performance state of the master node cluster. The performance state of the master node cluster is therefore monitored after the first node is removed, so that an abnormal performance state can be detected in time.
In one example, the performance state may include any one or more of the following: CPU utilization, memory utilization, and hard disk utilization. The present invention does not limit the choice.
In step S105, when the performance status is greater than the first status threshold and the duration of the performance status exceeds the second duration threshold, one node is selected from the standby node cluster, and the node is added to the master node cluster.
In this embodiment of the present invention, the first state threshold may be understood as the maximum performance state at which the master node cluster remains stable when, while executing data processing, the memory resources it can call on are severely limited. The second duration threshold may be understood as the maximum duration for which the master node cluster can remain in an unstable state without crashing. If the performance state exceeds this maximum and the duration of the performance state exceeds the second duration threshold, this indicates that the memory resources available for data processing are seriously insufficient, that the load pressure on the master node cluster is too high, and that the cluster may crash as a result. Therefore, to avoid the losses caused by a crash of the master node cluster and to improve the stability of data storage, one node is selected from the standby node cluster and added to the master node cluster. Adding the node expands the capacity and storage resources of the master node cluster, which strengthens the stability of data storage and saves cost. For example, take a first state threshold of 90% CPU utilization across all CPUs in the master node cluster. When CPU utilization has exceeded 90% for longer than the second duration threshold, one node is selected from the standby node cluster and added to the master node cluster to expand its capacity.
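The 90% CPU example can be sketched by measuring how long the most recent run of samples has stayed above the threshold. This is only an illustration: the function names, the sample format of `(timestamp_seconds, utilization)` pairs, and the default 120-second second duration threshold are assumptions, since the patent gives no concrete duration.

```python
from typing import Iterable, Tuple


def sustained_seconds_above(samples: Iterable[Tuple[float, float]],
                            threshold: float) -> float:
    """Length of the latest unbroken run of samples above threshold.

    samples: (timestamp_seconds, utilization) pairs in ascending time order.
    """
    run_start = None
    last_ts = None
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts  # a new run above the threshold begins here
        else:
            run_start = None    # any dip at or below the threshold resets the run
        last_ts = ts
    if run_start is None:
        return 0.0
    return last_ts - run_start


def should_expand(samples: Iterable[Tuple[float, float]],
                  first_state_threshold: float = 0.90,
                  second_duration_threshold: float = 120.0) -> bool:
    """True when utilization has stayed above the threshold for long enough."""
    return sustained_seconds_above(samples, first_state_threshold) > second_duration_threshold
```

Resetting the run on every dip means that brief spikes separated by normal readings never accumulate into an expansion decision, which matches the requirement that the state persist beyond the second duration threshold.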
In one example, after a node is selected from the standby node cluster, related information about the node, such as its network card information and IP address, is acquired. When the node is subsequently added to the master node cluster and a connection is established with it, the specific node to connect to is then unambiguous, which avoids connecting to the wrong node and improves connection accuracy.
Through this embodiment, abnormal nodes in the master node cluster can be identified automatically based on the connection state of the first node and handled accordingly, so that capacity reduction and capacity expansion can be performed on the master node cluster in time. This protects the master node cluster, improves the stability of data storage and the overall performance of the cluster, avoids serious losses caused by insufficient memory resources in the master node cluster, and helps reduce cost.
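Selecting a standby node and capturing its identifying information might look like the sketch below. The `NodeInfo` fields, the `promote_standby` helper, and the "first available" selection policy are all assumptions for illustration; the patent states only that information such as network card details and an IP address is acquired.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NodeInfo:
    node_id: str
    ip_address: str  # used to address the node when connecting
    nic_name: str    # network card information, kept to identify the node


def promote_standby(master: List[NodeInfo],
                    standby: List[NodeInfo]) -> Optional[NodeInfo]:
    """Move one node from the standby pool into the master cluster.

    Returns the promoted node, or None when no standby node is available.
    """
    if not standby:
        return None
    node = standby.pop(0)  # assumption: simply take the first standby node
    master.append(node)
    return node
```

Returning the full `NodeInfo` record gives the caller the address it needs to connect to exactly the node that was promoted.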
In one embodiment, the distributed storage system has two mechanisms: a capacity expansion mechanism and a capacity reduction mechanism. The capacity reduction mechanism determines whether the current master node cluster needs a capacity reduction operation, so as to protect the master node cluster; for example, it performs steps S101 to S103. The capacity expansion mechanism determines whether the current master node cluster needs a capacity expansion operation, so as to ensure the overall stability of the master node cluster; for example, it performs steps S104 to S105.
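The split into two mechanisms can be sketched as two independent functions over the same cluster membership. This is a simplified illustration: nodes are represented by plain string identifiers, the function names are hypothetical, and the inputs (per-node unconnected durations, a single performance figure and its duration) are assumed to have been measured elsewhere.

```python
from typing import Dict, List


def capacity_reduction(master: List[str],
                       unconnected_seconds: Dict[str, float],
                       first_duration_threshold: float) -> List[str]:
    """Steps S101 to S103: remove nodes unconnected for longer than the threshold."""
    removed = [node for node in master
               if unconnected_seconds.get(node, 0.0) > first_duration_threshold]
    for node in removed:
        master.remove(node)
    return removed


def capacity_expansion(master: List[str], standby: List[str],
                       performance_state: float, performance_seconds: float,
                       first_state_threshold: float,
                       second_duration_threshold: float) -> bool:
    """Steps S104 to S105: add one standby node when overload persists."""
    overloaded = (performance_state > first_state_threshold
                  and performance_seconds > second_duration_threshold)
    if overloaded and standby:
        master.append(standby.pop(0))
        return True
    return False
```

Keeping the two mechanisms separate means the reduction pass can run on connection events while the expansion pass runs on performance samples.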
In another embodiment, the distributed storage system further includes a cluster performance monitoring module that runs as a daemon process. It periodically collects performance-related information for each first node in the master node cluster, including CPU, memory, and disk usage, and on that basis determines whether to automatically execute the capacity reduction mechanism or the capacity expansion mechanism.
Fig. 2 is a flowchart of another distributed cluster capacity expansion method according to an exemplary embodiment. As shown in fig. 2, the distributed cluster capacity expansion method includes the following steps.
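The periodic collection performed by such a monitoring module can be sketched as a simple polling loop. The names `run_monitor`, `collect`, and `handle` are hypothetical, as are the metric keys; `max_rounds` and the injectable `sleep` exist only so the loop can be stopped and exercised in isolation, whereas a real daemon would run until shut down.

```python
import time
from typing import Callable, Dict, Optional


def run_monitor(
    collect: Callable[[], Dict[str, float]],
    handle: Callable[[Dict[str, float]], None],
    interval_seconds: float,
    max_rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll performance metrics at a fixed interval and pass them to a handler.

    collect: returns one sample, e.g. {"cpu": 0.42, "memory": 0.61, "disk": 0.30}.
    handle:  decides whether capacity reduction or expansion should run.
    Returns the number of rounds completed.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        handle(collect())
        rounds += 1
        sleep(interval_seconds)  # wait until the next detection cycle
    return rounds
```

A handler passed as `handle` would apply the threshold checks to each sample and trigger the appropriate mechanism.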
In step S201, it is determined whether the first node in the master node cluster is successfully connected.
In step S202, when the first node cannot be connected, it is determined whether the first duration of the unconnected state of the first node is greater than or equal to a first duration threshold.
In step S2031, when the first duration is greater than the first duration threshold, the first node is removed.
In step S2032, when the first duration is less than the first duration threshold, the first node is retained.
In this embodiment of the present invention, a first duration smaller than the first duration threshold indicates that the first node has failed only temporarily and can recover its connection within a short time. The first node is therefore retained, which avoids placing excessive load on the other first nodes in the master node cluster, and thereby affecting the stability of data storage, after the number of first nodes is reduced.
In step S204, the performance status of the master node cluster after the first node is removed is monitored.
In step S205, when the performance status is greater than the first status threshold and the duration of the performance status exceeds the second duration threshold, one node is selected from the standby node cluster, and the node is added to the master node cluster.
With this embodiment, whether to retain the first node can be determined quickly from how long the first node has been unconnectable, which improves identification efficiency.
In an embodiment, steps S201-S2032 may be performed based on a capacity reduction mechanism. Based on the capacity expansion mechanism, steps S204-S205 are performed.
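The capacity reduction decision in steps S201 to S2032 can be sketched as follows. The `FirstNode` type, the function names, and the timestamps are hypothetical. One further assumption is marked in the code: the description judges "greater than or equal to" but removes on "greater than", so the sketch treats a duration equal to the threshold as a removal. The 5-hour threshold is the example value the description gives for the first duration threshold.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FirstNode:
    """A node of the master node cluster (hypothetical type)."""
    name: str
    # Time in seconds at which the node was first seen unconnected;
    # None while the node is connected.
    unconnected_since: Optional[float] = None


def should_remove(node: FirstNode, now: float, first_duration_threshold: float) -> bool:
    """Return True when the node has been unconnected for too long."""
    if node.unconnected_since is None:
        return False  # connected nodes are never removed
    first_duration = now - node.unconnected_since
    # Assumption: a duration exactly equal to the threshold counts as removal.
    return first_duration >= first_duration_threshold


def reduce_capacity(
    master: List[FirstNode], now: float, first_duration_threshold: float
) -> Tuple[List[FirstNode], List[FirstNode]]:
    """Split the master node cluster into retained and removed nodes."""
    retained: List[FirstNode] = []
    removed: List[FirstNode] = []
    for node in master:
        if should_remove(node, now, first_duration_threshold):
            removed.append(node)
        else:
            retained.append(node)
    return retained, removed


FIVE_HOURS = 5 * 3600  # example first duration threshold, in seconds
cluster = [
    FirstNode("a"),                             # connected
    FirstNode("b", unconnected_since=0.0),      # unconnected for 20000 s
    FirstNode("c", unconnected_since=15000.0),  # unconnected for 5000 s
]
retained, removed = reduce_capacity(cluster, now=20000.0, first_duration_threshold=FIVE_HOURS)
```

Here only node `b` has been unconnected for longer than the threshold, so it is the only node removed.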
Fig. 3 is a flowchart of a further distributed cluster capacity expansion method according to an exemplary embodiment. As shown in fig. 3, the distributed cluster capacity expansion method includes the following steps.
In step S301, it is determined whether the first node in the master node cluster is successfully connected.
In step S302, when the first node cannot be connected, it is determined whether the first duration of the unconnected state of the first node is greater than or equal to a first duration threshold.
In step S3031, when the first duration is greater than the first duration threshold, the first node is removed.
In step S3032, when the first duration is less than the first duration threshold, the first node is retained.
In step S304, the performance status of the master node cluster after the first node is removed is monitored.
In step S305, when the performance state is greater than the second performance threshold and less than the first performance threshold, and the performance state duration is greater than the third duration threshold, a first alarm prompt warning about performance usage is generated.
In this embodiment of the present invention, the second performance threshold may be understood as the lowest value of the performance state at which the stability of the master node cluster is not yet affected but the load pressure during data processing already exceeds the normal range. That is, a performance state above the second performance threshold indicates that handling data processing requests pushes the cluster beyond its own load balance, which may slow subsequent processing but does not affect the stability of the master node cluster. For example, if the first state threshold corresponds to CPU usage reaching 90% of the total CPU resources of the master node cluster, the second performance threshold may be 70% of those resources. The second performance threshold may be customized according to actual requirements, and the present invention is not limited in this respect.
When the performance state is greater than the second performance threshold and less than the first performance threshold, and the performance state duration is greater than the third duration threshold, this indicates that the load pressure of the master node cluster has exceeded the normal range and cannot return to it within a short time. A first alarm prompt is therefore generated and reported to the distributed storage system, informing it that, after the first node was removed, the performance state of the master node cluster is abnormal during data processing, the load pressure is too high, and the number of data processing requests needs to be reduced. This serves as an early warning.
In one example, if the performance state comprises several performance metrics, each metric may be monitored according to its priority, which may be user-defined.
In step S306, the alarm prompt is reported to the distributed storage system.
In step S307, when the performance status is greater than the first status threshold and the duration of the performance status exceeds the second duration threshold, one node is selected from the standby node cluster, and the node is added to the master node cluster.
In an embodiment, steps S301-S3032 may be performed based on a capacity reduction mechanism. Based on the capacity expansion mechanism, steps S304-S307 are performed.
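The threshold logic of steps S305 to S307 can be summarized as a small decision function. This is a sketch under stated assumptions: the `Action` enum, the `decide` function, and the duration values used in the example are hypothetical, and `held_for` is assumed to mean how long the performance state has stayed in its current range. The 90% and 70% defaults are the example values from the description.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"      # nothing to do
    WARN = "warn"      # generate and report the first alarm prompt
    EXPAND = "expand"  # add a standby node to the master node cluster


def decide(
    performance: float,
    held_for: float,
    *,
    second_duration_threshold: float,
    third_duration_threshold: float,
    first_threshold: float = 90.0,
    second_threshold: float = 70.0,
) -> Action:
    """Map a performance state and its duration to an action."""
    if performance > first_threshold:
        # Expansion requires the overload to persist.
        if held_for > second_duration_threshold:
            return Action.EXPAND
        return Action.NONE
    if second_threshold < performance < first_threshold:
        # Warning range: elevated load that has not yet reached the first threshold.
        if held_for > third_duration_threshold:
            return Action.WARN
        return Action.NONE
    return Action.NONE


# Hypothetical duration thresholds, in seconds.
limits = dict(second_duration_threshold=600.0, third_duration_threshold=300.0)
sustained_overload = decide(95.0, 700.0, **limits)
sustained_pressure = decide(80.0, 400.0, **limits)
```

Keeping the two ranges separate means a cluster in the warning range only raises an alarm and never triggers expansion on its own.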
Fig. 4 is a flowchart of a further distributed cluster capacity expansion method according to an exemplary embodiment. As shown in fig. 4, the distributed cluster capacity expansion method includes the following steps.
In step S401, it is determined whether the first node in the master node cluster is successfully connected.
In step S402, when the first node cannot be connected, it is determined whether the first duration of the unconnected state of the first node is greater than or equal to a first duration threshold.
In step S4031, when the first duration is greater than the first duration threshold, the first node is removed.
In step S4032, when the first duration is less than the first duration threshold, the first node is retained.
In step S404, the performance status of the master node cluster after the first node is removed is monitored.
In step S405, when the performance state is greater than the second performance threshold and less than the first performance threshold, and the performance state duration is greater than the third duration threshold, a first alarm prompt warning about performance usage is generated.
In step S406, the alarm prompt is reported to the distributed storage system.
In step S4071, when the performance state is greater than the first state threshold and the duration of the performance state exceeds the second duration threshold, one node is selected from the standby node cluster, and the node is added to the master node cluster.
In step S4072, when the performance state is greater than the first performance threshold but the performance state duration does not exceed the second duration threshold, no node is selected from the standby node cluster.
In this embodiment of the present invention, a performance state greater than the first performance threshold whose duration does not exceed the second duration threshold indicates that the master node cluster is only temporarily abnormal. Although the number of first nodes in the master node cluster has been reduced, the overall stability of the master node cluster can still be guaranteed, so no capacity expansion is performed.
In an embodiment, steps S401-S4032 may be performed based on a capacity reduction mechanism. Based on the capacity expansion mechanism, steps S404-S4072 are performed.
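Distinguishing a temporary spike (step S4072) from a sustained overload (step S4071) requires tracking how long the performance state has stayed above the threshold. One way to derive that duration from periodic samples is sketched below. The `SustainedTimer` class, the sample values, and the 120-second second duration threshold are assumptions for illustration only.

```python
from typing import Optional


class SustainedTimer:
    """Tracks how long a condition has held without interruption (hypothetical helper)."""

    def __init__(self) -> None:
        self._since: Optional[float] = None  # time the condition started to hold

    def update(self, condition_holds: bool, now: float) -> float:
        """Record one sample and return the current uninterrupted duration."""
        if not condition_holds:
            self._since = None  # any sample below the threshold resets the clock
            return 0.0
        if self._since is None:
            self._since = now
        return now - self._since


FIRST_STATE_THRESHOLD = 90.0       # example value from the description
SECOND_DURATION_THRESHOLD = 120.0  # hypothetical, in seconds

# (timestamp in seconds, cluster usage in percent), sampled once a minute.
samples = [
    (0.0, 95.0), (60.0, 96.0), (120.0, 85.0),  # short spike, then recovery
    (180.0, 93.0), (240.0, 94.0), (300.0, 97.0), (360.0, 98.0),
]

timer = SustainedTimer()
expand_at = None
for now, usage in samples:
    held = timer.update(usage > FIRST_STATE_THRESHOLD, now)
    if held > SECOND_DURATION_THRESHOLD and expand_at is None:
        expand_at = now  # first moment at which expansion would be triggered
```

The first spike lasts only 60 seconds and is ignored, while the second run of high samples eventually exceeds the duration threshold and would trigger expansion.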
Fig. 5 is a flowchart of a further distributed cluster capacity expansion method according to an exemplary embodiment. As shown in fig. 5, the distributed cluster capacity expansion method includes the following steps.
In step S501, it is determined whether the first node in the master node cluster is successfully connected.
In step S502, when the first node cannot be connected, reconnection to the first node is attempted while it remains in the unconnected state, and the number of connection attempts is counted.
In this embodiment of the present invention, reconnecting to the first node helps prevent factors such as an unstable network environment from distorting the connection result, and thus increases the likelihood that the first node can be connected.
Counting the reconnections to the first node makes it possible to determine how many attempts were made to establish a connection with the first node within the first sub-threshold.
In step S503, when the number of connections is greater than the specified number, a second alarm prompt indicating that the first node is not connectable is sent to the distributed storage system.
In this embodiment of the present invention, a number of connections greater than the specified number indicates that the first node cannot be connected and therefore cannot be called while the master node cluster is running. A second alarm prompt is therefore sent to the distributed storage system so that it can determine that the first node is unusable.
In step S504, it is determined whether the first duration that the first node is in the unconnected state is greater than or equal to a first duration threshold.
In step S5051, when the first duration is greater than the first duration threshold, the first node is removed.
In step S5052, when the first duration is less than the first duration threshold, the first node is retained.
In step S506, the performance status of the master node cluster after the first node is removed is monitored.
In step S507, when the performance status is greater than the first status threshold and the duration of the performance status exceeds the second duration threshold, one node is selected from the standby node cluster, and the node is added to the master node cluster.
With this embodiment, setting a specified number of connection attempts prevents reconnection to the first node from going on for too long and slowing the master node cluster's response to incoming data processing requests.
In an embodiment, steps S501-S5052 may be performed based on a capacity reduction mechanism. Based on the capacity expansion mechanism, steps S506-S507 are performed.
In another embodiment, after the current connection attempt to the first node fails, the number of connections for the first node is increased by 1, and the first node is reconnected after a specified time interval. That is, each failed attempt adds 1 to the existing connection count, which helps keep the count accurate, and reconnecting after the specified interval makes the reconnection automatic.
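The count-and-retry behaviour described above can be sketched as a small loop. The function name `reconnect_with_count`, its parameters, and the injected `connect` and `sleep` callables are hypothetical; the description does not prescribe an interface. The limit of 10 in the example matches the specified number the description uses as an illustration.

```python
import time
from typing import Callable, Tuple


def reconnect_with_count(
    connect: Callable[[], bool],
    specified_times: int,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, int]:
    """Retry a connection, counting failures.

    Returns (connected, connection_count). When connected is False the count
    has exceeded specified_times, which is the point at which the caller
    would send the second alarm prompt.
    """
    connection_count = 0
    while True:
        if connect():
            return True, connection_count
        connection_count += 1  # add 1 after each failed attempt
        if connection_count > specified_times:
            return False, connection_count
        sleep(interval_s)  # wait the specified interval before reconnecting


# Example: the node comes back on the third attempt; sleeping is stubbed out.
outcomes = iter([False, False, True])
connected, failures = reconnect_with_count(
    connect=lambda: next(outcomes),
    specified_times=10,
    interval_s=30.0,
    sleep=lambda _seconds: None,
)
```

Passing `sleep` as a parameter keeps the retry interval configurable and lets the loop be exercised without real delays.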
In an implementation scenario, the distributed cluster capacity expansion method may be as shown in fig. 6. Fig. 6 is a flowchart of a further distributed cluster capacity expansion method according to an exemplary embodiment.
In step S601, a standby node cluster is configured.
In step S602, relevant information of a first node in the master node cluster is detected periodically.
In step S603, it is determined whether the number of connections is greater than a specified number (e.g., 10).
In step S604, when the number of connections is greater than the specified number, a second alarm prompt indicating that the first node is not connectable is sent to the distributed storage system.
In step S605, it is determined whether the first duration of the first node in the unconnected state is greater than or equal to a first duration threshold (e.g., 5 hours).
In step S606, the master node cluster is triggered to automatically execute the capacity reduction mechanism.
In step S607, it is determined whether the performance state of the master node cluster after the first node is removed is greater than a first state threshold (90%) and the duration of the performance state exceeds a second duration threshold.
In step S608, when it is determined that the performance status is greater than the first status threshold and the duration of the performance status exceeds the second duration threshold, the master node cluster is triggered to execute the automatic capacity expansion mechanism.
In step S609, a node is selected from the standby node cluster and added to the master node cluster.
In an embodiment, when the first node is connected normally, whether to select a node from the standby node cluster and add it to the master node cluster may be determined directly from the periodically detected performance state of the master node cluster, so that the overall performance of the master node cluster can be improved.
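The expansion step itself, moving one node from the standby node cluster into the master node cluster, can be sketched as below. The `expand_capacity` function and the node names are hypothetical, and taking the first available standby node is an assumption: the description only says that one node is selected and does not state a selection policy.

```python
from typing import List, Optional


def expand_capacity(master: List[str], standby: List[str]) -> Optional[str]:
    """Move one node from the standby cluster to the master cluster.

    Returns the name of the node that was added, or None when the standby
    cluster is empty and no expansion is possible.
    """
    if not standby:
        return None
    node = standby.pop(0)  # assumption: pick the first standby node
    master.append(node)
    return node


master_cluster = ["m1", "m2"]
standby_cluster = ["s1", "s2"]
added = expand_capacity(master_cluster, standby_cluster)
```

Returning `None` for an empty standby cluster lets the caller decide how to react, for example by raising an alarm, which is outside what the description specifies.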
In an implementation scenario, the distributed cluster capacity expansion method may be as shown in fig. 7. Fig. 7 is a flowchart of a further distributed cluster capacity expansion method according to an exemplary embodiment.
In step S701, a standby node cluster is configured.
In step S702, relevant information of the first node in the master node cluster is detected periodically.
In step S703, the performance state of the first node is obtained, which includes a CPU usage rate, a memory usage rate, and a hard disk usage rate.
In step S704, it is determined whether the performance state is greater than the second performance threshold (70%) and less than the first performance threshold, and whether the performance state duration is greater than the third duration threshold.
In step S705, when the performance state is greater than the second performance threshold (70%) and less than the first performance threshold, and the performance state duration is greater than the third duration threshold, a first alarm prompt warning about performance usage is generated and reported.
In step S706, it is determined whether the performance status of the master node cluster is greater than a first status threshold (90%) and the duration of the performance status exceeds a second duration threshold.
In step S707, when the performance state of the master node cluster is greater than the first state threshold (90%) and the duration of the performance state exceeds the second duration threshold, one node is selected from the standby node cluster and added to the master node cluster.
Based on the same inventive concept, the invention also provides a distributed cluster capacity expansion apparatus applied to the master node cluster. A standby node cluster is also deployed in the distributed storage system in which the master node cluster is deployed.
Fig. 8 is a block diagram of a distributed cluster capacity expansion apparatus according to an exemplary embodiment. As shown in fig. 8, the distributed cluster capacity expansion apparatus includes a connection determining unit 801, a first duration determining unit 802, a screening unit 803, a monitoring unit 804, and a capacity expansion unit 805.
A connection determining unit 801, configured to determine whether a first node in the master node cluster is successfully connected.
A first duration determining unit 802, configured to determine whether a first duration of the first node in the unconnected state is greater than or equal to a first duration threshold when the first node cannot be connected.
A screening unit 803, configured to remove the first node when the first duration is greater than the first duration threshold.
The monitoring unit 804 is configured to monitor a performance state of the master node cluster after the first node is removed.
The capacity expansion unit 805 is configured to select a node from the standby node cluster and add the node to the master node cluster when the performance state is greater than the first state threshold and the duration of the performance state exceeds the second duration threshold.
In one embodiment, the screening unit 803 includes a node retention unit, configured to retain the first node when the first duration is less than the first duration threshold.
In another embodiment, the apparatus further includes a generating unit, configured to generate a first alarm prompt warning about performance usage when the performance state is greater than the second performance threshold and less than the first performance threshold, and the performance state duration is greater than the third duration threshold; and a reporting unit, configured to report the alarm prompt to the distributed storage system.
In yet another embodiment, the apparatus further includes a selection unit, configured not to select a node from the standby node cluster when the performance state is greater than the first performance threshold but the performance state duration does not exceed the second duration threshold.
In yet another embodiment, the apparatus further includes a counting unit, configured to reconnect the first node and count the number of connections while the first node is in the unconnected state; and a sending unit, configured to send a second alarm prompt indicating that the first node is not connectable to the distributed storage system when the number of connections is greater than the specified number.
In yet another embodiment, the counting unit includes an accumulation unit, configured to add 1 to the number of connections for the first node after the current connection attempt to the first node fails; and a control unit, configured to reconnect the first node after a specified time interval.
In yet another embodiment, the performance state includes at least any one or more of the following: CPU utilization rate, memory utilization rate and hard disk utilization rate.
For the specific limitations and beneficial effects of the distributed cluster capacity expansion apparatus, refer to the limitations of the distributed cluster capacity expansion method above; they are not repeated here. The modules described above may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the corresponding operations.
Fig. 9 is a hardware configuration diagram of a computer device according to an exemplary embodiment. As shown in fig. 9, the device includes one or more processors 910 and a memory 920, where the memory 920 includes a persistent memory, a volatile memory, and a hard disk; fig. 9 takes one processor 910 as an example. The device may further include an input device 930 and an output device 940.
The processor 910, the memory 920, the input device 930, and the output device 940 may be connected by a bus or other means, and fig. 9 illustrates an example of a connection by a bus.
Processor 910 may be a Central Processing Unit (CPU). The Processor 910 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 920 is a non-transitory computer-readable storage medium, including a persistent memory, a volatile memory, and a hard disk, and can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the distributed cluster capacity expansion method in this embodiment of the present application. The processor 910 executes the various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 920, thereby implementing any one of the distributed cluster capacity expansion methods described above.
The memory 920 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data used as needed or desired, and the like. Further, the memory 920 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 920 may optionally include memory located remotely from the processor 910, which may be connected to a data processing device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 930 may receive input numeric or character information and generate key signal inputs related to user settings and function control. The output device 940 may include a display device such as a display screen.
One or more modules are stored in the memory 920 and, when executed by the one or more processors 910, perform the methods illustrated in fig. 1-7.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. Details of the technique not described in detail in the present embodiment may be specifically referred to the relevant description in the embodiments shown in fig. 1 to 7.
Embodiments of the present invention further provide a non-transitory computer storage medium storing computer-executable instructions that can execute the distributed cluster capacity expansion method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.
It should be understood that the above embodiments are given only for clarity of illustration and do not limit the implementations. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived from them remain within the scope of the invention.

Claims (10)

1. A distributed cluster capacity expansion method, characterized in that the method is applied to a master node cluster, and a standby node cluster is also deployed in the distributed storage system in which the master node cluster is deployed; the method comprises the following steps:
judging whether a first node in the master node cluster is successfully connected;
when the first node cannot be connected, judging whether a first duration for which the first node is in an unconnected state is greater than or equal to a first duration threshold;
when the first duration is greater than the first duration threshold, removing the first node;
monitoring the performance state of the master node cluster after the first node is removed;
and when the performance state is greater than a first state threshold and the duration of the performance state exceeds a second duration threshold, selecting a node from the standby node cluster and adding the node to the master node cluster.
2. The method of claim 1, wherein before monitoring the performance state of the master node cluster after the first node is removed, the method further comprises:
when the first duration is smaller than the first duration threshold, retaining the first node.
3. The method according to claim 1 or 2, wherein after monitoring the performance state of the master node cluster after the first node is removed, the method further comprises:
when the performance state is greater than a second performance threshold and smaller than a first performance threshold, and the duration of the performance state is greater than a third duration threshold, generating a first alarm prompt warning about performance usage;
and reporting the alarm prompt to the distributed storage system.
4. The method of claim 3, further comprising:
when the performance state is greater than the first performance threshold but the duration of the performance state does not exceed the second duration, not selecting a node from the standby node cluster.
5. The method according to claim 1 or 2, wherein before judging whether the first duration for which the first node is in the unconnected state is greater than or equal to the first duration threshold, the method further comprises:
reconnecting the first node while the first node is in the unconnected state, and counting the number of connections;
and when the number of connections is greater than a specified number, sending a second alarm prompt indicating that the first node is not connectable to the distributed storage system.
6. The method according to claim 5, wherein reconnecting the first node while the first node is in the unconnected state and counting the number of connections comprises:
after the current connection attempt to the first node fails, adding 1 to the number of connections for the first node;
and after a specified time interval, reconnecting the first node.
7. The method of claim 1, wherein the performance state comprises at least any one or more of the following: CPU utilization rate, memory utilization rate, and hard disk utilization rate.
8. A distributed cluster capacity expansion device, characterized in that the device is applied to a master node cluster, and a standby node cluster is also deployed in the distributed storage system in which the master node cluster is deployed; the device comprises:
a connection judging unit, configured to judge whether a first node in the master node cluster is successfully connected;
a first duration judging unit, configured to judge, when the first node cannot be connected, whether a first duration for which the first node is in an unconnected state is greater than or equal to a first duration threshold;
a screening unit, configured to remove the first node when the first duration is greater than the first duration threshold;
a monitoring unit, configured to monitor the performance state of the master node cluster after the first node is removed;
and a capacity expansion unit, configured to select a node from the standby node cluster and add the node to the master node cluster when the performance state is greater than a first state threshold and the duration of the performance state exceeds a second duration threshold.
9. A computer device comprising a memory and a processor, wherein the memory and the processor are communicatively connected to each other, the memory stores computer instructions, and the processor executes the computer instructions to perform the distributed cluster capacity expansion method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the distributed cluster capacity expansion method according to any one of claims 1 to 7.
CN202111275738.5A 2021-10-29 2021-10-29 Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium Active CN114168071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111275738.5A CN114168071B (en) 2021-10-29 2021-10-29 Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111275738.5A CN114168071B (en) 2021-10-29 2021-10-29 Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium

Publications (2)

Publication Number Publication Date
CN114168071A true CN114168071A (en) 2022-03-11
CN114168071B CN114168071B (en) 2023-11-03

Family

ID=80477825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111275738.5A Active CN114168071B (en) 2021-10-29 2021-10-29 Distributed cluster capacity expansion method, distributed cluster capacity expansion device and medium

Country Status (1)

Country Link
CN (1) CN114168071B (en)

Also Published As

Publication number Publication date
CN114168071B (en) 2023-11-03

CN113157493A (en) Backup method, device and system based on ticket checking system and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant