CN114546979A - Distributed storage system and management method, device and equipment thereof - Google Patents

Distributed storage system and management method, device and equipment thereof

Info

Publication number
CN114546979A
Authority
CN
China
Prior art keywords
data
node
distributed storage
risk
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210158459.9A
Other languages
Chinese (zh)
Inventor
杨光
吴海英
刘德华
蒋宁
冯仕炳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd filed Critical Mashang Xiaofei Finance Co Ltd
Priority to CN202210158459.9A priority Critical patent/CN114546979A/en
Publication of CN114546979A publication Critical patent/CN114546979A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a distributed storage system and a management method, apparatus and device thereof. The method comprises: an agent node acquires heartbeat data of a data node that is deployed in a distributed storage device and corresponds to the agent node; if it is determined from the heartbeat data that the data node is at risk of losing connection, the agent node acquires running state data of the distributed storage device; and the agent node sends the running state data to a management device in the distributed storage system, where the management device performs early warning processing according to the running state data when it determines from that data that the data node is at risk of losing connection. The embodiments reduce the risk of data nodes losing connection and improve the management efficiency and operational stability of the distributed storage system.

Description

Distributed storage system and management method, device and equipment thereof
Technical Field
The present invention relates to the field of big data technologies, and in particular, to a distributed storage system, and a management method, apparatus, and device thereof.
Background
With the advent of the big data era, Hadoop, a distributed system infrastructure developed by the Apache Foundation, has come into wide use. An important component of Hadoop is HDFS (Hadoop Distributed File System). HDFS is composed of one NameNode and several DataNodes. The NameNode is a service responsible for managing metadata such as the namespace and the block information that makes up files. A DataNode is a service responsible for storing the blocks that make up files and for reporting its own health information and its stored block information to the NameNode through heartbeats. The NameNode is deployed on physical device 1, each DataNode is deployed on a physical device 2, and physical devices 1 and 2 are different devices.
In practice, a large task may run on the physical device 2 where a DataNode is deployed, consuming CPU resources for a long time and writing a large amount of data to disk. This can prevent the DataNode from sending a heartbeat message to the NameNode within the preset interval, so that the NameNode determines that the DataNode has lost connection and prints the following in its log: "BLOCK* removeDeadDatanode: lost heartbeat from xxx". To avoid losing nodes, the prior art commonly collects node logs with the Kibana component and sends them to the Elasticsearch component. A user then queries Elasticsearch with the keyword "BLOCK* removeDeadDatanode"; if the keyword appears, a large task is present on the distributed node hosting the corresponding DataNode, and the user can take action to keep the DataNode from losing connection. This approach, however, requires introducing the Kibana and Elasticsearch components, which increases the complexity of the overall technology stack and the operation and maintenance overhead. A heavily loaded NameNode outputs a large volume of logs, and some of them may be lost while Kibana sends them to Elasticsearch, which affects subsequent processing. In addition, the approach relies on manual, active queries and is therefore inefficient: the DataNode may well have lost connection already by the time the user queries, so no action can be taken beforehand.
Disclosure of Invention
The application provides a distributed storage system and a management method, a management device and management equipment thereof, so that the risk of data node loss of connection is reduced and the management efficiency and stability of the distributed storage system are improved on the basis of not increasing the hardware cost.
In a first aspect, an embodiment of the present application provides a management method for a distributed storage system, which is applied to a proxy node, and the method includes:
acquiring heartbeat data of data nodes deployed in distributed storage equipment and corresponding to the agent nodes;
if the data node is determined to have the risk of losing connection according to the heartbeat data, acquiring running state data of the distributed storage equipment;
and sending the running state data to a management device in the distributed storage system, where the running state data is used by the management device to perform early warning processing when the management device determines, according to the running state data, that the data node is at risk of losing connection.
In this embodiment, the agent node is deployed in the distributed storage device corresponding to the data node, so the agent node can automatically acquire the heartbeat data of the data node. When the agent node determines from the heartbeat data that the data node is at risk of losing connection, it acquires the running state data of the distributed storage device on which it resides and sends that data to the management device in the distributed storage system. When the management device in turn determines from the running state data that the data node is at risk of losing connection, it performs early warning processing according to that data. The risk of a data node losing connection is thus identified and reported automatically, without human involvement, which improves identification efficiency, reduces the risk of data nodes losing connection, and improves the management efficiency and operational stability of the distributed storage system. Because no additional components need to be introduced, technical complexity and operation and maintenance cost are also reduced.
In a second aspect, an embodiment of the present application provides a management method for a distributed storage system, which is applied to a management device, and the method includes:
receiving operation state data of distributed storage equipment where the agent node is located, wherein the operation state data is sent by the agent node; the operating state data is sent by the agent node when the agent node determines that the data node deployed in the distributed storage equipment and corresponding to the agent node has a risk of losing connection;
determining whether the data node has a risk of loss of connection according to the running state data;
and under the condition that the data node has the risk of losing the connection, early warning processing is carried out according to the running state data.
In this embodiment, the agent node is deployed in the distributed storage device corresponding to the data node, so the agent node can send the running state data of the distributed storage device to the management device in the distributed storage system when it determines that the data node is at risk of losing connection. When the management device determines from the running state data that the data node is at risk of losing connection, it performs early warning processing according to that data. The risk of a data node losing connection is thus identified and reported automatically, with no data lost along the way and no human involvement, which improves identification efficiency, reduces the risk of data nodes losing connection, and improves the management efficiency and operational stability of the distributed storage system. Because no additional components need to be introduced, technical complexity and operation and maintenance cost are also reduced.
In a third aspect, an embodiment of the present application provides a management apparatus for a distributed storage system, where the management apparatus is applied to a proxy node, and the apparatus includes:
the first acquisition module is used for acquiring heartbeat data of data nodes which are deployed in the distributed storage equipment and correspond to the agent nodes;
the second acquisition module is used for acquiring the running state data of the distributed storage equipment if the data node is determined to have the risk of losing connection according to the heartbeat data;
and the sending module is used for sending the running state data to the management device in the distributed storage system, where the running state data is used by the management device to perform early warning processing when the management device determines, according to the running state data, that the data node is at risk of losing connection.
In a fourth aspect, an embodiment of the present application provides a management apparatus for a distributed storage system, where the management apparatus is applied to a management device, and the apparatus includes:
the receiving module is used for receiving the running state data of the distributed storage equipment where the proxy node is located, and the running state data is sent by the proxy node; the operating state data is sent by the agent node when the agent node determines that the data node deployed in the distributed storage equipment and corresponding to the agent node has the risk of losing connection;
the determining module is used for determining whether the data node has the risk of loss of connection according to the running state data;
and the early warning module is used for carrying out early warning processing according to the operation state data under the condition that the data node has the risk of losing the link.
In a fifth aspect, an embodiment of the present application provides a distributed storage system, including: a management device and a plurality of distributed storage devices; data nodes and agent nodes corresponding to the data nodes are deployed in the distributed storage equipment;
the agent node is used for acquiring heartbeat data of the correspondingly deployed data nodes; if the data node is determined to have the risk of losing connection according to the heartbeat data, acquiring running state data of the distributed storage equipment; sending the running state data to the management equipment;
the management equipment is used for determining whether the data node has the risk of loss of connection according to the received running state data; and under the condition that the data node has the risk of losing the connection, early warning processing is carried out according to the running state data.
In a sixth aspect, an embodiment of the present application provides an electronic device, including:
a processor; and
a memory arranged to store computer executable instructions configured for execution by the processor, the executable instructions comprising instructions for performing the steps in the management method of the distributed storage system described above.
In a seventh aspect, an embodiment of the present application provides a storage medium, where the storage medium is used to store computer-executable instructions, and the computer-executable instructions cause a computer to execute the management method of the distributed storage system.
Drawings
To illustrate one or more embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some of the embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic view of a scenario of a management method of a distributed storage system according to an embodiment of the present application;
Fig. 2 is a first flowchart illustrating a management method of a distributed storage system according to an embodiment of the present application;
Fig. 3 is a second flowchart illustrating a management method of a distributed storage system according to an embodiment of the present application;
Fig. 4 is a third flowchart illustrating a management method of a distributed storage system according to an embodiment of the present application;
Fig. 5 is a fourth flowchart illustrating a management method of a distributed storage system according to an embodiment of the present application;
Fig. 6 is a fifth flowchart illustrating a management method of a distributed storage system according to an embodiment of the present application;
Fig. 7 is a sixth flowchart illustrating a management method of a distributed storage system according to an embodiment of the present application;
Fig. 8 is a schematic diagram illustrating a first module composition of a management apparatus of a distributed storage system according to an embodiment of the present application;
Fig. 9 is a schematic diagram illustrating a second module composition of a management apparatus of a distributed storage system according to an embodiment of the present application;
Fig. 10 is a schematic composition diagram of a distributed storage system according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of an electronic device according to one or more embodiments of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in one or more embodiments of the present application, the technical solutions in one or more embodiments of the present application will be clearly and completely described below with reference to the drawings in one or more embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments that can be derived by a person skilled in the art from one or more embodiments of the present application without inventive step shall fall within the scope of protection of this document.
Fig. 1 is a schematic view of an application scenario of a distributed storage system according to one or more embodiments of the present application. As shown in fig. 1, the scenario includes a management device and P distributed storage devices, P being an integer greater than 1; a data node (DataNode) and an agent node (Agent) corresponding to the data node are deployed in each distributed storage device. The management device and the distributed storage devices may be terminal devices, such as mobile phones, tablet computers, desktop computers or portable notebook computers. They may also be servers, such as an independent server or a server cluster composed of a plurality of servers.
Specifically, the agent node acquires heartbeat data of the data node correspondingly deployed in the distributed storage device and determines from the heartbeat data whether the data node is at risk of losing connection; if so, it acquires the running state data of the distributed storage device and sends that data to the management device. The management device determines from the received running state data whether the data node is at risk of losing connection and, if it is, performs early warning processing according to that data. The risk of a data node losing connection is thus identified and reported automatically, without human involvement, which improves identification efficiency, reduces the risk of data nodes losing connection, and improves the management efficiency and operational stability of the distributed storage system. Because no additional components need to be introduced, technical complexity and operation and maintenance cost are also reduced.
Based on the application scenario architecture, the embodiment of the application provides a management method of a distributed storage system. Fig. 2 is a flowchart illustrating a management method of a distributed storage system according to one or more embodiments of the present application, where the method in fig. 2 can be executed by a proxy node in fig. 1, as shown in fig. 2, and the method includes the following steps:
Step S102, acquiring heartbeat data of the data node that is deployed in the distributed storage device and corresponds to the agent node;
The distributed storage system provided by the present application is obtained by improving the existing HDFS. In addition to the management device and the distributed storage devices shown in fig. 1, it may further include a name node (NameNode), which may be deployed on a device different from both the management device and the distributed storage devices. The data node and the name node in this system are the same as in the existing HDFS: they are software services and retain their existing HDFS functions, that is, the data node sends heartbeat messages to the name node at a preset first time interval. When the agent node determines that a preset heartbeat data acquisition condition is met, it acquires the heartbeat data corresponding to the heartbeat messages sent by the data node that is deployed in the distributed storage device and corresponds to the agent node. Note that, because the management method provided by the present application does not involve operations of the name node, the name node is not shown in fig. 1.
Step S104, if the data node is determined to have the risk of losing connection according to the heartbeat data, acquiring running state data of the distributed storage equipment;
Specifically, if the agent node determines from the acquired heartbeat data that the data node is at risk of losing connection, it starts a designated analysis tool and acquires the running state data output by that tool. The analysis tool detects and outputs the running state data of the distributed storage device. In one embodiment, the analysis tool may be the iotop tool, which the agent node may start with the command iotop -botq --iter=3. Once started, iotop detects the running state data of the distributed storage device on which it runs and writes the detected data to a preset output path. After starting iotop, the agent node can obtain that output path and read the running state data from it. Note that the agent node is not limited to acquiring the running state data in this way; the manner can be set as needed in practice.
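As a rough illustration of the step described above, the following Python sketch shows how an agent might assemble the iotop command line and capture the tool's output. The function names are hypothetical, and reading standard output (rather than a preset output path, as in the embodiment) is an assumption made to keep the example short. iotop normally requires root privileges, so only the command-building part can be exercised without them.

```python
import subprocess
from typing import List


def build_iotop_command(iterations: int = 3) -> List[str]:
    """Build the iotop command line (hypothetical helper).

    -b: batch (non-interactive) mode, -o: only show active I/O,
    -t: add a timestamp to each line, -q: suppress header lines,
    --iter: number of sampling iterations before iotop exits.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return ["iotop", "-b", "-o", "-t", "-q", f"--iter={iterations}"]


def collect_running_state(iterations: int = 3, timeout_s: float = 30.0) -> str:
    """Run iotop and return its raw text output.

    Assumption: output is read from stdout instead of a preset output path.
    Raises CalledProcessError if iotop exits with a non-zero status.
    """
    result = subprocess.run(
        build_iotop_command(iterations),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout
```

Separating command construction from execution keeps the privileged call in one place and makes the command itself easy to check in isolation.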
The running state data may include the current system-wide read throughput and write throughput of the distributed storage device (unit: M/s), the read and write data volume of each currently running process in the distributed storage device, the disk usage of each currently running process, the read and write data volume of each YARN application currently running in the distributed storage device, the read and write data volume of each YARN task currently running in the distributed storage device, and the like. In Hadoop, YARN is the resource scheduling engine; each YARN application may run at least one YARN task, and each YARN task may correspond to a YARN container.
Step S106, sending the running state data to the management device in the distributed storage system, where the running state data is used by the management device to perform early warning processing when the management device determines, according to the running state data, that the corresponding data node is at risk of losing connection.
In order to enable the management device to quickly determine whether the corresponding data node has a risk of losing connection according to the operation state data, in one or more embodiments of the present application, the agent node sends the acquired operation state data and the device information of the distributed storage device to the management device together. Specifically, step S106 may include: and sending the running state data and the device information of the distributed storage device to a management device in the distributed storage system based on an HTTP protocol. Wherein, the device information may be a device identifier of the distributed storage device; the device identifier may be a device serial number, and the device identifier may also be a device identifier that is previously allocated to the distributed storage device according to an allocation manner of the device identifier. The processing procedure of the management device after receiving the operation status data can be referred to the related description below.
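The report described above can be sketched as an HTTP POST carrying the device identifier together with the running state data. The JSON field names (`device_id`, `running_state`), the example URL and the choice of JSON encoding are assumptions for illustration; the source only states that the data and the device information are sent together over HTTP.

```python
import json
import urllib.request
from typing import Any, Dict


def build_report_request(manager_url: str, device_id: str,
                         running_state: Dict[str, Any]) -> urllib.request.Request:
    """Build an HTTP POST that bundles device info with running state data.

    The payload layout is hypothetical; adapt it to the management device's API.
    """
    payload = {"device_id": device_id, "running_state": running_state}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        manager_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_report(request: urllib.request.Request, timeout_s: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

A call such as `send_report(build_report_request("http://manager.example:8080/report", "dev-01", state))` would deliver the report; the host and path here are placeholders.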
In one or more embodiments of the present application, the agent node is deployed in the distributed storage device corresponding to the data node, so the agent node can send the running state data of the distributed storage device to the management device in the distributed storage system when it determines that the data node is at risk of losing connection. When the management device determines from the running state data that the data node is at risk of losing connection, it performs early warning processing according to that data. The risk of a data node losing connection is thus identified and reported automatically, with no data lost along the way and no human involvement, which improves identification efficiency, reduces the risk of data nodes losing connection, and improves the management efficiency and operational stability of the distributed storage system. Because no additional components need to be introduced, technical complexity and operation and maintenance cost are also reduced.
In order to timely discover a data node with an offline risk, in one or more embodiments of the present application, each agent node may actively acquire heartbeat data of a correspondingly deployed data node according to a preset second time interval. Accordingly, step S102 may include the following steps S102-2 and S102-4:
step S102-2, sending a heartbeat data acquisition request to a data node which is deployed in the distributed storage equipment and corresponds to the agent node based on JMX according to a preset second time interval;
specifically, when determining that the heartbeat data acquisition time corresponding to the preset second time interval is reached, the proxy node determines that the preset heartbeat data acquisition condition is met, and sends a heartbeat data acquisition request to a data node, which is deployed in the distributed storage device and corresponds to the proxy node, based on JMX. Among them, JMX (Java Management Extensions) is a framework in which Management functions are embedded for applications, devices, systems, and the like. JMX can span a series of heterogeneous operating system platforms, system architectures and network transport protocols, and can flexibly develop seamlessly integrated system, network and service management applications. The data transmission method based on JMX can refer to the prior art, and is not described in detail in this application.
The duration of the second time interval is greater than the duration of the first time interval, for example, the duration of the first time interval is 10 seconds, the duration of the second time interval is 30 seconds or 1 minute, and the like, which can be set in practical application as required.
And step S102-4, receiving heartbeat data sent by the data node based on JMX.
Each time the data node sends a heartbeat message, it generates a corresponding sending log entry. When the data node receives a heartbeat data acquisition request from the agent node, it obtains the heartbeat data from the sending log of its heartbeat messages and sends that data to the agent node based on JMX, and the agent node receives it. The specific content of the heartbeat data may differ according to how the agent node determines whether the data node is at risk of losing connection; see the related description below.
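The interval-driven acquisition in steps S102-2 and S102-4 can be sketched as a simple polling loop. The JMX transport is abstracted behind a caller-supplied `fetch` callable, which is an assumption of this sketch; the function names and the bounded `rounds` parameter are likewise hypothetical and exist only to keep the example finite and testable.

```python
import time
from typing import Any, Callable, Dict, List


def check_intervals(first_interval_s: float, second_interval_s: float) -> None:
    """Enforce that the polling interval exceeds the heartbeat interval."""
    if second_interval_s <= first_interval_s:
        raise ValueError("second interval must be longer than first interval")


def poll_heartbeat_data(fetch: Callable[[], Dict[str, Any]],
                        interval_s: float,
                        rounds: int,
                        sleep: Callable[[float], None] = time.sleep
                        ) -> List[Dict[str, Any]]:
    """Call fetch() once per round, waiting interval_s between rounds.

    fetch stands in for the JMX request/response exchange with the data node.
    """
    samples: List[Dict[str, Any]] = []
    for index in range(rounds):
        samples.append(fetch())
        if index < rounds - 1:  # no wait needed after the final round
            sleep(interval_s)
    return samples
```

With a first interval of 10 seconds and a second interval of 30 seconds, as in the example values given in the text, each poll would normally observe about three new heartbeat messages from a healthy data node.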
In one or more embodiments of the present application, the management device may further send a status query request to each proxy node based on an HTTP protocol according to a preset second time interval, so as to actively query an operating status of each data node. Accordingly, step S102 may include the following steps S102-6 and S102-8:
step S102-6, if a state query request sent by the management equipment is received, sending a heartbeat data acquisition request to a data node which is deployed in the distributed storage equipment and corresponds to the agent node based on JMX;
specifically, when receiving a state query request sent by the management device, the agent node determines that a preset heartbeat data acquisition condition is met, and sends a heartbeat data acquisition request to a data node, which is deployed in the distributed storage device and corresponds to the agent node, based on JMX.
And step S102-8, receiving heartbeat data sent by the data node based on JMX.
Corresponding to step S102-6 and step S102-8, the method may further include: and if the proxy node determines that the corresponding data node has no loss risk according to the heartbeat data, the proxy node sends response data representing that the data node has no loss risk to the management equipment based on the HTTP.
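The query-driven variant can be sketched as a handler that the agent runs when a status query arrives from the management device. The three callables and the response field names (`at_risk`, `running_state`) are hypothetical; the source specifies only that a no-risk response is returned over HTTP when no risk is found, and that running state data is sent otherwise.

```python
from typing import Any, Callable, Dict


def handle_status_query(fetch_heartbeat: Callable[[], Dict[str, Any]],
                        has_risk: Callable[[Dict[str, Any]], bool],
                        collect_state: Callable[[], Dict[str, Any]]
                        ) -> Dict[str, Any]:
    """Answer a status query from the management device.

    fetch_heartbeat: stands in for the JMX exchange with the data node.
    has_risk: the agent's risk check applied to the heartbeat data.
    collect_state: gathers running state data (only called when at risk).
    """
    heartbeat = fetch_heartbeat()
    if not has_risk(heartbeat):
        # Response indicating the data node is not at risk of losing connection.
        return {"at_risk": False}
    return {"at_risk": True, "running_state": collect_state()}
```

Collecting the running state only on the at-risk path mirrors the ordering in the text, where the analysis tool is started after the risk has been detected.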
Whether the agent node actively acquires heartbeat data of the data node according to the second time interval or the management node actively inquires the running state of the data node according to the second time interval, the risk of losing connection of the data node can be timely found, and therefore relevant processing is conducted, and the running stability of the distributed storage system is guaranteed.
Considering that when a large task runs in a distributed storage device where a data node is located, the time for consuming CPU resources is more than 20 minutes, and the quantity of output data is more than 50G, at this time, the data node usually has a risk of losing connection, and the sending times of heartbeat messages of the data node may change accordingly. Based on this, in one or more embodiments of the present application, the agent node determines whether the data node has a risk of losing connection according to the cumulative sending times of the heartbeat messages after the data node is self-started. Specifically, as shown in fig. 3, step S102 may further include the following step S103-2 to step S103-6, and step S104 may include the following step S104-2 corresponding to step S103-2 to step S103-6:
Step S103-2, determining a first accumulated number of times according to the heartbeat data, wherein the first accumulated number of times is the number of heartbeat messages sent by the data node in a first time period, the starting time of the first time period is the start time of the data node, and the ending time of the first time period is the acquisition time of the currently acquired heartbeat data;
Optionally, the heartbeat data includes the first accumulated number of times; accordingly, step S103-2 includes: obtaining the first accumulated number of times from the heartbeat data. Alternatively, the heartbeat data includes a sending record of each heartbeat message sent by the data node in the first time period; accordingly, step S103-2 includes: counting the first accumulated number of times according to the sending records of the heartbeat messages included in the heartbeat data.
Step S103-4, reading a stored second accumulated number of times, wherein the second accumulated number of times is the number of heartbeat messages sent by the data node in a second time period, the starting time of the second time period is the start time of the data node, and the ending time of the second time period is the time at which the proxy node last acquired heartbeat data;
Specifically, when the agent node determines the first accumulated number of times for the first time, it saves the determined value; when heartbeat data of the data node is next acquired, the currently saved value is taken as the second accumulated number of times and is read. A new first accumulated number of times is then determined from the newly acquired heartbeat data, and, if it differs from the second accumulated number of times, the saved second accumulated number of times is updated to the newly determined first accumulated number of times, and so on. For example, if t1, t2, t3, and t4 are, in chronological order, the times at which the proxy node acquires heartbeat data, and the acquisition time of the currently acquired heartbeat data is t4, then the second accumulated number of times saved by the proxy node is the number of heartbeat messages sent from the start time of the data node to t3.
Step S103-6, determining whether the data node has the risk of loss of connection according to the first accumulated times and the second accumulated times;
Considering that a data node sending no heartbeat messages within the second time interval generally indicates that the data node has a risk of losing connection, in one or more embodiments of the present application, step S103-6 may include: determining that the data node has the risk of losing connection in the case that the first accumulated number of times is the same as the second accumulated number of times.
Specifically, the agent node determines whether the first accumulated number is the same as the second accumulated number, and determines that the data node has a risk of loss of connection under the condition that the first accumulated number is the same as the second accumulated number; and under the condition that the first accumulative times are larger than the second accumulative times, determining that the data node has no risk of losing connection.
Further, in practical applications, a data node may also have a risk of losing connection when it sends only a small number of heartbeat messages within the second time interval. Based on this, in one or more embodiments of the present application, the management device may further determine a reference number of times for sending heartbeat messages; the agent node obtains the reference number of times from the management device, and determines whether the data node has a risk of losing connection according to the first accumulated number of times, the second accumulated number of times, and the reference number of times. Specifically, step S103-6 may further include: sending a reference number acquisition request to the management device, and receiving the reference number of times sent by the management device. The reference number of times may be determined empirically in advance and preset in the management device, or may be determined by the management device in a preset manner, which can be set as required in practical applications. Accordingly, step S103-6 may include: calculating the difference between the first accumulated number of times and the second accumulated number of times; determining whether the calculated difference is smaller than the obtained reference number of times; determining that the data node has the risk of losing connection in the case that the difference is smaller than the reference number of times; and determining that the data node has no risk of losing connection in the case that the difference is not smaller than the reference number of times. For example, if the first accumulated number of times is 30, the second accumulated number of times is 26, and the reference number of times is 5, the calculated difference is 30 - 26 = 4, which is smaller than the reference number of times 5, so it is determined that the data node has the risk of losing connection.
Further, in step S103-6, in the case that the agent node determines that the data node does not have the risk of losing contact, the method further includes: and replacing the saved second accumulated times with the first accumulated times, and returning to the step S102-2 or the step S102-6.
Step S104-2, acquiring the running state data of the distributed storage device in the case that the data node has the risk of losing connection.
Therefore, the agent node can determine whether the data node has the risk of losing connection according to the cumulative number of heartbeat messages sent by the data node, so that further processing can be performed when the risk is determined to exist, and the running stability of the distributed storage system is guaranteed.
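As a minimal illustrative sketch of steps S103-2 to S103-6 (not the implementation of the present application), the comparison of the two accumulated numbers of times against a reference number of times might look as follows in Python. The class name `HeartbeatTracker`, its fields, and the default reference value of 1 are hypothetical; the sketch also assumes the saved value is refreshed after every acquisition, so that it always corresponds to the previous acquisition time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatTracker:
    """Hypothetical helper kept by the agent node for one data node."""

    # Assumed default: with a reference of 1, only "no new heartbeat" is flagged.
    reference_count: int = 1
    # The saved "second accumulated number of times" (None before the first poll).
    saved_count: Optional[int] = None

    def at_risk(self, first_count: int) -> bool:
        """Return True if the data node appears to risk losing connection."""
        if self.saved_count is None:
            # First acquisition: nothing to compare against yet, just save.
            self.saved_count = first_count
            return False
        difference = first_count - self.saved_count
        # Assumption: always keep the latest count so that the saved value
        # corresponds to the previous acquisition time.
        self.saved_count = first_count
        return difference < self.reference_count


tracker = HeartbeatTracker(reference_count=5)
tracker.at_risk(26)         # first poll, only saves the count
print(tracker.at_risk(30))  # 30 - 26 = 4, smaller than 5
```

With the values from the example above (30, 26, and a reference number of times of 5), the second call reports a risk, after which the agent node would proceed to step S104-2.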
In one or more embodiments of the present application, the agent node may further determine whether the data node has a risk of losing connection according to the sending times of the heartbeat message of the data node in the third time period. Specifically, as shown in fig. 4, step S102 may further include the following step S103-8 and step S103-10, and correspondingly, step S104 may include the aforementioned step S104-2:
step S103-8, determining the sending times of the heartbeat messages of the data nodes in the third time period according to the heartbeat data; the starting time of the third time interval is the time when the proxy node acquires heartbeat data last time, and the ending time of the third time interval is the acquiring time of the currently acquired heartbeat data;
optionally, the heartbeat data includes the number of times the data node sends the heartbeat message in a third time period (e.g. the time period from t3 to t4 in the foregoing example); correspondingly, the proxy node acquires the sending times of the heartbeat message of the data node in the third period from the heartbeat data. Or the heartbeat data comprises a sending record of the heartbeat message of the data node in a third time interval; correspondingly, the agent node counts the sending times of the heartbeat message of the data node in the third time period according to the sending record included in the heartbeat data.
Step S103-10, determining whether the data node has the risk of loss of connection according to the sending times;
Optionally, it is determined whether the number of times of sending is zero, and, in the case that it is zero, it is determined that the data node has a risk of losing connection. When the number of heartbeat messages determined from the heartbeat data is zero, this indicates that the data node did not send any heartbeat message in the third time period, so the data node can be determined to have the risk of losing connection.
Or, before step S103-10, the method may further include: and sending a reference frequency acquisition request to the management equipment, and receiving the reference frequency sent by the management equipment. Accordingly, step S103-10 may include: determining whether the sending times are smaller than the acquired reference times, and determining that the data node has the risk of loss of connection under the condition that the sending times are smaller than the reference times; and under the condition that the sending times are not less than the reference times, determining that the data node has no risk of loss of connection.
Further, when it is determined that the data node has no risk of losing connection, the method returns to step S102-2 or step S102-6.
Therefore, the agent node can determine whether the data node has the risk of losing the connection according to the sending times of the heartbeat message of the data node in the third time period, so that further processing can be carried out when the risk of losing the connection is determined to exist, and the operation stability of the distributed storage system is guaranteed.
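The check of steps S103-8 and S103-10 can be sketched in the same spirit. The function names, the representation of sending records as numeric timestamps, and the choice of an exclusive start and inclusive end for the third time period are all assumptions made for illustration only.

```python
from typing import Iterable, Optional


def count_in_period(send_times: Iterable[float], start: float, end: float) -> int:
    """Count heartbeat sending records that fall in the period (start, end]."""
    return sum(1 for t in send_times if start < t <= end)


def at_risk_in_period(count: int, reference_count: Optional[int] = None) -> bool:
    """Zero-count check when no reference is given, threshold check otherwise."""
    if reference_count is None:
        return count == 0
    return count < reference_count


# Hypothetical sending records (seconds); previous poll at t3=3, current at t4=11.
records = [1.0, 2.0, 3.0, 10.0, 11.0]
sent = count_in_period(records, start=3.0, end=11.0)
print(sent, at_risk_in_period(sent), at_risk_in_period(sent, reference_count=5))
```

Here two heartbeat messages fall in the third time period, so the zero-count variant reports no risk while the variant with a reference number of times of 5 reports a risk.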
In one or more embodiments of the present application, the data node may be controlled by the proxy node to perform data storage processing, and specifically, the method may further include:
if the data to be stored is obtained, sending a data storage request to a corresponding data node based on JMX, wherein the data storage request is used for requesting the data node to perform storage processing on the data to be stored; and receiving the storage result sent by the data node based on JMX.
Obtaining the data to be stored may include receiving the data to be stored sent by a client, or receiving the data to be stored sent by another device; correspondingly, when receiving the storage result sent by the data node, the proxy node may also send the storage result to the client or the other device. For the process by which the data node stores the data to be stored, reference may be made to existing storage processes of data nodes, which are not described in detail in this application.
In one or more embodiments of the application, the agent node is deployed in the distributed storage device corresponding to the data node, so that the agent node can automatically acquire heartbeat data of the data node and, when it determines from the heartbeat data that the data node has a risk of losing connection, acquire the running state data of the distributed storage device where it is located and send the data to the management device in the distributed storage system, so that the management device performs early warning processing according to the running state data when it determines from the running state data that the data node has a risk of losing connection. In this way, the risk of a data node losing connection is identified automatically and an early warning is issued automatically, without human participation, so that the identification efficiency is improved, the risk of data nodes losing connection is reduced, and the management efficiency and running stability of the distributed storage system are improved. In addition, because multiple additional components do not need to be introduced in this process, the technical complexity and the operation and maintenance cost are reduced.
On the basis of the same technical concept, corresponding to the management method of the distributed storage system described above, one or more embodiments of the present application further provide another management method of the distributed storage system, fig. 5 is a flowchart illustrating the management method of another distributed storage system provided by one or more embodiments of the present application, and the method in fig. 5 can be executed by the management apparatus in fig. 1; as shown in fig. 5, the method comprises the steps of:
step S202, receiving the running state data of the distributed storage equipment where the agent node is located, which is sent by the agent node; the running state data is sent when the agent node determines that the data node which is arranged in the distributed storage equipment and corresponds to the agent node has the risk of losing connection;
Specifically, when determining that a preset heartbeat data acquisition condition is met, the agent node acquires heartbeat data of the data node that is deployed in the distributed storage device and corresponds to the agent node, and determines whether the data node has a risk of losing connection according to the acquired heartbeat data; if so, the agent node acquires the running state data of the distributed storage device and sends the acquired running state data to the management device based on HTTP (Hypertext Transfer Protocol); and the management device receives the running state data sent by the agent node.
Step S204, determining whether the corresponding data node has the risk of loss of connection according to the running state data;
in order to improve the effectiveness of early warning processing, when the management equipment receives the running state data sent by the agent node, whether the corresponding data node has an offline risk or not is determined according to the running state data in a preset mode; and when the data node is determined to have the risk of losing the connection, early warning processing is carried out according to the operation state data.
And step S206, performing early warning processing according to the operation state data under the condition that the data node has the risk of losing contact.
In one or more embodiments of the application, the agent node is deployed in the distributed storage device corresponding to the data node, so that the agent node can send the running state data of the distributed storage device to the management device in the distributed storage system when determining that the data node has a risk of losing connection; and when the management device determines from the running state data that the data node has the risk of losing connection, it performs early warning processing according to the running state data. In this way, the risk of a data node losing connection is identified automatically and an early warning is issued automatically, no data is lost, and no human participation is needed, so that the identification efficiency is improved, the risk of data nodes losing connection is reduced, and the management efficiency and running stability of the distributed storage system are improved. In addition, because multiple additional components do not need to be introduced in this process, the technical complexity and the operation and maintenance cost are reduced.
In order to enable the agent node to determine whether there is a risk of losing connection with the corresponding data node according to the acquired heartbeat data, in one or more embodiments of the present application, step S202 may further include: receiving a reference frequency acquisition request sent by an agent node; and sending the determined reference times to the agent node, wherein the reference times are used for determining whether the data node has the risk of losing the link by the agent node.
Specifically, the management device receives the reference number acquisition request sent by the proxy node based on the HTTP protocol; the management device acquires a preset reference number of times, or determines the reference number of times in a preset manner, and sends the reference number of times to the proxy node based on HTTP (Hypertext Transfer Protocol); and when the agent node determines, according to the reference number of times and the acquired heartbeat data of the data node, that the data node has the risk of losing connection, it sends the acquired running state data of the distributed storage device to the management device based on HTTP.
Further, in order to facilitate the management device to quickly determine whether the data node has a risk of losing connection, in one or more embodiments of the present application, the agent node sends the device information and the running state data of the distributed storage device where the agent node is located to the management device together; and the management equipment generates an operation record of the distributed storage equipment according to the received equipment information and the operation state data, and determines whether the corresponding data node has the risk of loss of connection or not according to the operation record. Specifically, as shown in fig. 6, step S202 may include the following step S202-2:
and step S202-2, receiving the running state data and the device information of the distributed storage device where the proxy node is located, which are sent by the proxy node.
Specifically, the management device receives the running state data and the device information of the distributed storage device where the proxy node is located, which the proxy node sends based on the HTTP protocol. The device information may be a device identifier of the distributed storage device; the device identifier may be a device serial number, or an identifier allocated to the distributed storage device in advance according to a device identifier allocation scheme.
Corresponding to step S202-2, as shown in fig. 6, step S204 may include the following steps S204-2 to S204-10, and step S206 includes the following steps S206-2 and S206-4:
step S204-2, determining the receiving time of the running state data and the equipment information;
specifically, when the management device receives the operation state data and the device information sent by the proxy node, the system time of the management device is obtained, and the obtained system time is determined as the receiving time of the operation state data and the device information.
Step S204-4, generating and storing the operation record of the distributed storage equipment according to the determined receiving time, the received operation state data and the equipment information;
After the management device saves the generated operation record, it determines whether the saved operation records of the distributed storage device meet a preset condition, and, if so, determines that the corresponding data node has the risk of losing connection. Whether the saved operation records meet the preset condition may be determined through the following steps S204-6 to S204-10: when the number of target operation records in step S204-10 is greater than the preset number, the preset condition is determined to be met, that is, the corresponding data node is determined to have the risk of losing connection. The specific format of the operation record can be set as required in practical applications.
Step S204-6, determining the receiving time as the ending time point of the preset time length, and determining the starting time point of the preset time length;
the preset time period can be set in practical application according to needs, for example, 10 minutes.
Step S204-8, counting the number of target operation records with the time between the starting time point and the ending time point in the stored operation records of the corresponding distributed storage equipment according to the equipment information;
Specifically, the operation records associated with the received device information are screened from the stored operation records, and the screened operation records are determined as the operation records of the corresponding distributed storage device. For each operation record of the distributed storage device, it is determined whether its receiving time lies between the determined starting time point (inclusive) and the determined ending time point (inclusive); if so, the operation record is determined as a target operation record. The number of target operation records is then counted.
Step S204-10, determining that the corresponding data nodes have the risk of loss of connection under the condition that the number of the target operation records is larger than the preset number;
The preset number can be set as required in practical applications; for example, the preset number is 5.
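Steps S204-2 to S204-10 amount to counting, per device, the operation records received within a trailing time window. A minimal sketch under stated assumptions follows: the class `RunRecordStore`, its in-memory dictionary storage, and the default window of 10 minutes with a preset number of 5 are illustrative choices, not details fixed by the present application.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class RunRecordStore:
    """Hypothetical in-memory store of operation records on the management device."""

    def __init__(self, window=timedelta(minutes=10), preset_number=5):
        self.window = window                # the preset time length
        self.preset_number = preset_number  # the preset number of records
        self._records = defaultdict(list)   # device id -> [(receive_time, state_data)]

    def add_and_check(self, device_id, state_data, receive_time):
        """Save a record (S204-4), then count target records (S204-6 to S204-10)."""
        self._records[device_id].append((receive_time, state_data))
        start = receive_time - self.window  # starting time point of the window
        # Both end points are inclusive, as described for step S204-8.
        targets = [r for r in self._records[device_id]
                   if start <= r[0] <= receive_time]
        return len(targets) > self.preset_number, targets


store = RunRecordStore()
t0 = datetime(2022, 1, 1, 0, 0)
flags = [store.add_and_check("dev-1", {"rw_speed": 100.0}, t0 + timedelta(minutes=i))[0]
         for i in range(6)]
print(flags)  # only the sixth record within 10 minutes exceeds the preset number 5
```

Requiring several reports within the window, rather than reacting to a single one, is what lets the management device filter out isolated reports before issuing an early warning.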
Step S206-2, generating early warning data of the distributed storage equipment according to the target operation record;
The early warning data comprises one or more of the following: the total read-write speed of the distributed storage device, the N processes with the highest disk utilization rate in the distributed storage device, the N processes with the highest read-write speed in the distributed storage device, the N YARN applications with the highest read-write speed in the distributed storage device, and the N YARN tasks with the highest read-write speed in the distributed storage device, wherein N is an integer. The early warning data may further include the device information of the distributed storage device, the generation time of the early warning data, and the like. The specific content and form of the early warning data can be set as required in practical applications.
Specifically, determining the total read-write speed of the corresponding distributed storage device according to the target operation record may include: and acquiring the read-write speed of the distributed storage equipment from each target running record, calculating an average value of the acquired read-write speed, and determining the calculated average value as the total read-write speed of the distributed storage equipment.
Determining the N processes with the highest disk utilization rate in the corresponding distributed storage device according to the target operation record, which may include: and acquiring the disk utilization rate of each process operated in the distributed storage equipment from each target operation record, calculating the average disk utilization rate aiming at each operated process according to the acquired disk utilization rate, sequencing the average disk utilization rates of the processes to obtain the highest N average disk utilization rates, and determining the process corresponding to the highest N average disk utilization rates as the N processes with the highest disk utilization rates.
Determining the N processes with the maximum read-write speed in the corresponding distributed storage device according to the target operation record, which may include: acquiring the read-write speed of each process running in the distributed storage equipment from each target running record, calculating the average read-write speed according to the acquired read-write speed aiming at each running process, sequencing the average read-write speed of each process to obtain the highest N average read-write speeds, and determining the process corresponding to the highest N average read-write speeds as the N processes with the highest read-write speed.
Determining the N YARN applications with the highest read-write speed in the corresponding distributed storage device according to the target operation record may include: the method comprises the steps of obtaining the read-write speed of each YARN application running in the distributed storage equipment from each target running record, calculating the average read-write speed according to the obtained read-write speed aiming at each running YARN application, carrying out sequencing processing on the average read-write speed of each YARN application to obtain the highest N average read-write speeds, and determining the YARN application corresponding to the highest N average read-write speeds as the N YARN applications with the highest read-write speeds.
Determining the N YARN tasks with the highest read-write speed in the corresponding distributed storage device according to the target operation record may include: obtaining the read-write speed of each YARN task running in the distributed storage device from each target operation record, calculating an average read-write speed for each running YARN task according to the obtained read-write speeds, sorting the average read-write speeds of the YARN tasks to obtain the N highest average read-write speeds, and determining the YARN tasks corresponding to the N highest average read-write speeds as the N YARN tasks with the highest read-write speed. In addition to the N YARN tasks with the highest read-write speed, the N YARN containers with the highest read-write speed in the corresponding distributed storage device may also be determined according to the target operation record. The process of determining the N YARN containers is the same as the process of determining the N YARN tasks described above, and is not repeated here.
Step S206-4, storing the early warning data and sending the early warning data to a designated administrator, wherein the early warning data is used by the administrator to determine the large tasks running in the distributed storage device and to end those large tasks.
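The averaging and sorting described above for processes, YARN applications, and YARN tasks follow one pattern, sketched below. The record layout (a dictionary per target operation record with a device-level `rw_speed` and per-item mappings such as `process_disk_util`) and the function names are hypothetical; the actual format of an operation record is left open by the present application.

```python
from collections import defaultdict
from statistics import mean


def total_rw_speed(records):
    """Average the device-level read-write speed over the target operation records."""
    return mean(r["rw_speed"] for r in records)


def top_n_by_average(records, field, n):
    """Average a per-item metric across records and return the N highest items."""
    samples = defaultdict(list)
    for record in records:
        for name, value in record.get(field, {}).items():
            samples[name].append(value)
    averages = {name: mean(values) for name, values in samples.items()}
    # Sort by average value in descending order and keep the first N entries.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical target operation records for one distributed storage device.
target_records = [
    {"rw_speed": 100.0, "process_disk_util": {"a": 10, "b": 50}},
    {"rw_speed": 300.0, "process_disk_util": {"a": 30, "b": 70, "c": 5}},
]
print(total_rw_speed(target_records))
print(top_n_by_average(target_records, "process_disk_util", 2))
```

The same `top_n_by_average` call could be reused for per-process read-write speed, YARN applications, YARN tasks, or YARN containers by changing the `field` argument, assuming the records carry those mappings.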
Specifically, the early warning data is stored in a designated database, so that the early warning data of each distributed storage device in the distributed storage system is uniformly managed through the database, and other related processing is performed based on each early warning data in the database.
Further, the sending of the early warning data to the designated administrator may be sending the early warning data to a mailbox of the designated administrator by a mail, or to a mobile phone number of the designated administrator by a short message, or sending the early warning data to the designated system, and sending a notification message to the designated administrator by a mail or by a short message, where the notification message is used to notify the administrator to refer to the early warning data in the designated system. When the administrator looks up the early warning data, the administrator can determine the large tasks in the corresponding distributed storage devices based on the early warning data according to the operation and maintenance experience, for example, the YARN application with the highest read-write speed can be determined as the large tasks, and the administrator can perform rechecking processing based on other data in the early warning data to determine the final large tasks; the determination method of the large task is not specifically limited in the present application.
In practical applications, the YARN application with high read/write speed running in the distributed storage device usually has a large influence on the data nodes, which easily causes the data nodes to be disconnected. Based on this, in order to ensure effective operation of the data node, in one or more embodiments of the present application, the early warning data includes N YARN applications with the highest read/write speed running in the distributed storage device, and accordingly, after step S206-2, the method may further include:
determining application names of M target YARN applications with highest read-write speed in the N YARN applications included in the early warning data, wherein M is an integer and is smaller than N; and under the condition that the determined application name is contained in the preset white list, stopping the running of the target YARN application corresponding to the contained application name.
Optionally, the early warning data includes an application name and an application identifier of each YARN application of the N YARN applications with the highest read-write speed; correspondingly, the management equipment reads the application names and application identifications of the M target YARN applications with the highest read-write speed from the early warning data; the management equipment determines whether the read application name is contained in the preset white list, and calls a resource manager of the YARN application according to the application identifier corresponding to the application name under the condition that the read application name is contained in the white list, and the YARN application corresponding to the application identifier is stopped running through the resource manager of the YARN application. Or the early warning data comprises an application identifier of each YARN application in the N YARN applications with the highest read-write speed; correspondingly, the management equipment reads the application identifiers of the M target YARN applications with the highest read-write speed from the early warning data, and calls a resource manager of the YARN application to inquire the application names corresponding to the application identifiers according to the read application identifiers; the management equipment determines whether the preset white list contains the inquired application name, and in the case that the white list contains the inquired application name, the resource manager of the YARN application is called again according to the application identifier corresponding to the application name, and the target YARN application corresponding to the application identifier is stopped through the resource manager of the YARN application.
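The selection of the M target YARN applications and the whitelist check can be sketched as follows. This is only an illustration under stated assumptions: the tuple layout `(application identifier, application name, read-write speed)`, the function `stop_listed_apps`, and the `stop_app` callback are hypothetical, and the callback stands in for whatever call to the YARN resource manager a real deployment would use.

```python
from typing import Callable, Iterable, List, Set, Tuple

# (application identifier, application name, average read-write speed)
AppEntry = Tuple[str, str, float]


def stop_listed_apps(top_apps: Iterable[AppEntry], m: int, whitelist: Set[str],
                     stop_app: Callable[[str], None]) -> List[str]:
    """Stop those of the M fastest applications whose names are in the whitelist."""
    targets = sorted(top_apps, key=lambda app: app[2], reverse=True)[:m]
    stopped = []
    for app_id, app_name, _speed in targets:
        if app_name in whitelist:
            stop_app(app_id)  # hypothetical hook into the resource manager
            stopped.append(app_id)
    return stopped


# Hypothetical early warning entries with N = 3 and M = 2.
apps = [
    ("application_1_0001", "etl", 300.0),
    ("application_1_0002", "report", 500.0),
    ("application_1_0003", "adhoc", 100.0),
]
calls: List[str] = []
print(stop_listed_apps(apps, 2, {"etl", "adhoc"}, calls.append))
```

In this example only `etl` is both among the two fastest applications and contained in the whitelist, so it is the only application passed to the callback; `adhoc` is listed but falls outside the M targets.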
In one or more embodiments of the application, the agent node is deployed in the distributed storage device corresponding to the data node, so that the agent node can send the running state data of the distributed storage device to the management device in the distributed storage system when determining that the data node has a risk of losing connection; and when the management device determines from the running state data that the data node has the risk of losing connection, it performs early warning processing according to the running state data. In this way, the risk of a data node losing connection is identified automatically and an early warning is issued automatically, no data is lost, and no human participation is needed, so that the identification efficiency is improved, the risk of data nodes losing connection is reduced, and the management efficiency and running stability of the distributed storage system are improved. In addition, because multiple additional components do not need to be introduced in this process, the technical complexity and the operation and maintenance cost are reduced.
In a specific embodiment, taking an example that the proxy node determines whether the data node has the risk of losing connection according to the cumulative sending times and the reference times of the heartbeat messages, and the proxy node acquires the reference times after being started, as shown in fig. 7, the method may include:
Step S302, the proxy node is started and sends a reference number acquisition request to the management device based on HTTP;
Step S304, the management device receives the reference number acquisition request, determines the reference number of times, and sends the reference number of times to the proxy node based on HTTP;
step S306, the agent node receives the reference times sent by the management equipment, and sends a heartbeat data acquisition request to the correspondingly deployed data node based on JMX according to a preset second time interval;
step S308, the data node sends heartbeat data of the data node to the proxy node based on JMX according to the received heartbeat data acquisition request;
step S310, the agent node determines a first accumulated number of times according to the received heartbeat data;
the first accumulated times are the times of sending heartbeat messages by the data node in a first time period, the starting time of the first time period is the starting time of the data node, and the ending time of the first time period is the acquiring time of the currently acquired heartbeat data.
Step S312, the agent node reads the saved second accumulated number of times and calculates the difference between the first accumulated number of times and the second accumulated number of times; in the case that the difference is smaller than the reference number of times, it determines that the data node has the risk of losing connection;
the second accumulated times are the times of sending the heartbeat message by the data node in the second time period, the starting time of the second time period is the starting time of the data node, and the ending time of the second time period is the last time of acquiring the heartbeat data by the proxy node.
Step S314, the agent node starts a designated analysis tool and obtains the running state data of the distributed storage equipment detected and output by the analysis tool;
step S316, the proxy node sends the acquired running state data and the device information of the distributed storage device to the management device based on the HTTP;
step S318, the management equipment determines the receiving time of the running state data and the equipment information, and generates and stores the running record of the distributed storage equipment according to the determined receiving time, the received running state data and the equipment information;
step S320, the management device determines the receiving time as the ending time point of the preset time length, determines the starting time point of the preset time length, and counts the number of target operation records of the receiving time between the determined starting time point and the ending time point in the corresponding operation records of the distributed storage device according to the received device information;
step S322, the management device determines whether the number of the target operation records is greater than a preset number, and if so, determines that the corresponding data node has a risk of loss of connection.
Step S324, the management node generates and stores early warning data of corresponding distributed storage equipment according to the target operation record;
step S326, the management node sends the early warning data to a designated administrator, and determines the application names of M target YARN applications with highest reading and writing speeds in N YARN applications included in the early warning data; and under the condition that the preset white list contains the determined application name, performing operation stop processing on the target YARN application corresponding to the contained application name.
Wherein M and N are integers and M is less than N.
For the specific implementation of steps S302 to S326, reference may be made to the related description above, and repeated details are not described again. The process automatically identifies whether a data node has a risk of losing connection and automatically issues an early warning, without data loss and without human participation, so that identification efficiency is improved, the risk of the data node losing connection is reduced, and the management efficiency and running stability of the distributed storage system are improved. In addition, because the process does not require introducing multiple additional components, technical complexity and operation and maintenance cost are reduced. It is to be understood that Fig. 7 is illustrative only and not limiting, and that some of the operations may be performed in other ways.
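As an illustration of the comparison made by the proxy node in steps S310 to S312, the following is a minimal sketch rather than the implementation of the application. The class name `HeartbeatTracker`, its fields, and the choice to report no risk on the very first poll (when no second cumulative count has been saved yet) are assumptions introduced for the example; fetching the counter over JMX and the reference count over HTTP is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatTracker:
    """Hypothetical helper: compares successive cumulative heartbeat counts."""

    reference_count: int                  # value supplied by the management device (S304)
    previous_total: Optional[int] = None  # the saved "second cumulative count"

    def at_risk(self, current_total: int) -> bool:
        """Return True if fewer than reference_count heartbeats were sent since the last poll.

        current_total is the "first cumulative count": heartbeats sent by the
        data node from its startup until the current acquisition time.
        """
        previous = self.previous_total
        self.previous_total = current_total  # save for the next comparison
        if previous is None:
            # Assumption: on the first poll there is nothing to compare against.
            return False
        return (current_total - previous) < self.reference_count


tracker = HeartbeatTracker(reference_count=3)
for total in (10, 15, 16, 16):
    print(total, tracker.at_risk(total))
```

Because both counts start at the startup time of the data node, their difference is the number of heartbeat messages sent between two consecutive acquisitions, which is why a small difference indicates that the data node has stopped reporting normally.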
Based on the same technical concept, one or more embodiments of the present application further provide a management apparatus of a distributed storage system, which is applied to a proxy node. Fig. 8 is a schematic block diagram of a management apparatus of a distributed storage system according to one or more embodiments of the present application. As shown in Fig. 8, the apparatus includes:
a first obtaining module 401, configured to obtain heartbeat data of the data node that is deployed in the distributed storage device and corresponds to the proxy node;
a second obtaining module 402, configured to obtain running state data of the distributed storage device if it is determined according to the heartbeat data that the data node has a risk of losing connection;
a sending module 403, configured to send the running state data to a management device in the distributed storage system, where the running state data is used by the management device to perform early warning processing when the management device determines, according to the running state data, that the data node has a risk of losing connection.
Optionally, the apparatus further includes a first determining module, a reading module, and a second determining module;
the first determining module is configured to determine a first cumulative count according to the heartbeat data, where the first cumulative count is the number of heartbeat messages sent by the data node in a first time period, the start time of the first time period is the startup time of the data node, and the end time of the first time period is the acquisition time of the heartbeat data;
the reading module is configured to read a stored second cumulative count, where the second cumulative count is the number of heartbeat messages sent by the data node in a second time period, the start time of the second time period is the startup time of the data node, and the end time of the second time period is the time at which the proxy node last acquired heartbeat data;
and the second determining module is configured to determine whether the data node has a risk of losing connection according to the first cumulative count and the second cumulative count.
Optionally, the second determining module is specifically configured to:
determine that the data node has a risk of losing connection if the first cumulative count is the same as the second cumulative count;
or,
before it is determined whether the data node has a risk of losing connection according to the first cumulative count and the second cumulative count, a reference count acquisition request is sent to the management device, and the reference count sent by the management device is received;
and determining whether the data node has a risk of losing connection according to the first cumulative count and the second cumulative count includes: calculating the difference between the first cumulative count and the second cumulative count; and determining that the data node has a risk of losing connection if the difference is smaller than the reference count.
Optionally, the apparatus includes a third determining module and a fourth determining module;
the third determining module is configured to determine, according to the heartbeat data, the number of heartbeat messages sent by the data node in a third time period (the send count), where the start time of the third time period is the time at which the proxy node last acquired heartbeat data, and the end time of the third time period is the acquisition time of the current heartbeat data;
and the fourth determining module is configured to determine whether the data node has a risk of losing connection according to the send count.
Optionally, the fourth determining module is specifically configured to:
determine that the data node has a risk of losing connection if the send count is zero;
or,
before it is determined whether the data node has a risk of losing connection according to the send count, a reference count acquisition request is sent to the management device, and the reference count sent by the management device is received;
and determining whether the data node has a risk of losing connection according to the send count includes: determining that the data node has a risk of losing connection if the send count is smaller than the reference count.
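The two alternatives handled by the fourth determining module can be sketched as a single function. This is only an illustrative sketch; the function name `interval_at_risk` and the use of `None` to mean "no reference count was requested from the management device" are assumptions made for the example.

```python
from typing import Optional


def interval_at_risk(send_count: int, reference_count: Optional[int] = None) -> bool:
    """Decide risk from the heartbeats sent since the previous acquisition.

    send_count: heartbeat messages sent by the data node in the third time period.
    reference_count: threshold obtained from the management device, or None
    (an assumption of this sketch) when the zero-count rule is used instead.
    """
    if reference_count is None:
        # First alternative: no heartbeat at all in the interval.
        return send_count == 0
    # Second alternative: fewer heartbeats than the reference count.
    return send_count < reference_count


print(interval_at_risk(0))      # zero-count rule
print(interval_at_risk(2, 3))   # reference-count rule
```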
Further, corresponding to the management method of the distributed storage system described above and based on the same technical concept, one or more embodiments of the present application further provide another management apparatus of a distributed storage system, which is applied to a management device. Fig. 9 is a schematic block diagram of another management apparatus of a distributed storage system according to one or more embodiments of the present application. As shown in Fig. 9, the apparatus includes:
a receiving module 501, configured to receive running state data, sent by the proxy node, of the distributed storage device where the proxy node is located, where the running state data is sent by the proxy node when the proxy node determines that the data node that is deployed in the distributed storage device and corresponds to the proxy node has a risk of losing connection;
a first determining module 502, configured to determine whether the data node has a risk of losing connection according to the running state data;
and an early warning module 503, configured to perform early warning processing according to the running state data if the data node has a risk of losing connection.
Optionally, the receiving module 501 is specifically configured to:
receive the running state data and the device information of the distributed storage device sent by the proxy node;
correspondingly, the apparatus further includes a generating module;
the generating module is configured to determine the receiving time of the running state data and the device information, generate a running record of the distributed storage device according to the receiving time, the running state data, and the device information, and store the running record;
correspondingly, the first determining module 502 is specifically configured to:
determine that the data node has a risk of losing connection if the stored running records of the distributed storage device meet a preset condition.
Optionally, the first determining module 502 is further specifically configured to:
take the receiving time as the end time point of a preset duration, and determine the start time point of the preset duration;
count, according to the device information, the number of target running records whose receiving time falls between the start time point and the end time point among the stored running records of the distributed storage device;
and determine that the stored running records of the distributed storage device meet the preset condition if the number of target running records is greater than a preset number.
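The window-based count performed by the first determining module 502 can be sketched as follows. This is a hedged illustration only: the class `RunningRecordStore`, the in-memory dictionary used as storage, the device identifier `"dev-1"`, and the decision to treat both window boundaries as inclusive are assumptions of the example, not details stated in the application.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class RunningRecordStore:
    """Hypothetical in-memory store of running records, keyed by device information."""

    def __init__(self, preset_duration: timedelta, preset_number: int):
        self.preset_duration = preset_duration
        self.preset_number = preset_number
        self._records = defaultdict(list)  # device_id -> [(receiving_time, state_data)]

    def add_and_check(self, device_id: str, state_data: dict, receiving_time: datetime):
        """Store a running record, then count the records inside the preset duration.

        Returns (at_risk, target_records). The receiving time is the end point of
        the window; the start point is receiving_time - preset_duration.
        """
        self._records[device_id].append((receiving_time, state_data))
        start = receiving_time - self.preset_duration
        targets = [
            record for record in self._records[device_id]
            if start <= record[0] <= receiving_time  # assumption: inclusive bounds
        ]
        return len(targets) > self.preset_number, targets


store = RunningRecordStore(preset_duration=timedelta(minutes=10), preset_number=2)
t0 = datetime(2022, 2, 21, 12, 0, 0)
for offset in (0, 3, 6):
    risk, targets = store.add_and_check("dev-1", {"load": 0.9}, t0 + timedelta(minutes=offset))
    print(offset, risk, len(targets))
```

Counting reports within a window, rather than acting on a single report, means that the management device confirms the risk only when the proxy node has reported the same device repeatedly within the preset duration.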
Optionally, the early warning module 503 is specifically configured to:
generate early warning data of the distributed storage device according to the target running records;
and store the early warning data and send the early warning data to a designated administrator, where the early warning data is used by the administrator to identify a large task running in the distributed storage device and to end the large task.
Optionally, the early warning data includes the N YARN applications with the highest read-write speeds running in the distributed storage device, where N is an integer; the apparatus further includes a second determining module and a processing module;
the second determining module is configured to determine the application names of the M target YARN applications with the highest read-write speeds among the N YARN applications, where M is an integer and is smaller than N;
the processing module is configured to, if a preset white list contains a determined application name, stop the running of the target YARN application corresponding to that application name.
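The selection made by the second determining module and the white list check made by the processing module can be sketched as below. The sketch is illustrative only: the function name `select_apps_to_stop`, the mapping from application name to read-write speed, the sample application names, and the requirement that M be positive are assumptions of the example, and actually stopping a YARN application is deliberately left out.

```python
from typing import Dict, List, Set


def select_apps_to_stop(app_speeds: Dict[str, float], m: int, white_list: Set[str]) -> List[str]:
    """Return the names of the top-M applications that the white list allows to be stopped.

    app_speeds: the N applications in the early warning data, name -> read-write speed.
    m: number of target applications; the source requires M < N, and this
       sketch additionally assumes M > 0.
    white_list: preset names of applications that may be stopped automatically.
    """
    if not 0 < m < len(app_speeds):
        raise ValueError("M must be a positive integer smaller than N")
    # Rank the N applications by read-write speed, highest first, and keep M of them.
    top_m = sorted(app_speeds, key=app_speeds.get, reverse=True)[:m]
    # Only applications named in the white list are candidates for stopping.
    return [name for name in top_m if name in white_list]


speeds = {"etl": 900.0, "report": 120.0, "adhoc": 450.0, "backup": 300.0}
print(select_apps_to_stop(speeds, 2, {"adhoc", "backup"}))
```

Applications among the top M that are not in the white list are left running, so they remain for the administrator to handle on the basis of the early warning data.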
Optionally, the apparatus further includes a sending module;
the receiving module 501 is further configured to receive a reference count acquisition request sent by the proxy node before receiving the running state data, sent by the proxy node, of the distributed storage device where the proxy node is located;
the sending module is configured to send the determined reference count to the proxy node, where the reference count is used by the proxy node to determine whether the data node has a risk of losing connection.
It should be noted that the embodiments of the management apparatus of a distributed storage system in the present application and the embodiments of the management method of a distributed storage system in the present application are based on the same inventive concept; therefore, for the specific implementation of these embodiments, reference may be made to the implementation of the corresponding management method, and repeated details are not described again.
Further, corresponding to the management method of the distributed storage system described above and based on the same technical concept, one or more embodiments of the present application further provide a distributed storage system. Fig. 10 is a schematic composition diagram of a distributed storage system according to one or more embodiments of the present application. As shown in Fig. 10, the system includes one management device 601 and a plurality of distributed storage devices 602 (only one is shown in Fig. 10); a data node 6021 and a proxy node 6022 corresponding to the data node 6021 are deployed in each distributed storage device 602;
the proxy node 6022 is configured to obtain heartbeat data of the correspondingly deployed data node 6021; if it is determined according to the heartbeat data that the data node 6021 has a risk of losing connection, obtain running state data of the distributed storage device 602; and send the running state data to the management device 601;
the management device 601 is configured to determine whether the data node 6021 has a risk of losing connection according to the received running state data, and to perform early warning processing according to the running state data if the data node has a risk of losing connection.
It should be noted that the embodiments of the distributed storage system in the present application and the embodiments of the management method of a distributed storage system in the present application are based on the same inventive concept; therefore, for the specific implementation of these embodiments, reference may be made to the implementation of the corresponding management method, and repeated details are not described again.
Further, corresponding to the management method of the distributed storage system described above and based on the same technical concept, one or more embodiments of the present application further provide an electronic device configured to execute the management method of the distributed storage system described above. Fig. 11 is a schematic structural diagram of an electronic device provided in one or more embodiments of the present application.
As shown in Fig. 11, electronic devices may differ considerably depending on configuration or performance, and may include one or more processors 701 and a memory 702, where one or more application programs or data may be stored in the memory 702. The memory 702 may be transient storage or persistent storage. An application program stored in the memory 702 may include one or more modules (not shown), each of which may include a series of computer-executable instructions for the electronic device. Further, the processor 701 may be configured to communicate with the memory 702 and to execute, on the electronic device, the series of computer-executable instructions in the memory 702. The electronic device may also include one or more power supplies 703, one or more wired or wireless network interfaces 704, one or more input-output interfaces 705, one or more keyboards 706, and the like.
In one particular embodiment, the electronic device includes a memory and one or more programs, where the one or more programs are stored in the memory and may include one or more modules, each module may include a series of computer-executable instructions for the electronic device, and the one or more programs are configured to be executed by one or more processors and include computer-executable instructions for:
acquiring heartbeat data of the data node that is deployed in a distributed storage device and corresponds to the proxy node;
if it is determined according to the heartbeat data that the data node has a risk of losing connection, acquiring running state data of the distributed storage device;
and sending the running state data to a management device in the distributed storage system, where the running state data is used by the management device to perform early warning processing when the management device determines, according to the running state data, that the data node has a risk of losing connection.
In another particular embodiment, the electronic device includes a memory and one or more programs, where the one or more programs are stored in the memory and may include one or more modules, each module may include a series of computer-executable instructions for the electronic device, and the one or more programs are configured to be executed by one or more processors and include computer-executable instructions for:
receiving running state data, sent by the proxy node, of the distributed storage device where the proxy node is located, where the running state data is sent by the proxy node when the proxy node determines that the data node that is deployed in the distributed storage device and corresponds to the proxy node has a risk of losing connection;
determining whether the data node has a risk of losing connection according to the running state data;
and performing early warning processing according to the running state data if the data node has a risk of losing connection.
It should be noted that the embodiments of the electronic device in the present application and the embodiments of the management method of a distributed storage system in the present application are based on the same inventive concept; therefore, for the specific implementation of these embodiments, reference may be made to the implementation of the corresponding management method, and repeated details are not described again.
Further, based on the same technical concept, one or more embodiments of the present application further provide a storage medium for storing computer-executable instructions. In a specific embodiment, the storage medium may be a USB flash drive, an optical disc, a hard disk, or the like, and the computer-executable instructions stored in the storage medium, when executed by a processor, implement the following process:
acquiring heartbeat data of the data node that is deployed in a distributed storage device and corresponds to the proxy node;
if it is determined according to the heartbeat data that the data node has a risk of losing connection, acquiring running state data of the distributed storage device;
and sending the running state data to a management device in the distributed storage system, where the running state data is used by the management device to perform early warning processing when the management device determines, according to the running state data, that the data node has a risk of losing connection.
In another specific embodiment, the storage medium may be a USB flash drive, an optical disc, a hard disk, or the like, and the computer-executable instructions stored in the storage medium, when executed by a processor, implement the following process:
receiving running state data, sent by the proxy node, of the distributed storage device where the proxy node is located, where the running state data is sent by the proxy node when the proxy node determines that the data node that is deployed in the distributed storage device and corresponds to the proxy node has a risk of losing connection;
determining whether the data node has a risk of losing connection according to the running state data;
and performing early warning processing according to the running state data if the data node has a risk of losing connection.
It should be noted that the embodiments of the storage medium in the present application and the embodiments of the management method of a distributed storage system in the present application are based on the same inventive concept; therefore, for the specific implementation of these embodiments, reference may be made to the implementation of the corresponding management method, and repeated details are not described again.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present application are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiment is substantially similar to the method embodiment, its description is relatively brief, and for the relevant points reference may be made to the corresponding description of the method embodiment.
The above description is only an example of this document and is not intended to limit this document. Various modifications and changes may occur to those skilled in the art from this document. Any modifications, equivalents, improvements, etc. which come within the spirit and principle of the disclosure are intended to be included within the scope of the claims of this document.

Claims (16)

1. A management method of a distributed storage system, applied to a proxy node, the method comprising:
acquiring heartbeat data of a data node that is deployed in a distributed storage device and corresponds to the proxy node;
if it is determined according to the heartbeat data that the data node has a risk of losing connection, acquiring running state data of the distributed storage device;
and sending the running state data to a management device in the distributed storage system, wherein the running state data is used by the management device to perform early warning processing when the management device determines, according to the running state data, that the data node has a risk of losing connection.
2. The method of claim 1, further comprising:
determining a first cumulative count according to the heartbeat data, wherein the first cumulative count is the number of heartbeat messages sent by the data node in a first time period, the start time of the first time period is the startup time of the data node, and the end time of the first time period is the acquisition time of the heartbeat data;
reading a stored second cumulative count, wherein the second cumulative count is the number of heartbeat messages sent by the data node in a second time period, the start time of the second time period is the startup time of the data node, and the end time of the second time period is the time at which the proxy node last acquired heartbeat data;
and determining whether the data node has a risk of losing connection according to the first cumulative count and the second cumulative count.
3. The method according to claim 2, wherein the determining whether the data node has a risk of losing connection according to the first cumulative count and the second cumulative count comprises: determining that the data node has a risk of losing connection if the first cumulative count is the same as the second cumulative count;
or,
before the determining whether the data node has a risk of losing connection according to the first cumulative count and the second cumulative count, the method further comprises: sending a reference count acquisition request to the management device, and receiving the reference count sent by the management device;
and the determining whether the data node has a risk of losing connection according to the first cumulative count and the second cumulative count comprises: calculating the difference between the first cumulative count and the second cumulative count; and determining that the data node has a risk of losing connection if the difference is smaller than the reference count.
4. The method of claim 1, further comprising:
determining, according to the heartbeat data, the number of heartbeat messages sent by the data node in a third time period (the send count), wherein the start time of the third time period is the time at which the proxy node last acquired heartbeat data, and the end time of the third time period is the acquisition time of the current heartbeat data;
and determining whether the data node has a risk of losing connection according to the send count.
5. The method of claim 4, wherein the determining whether the data node has a risk of losing connection according to the send count comprises: determining that the data node has a risk of losing connection if the send count is zero;
or,
before the determining whether the data node has a risk of losing connection according to the send count, the method further comprises: sending a reference count acquisition request to the management device, and receiving the reference count sent by the management device;
and the determining whether the data node has a risk of losing connection according to the send count comprises: determining that the data node has a risk of losing connection if the send count is smaller than the reference count.
6. A management method of a distributed storage system, applied to a management device, the method comprising:
receiving running state data, sent by a proxy node, of the distributed storage device where the proxy node is located, wherein the running state data is sent by the proxy node when the proxy node determines that the data node that is deployed in the distributed storage device and corresponds to the proxy node has a risk of losing connection;
determining whether the data node has a risk of losing connection according to the running state data;
and performing early warning processing according to the running state data if the data node has a risk of losing connection.
7. The method according to claim 6, wherein the receiving running state data, sent by the proxy node, of the distributed storage device where the proxy node is located comprises: receiving the running state data and the device information of the distributed storage device sent by the proxy node;
before the determining whether the data node has a risk of losing connection according to the running state data, the method further comprises: determining the receiving time of the running state data and the device information; generating a running record of the distributed storage device according to the receiving time, the running state data, and the device information; and storing the running record;
and the determining whether the data node has a risk of losing connection according to the running state data comprises: determining that the data node has a risk of losing connection if the stored running records of the distributed storage device meet a preset condition.
8. The method of claim 7, further comprising:
taking the receiving time as the end time point of a preset duration, and determining the start time point of the preset duration;
counting, according to the device information, the number of target running records whose receiving time falls between the start time point and the end time point among the stored running records of the distributed storage device;
and determining that the stored running records of the distributed storage device meet the preset condition if the number of target running records is greater than a preset number.
9. The method of claim 8, wherein the performing early warning processing according to the running state data comprises:
generating early warning data of the distributed storage device according to the target running records;
and storing the early warning data and sending the early warning data to a designated administrator, wherein the early warning data is used by the administrator to identify, according to the early warning data, a large task running in the distributed storage device and to end the large task.
10. The method of claim 9, wherein the early warning data comprises the N YARN applications with the highest read-write speeds running in the distributed storage device, N being an integer; the method further comprises:
determining the application names of the M target YARN applications with the highest read-write speeds among the N YARN applications, wherein M is an integer and is smaller than N;
and, if a preset white list contains a determined application name, stopping the running of the target YARN application corresponding to that application name.
11. The method according to claim 6, wherein before the receiving running state data, sent by the proxy node, of the distributed storage device where the proxy node is located, the method further comprises:
receiving a reference count acquisition request sent by the proxy node;
and sending the determined reference count to the proxy node, wherein the reference count is used by the proxy node to determine whether the data node has a risk of losing connection.
12. A management apparatus of a distributed storage system, applied to a proxy node, the apparatus comprising:
the first acquisition module is used for acquiring heartbeat data of data nodes which are deployed in the distributed storage equipment and correspond to the agent nodes;
the second acquisition module is used for acquiring the running state data of the distributed storage equipment if the data node is determined to have the risk of losing connection according to the heartbeat data;
and the sending module is used for sending the running state data to the management equipment in the distributed storage system, and the running state data is used for carrying out early warning processing according to the running state data when the management equipment determines that the data node has the risk of losing connection according to the running state data.
13. A management apparatus of a distributed storage system, applied to a management device, the apparatus comprising:
a receiving module, configured to receive running state data of the distributed storage device where the agent node is located, wherein the running state data is sent by the agent node when the agent node determines that the data node deployed in the distributed storage device and corresponding to the agent node is at risk of losing connection;
a first determining module, configured to determine, according to the running state data, whether the data node is at risk of losing connection;
and an early warning module, configured to perform early warning processing according to the running state data if the data node is at risk of losing connection.
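As an illustration only, the three modules of claim 13 can be sketched as methods of one class. The claim does not specify which metrics the running state data contains or how the risk is determined, so the metric names, the 90 percent limits and the warning format below are assumptions, and every identifier is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed metric names and limits; the claims do not enumerate them.
ASSUMED_LIMITS: Dict[str, float] = {"cpu_percent": 90.0, "memory_percent": 90.0}


@dataclass
class ManagementApparatus:
    """Hypothetical management-side apparatus with the three claimed modules."""

    limits: Dict[str, float] = field(default_factory=lambda: dict(ASSUMED_LIMITS))
    warnings: List[str] = field(default_factory=list)

    def receive(self, node_id: str, state: Dict[str, float]) -> bool:
        """Receiving module: accept running state data sent by an agent node."""
        at_risk = bool(self.exceeded(state))
        if at_risk:
            self.warn(node_id, state)
        return at_risk

    def exceeded(self, state: Dict[str, float]) -> List[str]:
        """First determining module: list metrics at or above their assumed limit."""
        return sorted(
            name for name, limit in self.limits.items()
            if state.get(name, 0.0) >= limit
        )

    def warn(self, node_id: str, state: Dict[str, float]) -> None:
        """Early warning module: record a warning built from the running state data."""
        metrics = ", ".join(self.exceeded(state))
        self.warnings.append(f"{node_id}: risk of losing connection ({metrics})")


apparatus = ManagementApparatus()
first = apparatus.receive("datanode-1", {"cpu_percent": 97.0, "memory_percent": 40.0})
second = apparatus.receive("datanode-2", {"cpu_percent": 20.0, "memory_percent": 35.0})
print(first, second, apparatus.warnings)
```

The second determination on the management side means a report from an agent node does not by itself raise a warning; only running state data that also exceeds the assumed limits does.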
14. A distributed storage system, comprising: a management device and a plurality of distributed storage devices, wherein a data node and an agent node corresponding to the data node are deployed in each distributed storage device;
the agent node is configured to acquire heartbeat data of the correspondingly deployed data node; if it is determined, according to the heartbeat data, that the data node is at risk of losing connection, acquire running state data of the distributed storage device; and send the running state data to the management device;
the management device is configured to determine, according to the received running state data, whether the data node is at risk of losing connection; and, if the data node is at risk of losing connection, perform early warning processing according to the running state data.
15. An electronic device, comprising:
a processor; and
a memory arranged to store computer executable instructions configured for execution by the processor, the executable instructions comprising instructions for performing the steps in the method of any one of claims 1-5, or the executable instructions comprising instructions for performing the steps in the method of any one of claims 6-11.
16. A storage medium for storing computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 5 or for causing a computer to perform the method of any one of claims 6 to 11.
CN202210158459.9A 2022-02-21 2022-02-21 Distributed storage system and management method, device and equipment thereof Pending CN114546979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210158459.9A CN114546979A (en) 2022-02-21 2022-02-21 Distributed storage system and management method, device and equipment thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210158459.9A CN114546979A (en) 2022-02-21 2022-02-21 Distributed storage system and management method, device and equipment thereof

Publications (1)

Publication Number Publication Date
CN114546979A true CN114546979A (en) 2022-05-27

Family

ID=81676643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210158459.9A Pending CN114546979A (en) 2022-02-21 2022-02-21 Distributed storage system and management method, device and equipment thereof

Country Status (1)

Country Link
CN (1) CN114546979A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103701661A (en) * 2013-12-23 2014-04-02 浪潮(北京)电子信息产业有限公司 Method and system for realizing node monitoring
US20170063911A1 (en) * 2015-08-31 2017-03-02 Splunk Inc. Lateral Movement Detection for Network Security Analysis
CN109660380A (en) * 2018-09-28 2019-04-19 深圳壹账通智能科技有限公司 Monitoring method, platform, system and readable storage medium for server operating condition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
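赵志勇: "Design and Implementation of a Mobile Hadoop Cluster Monitoring System", China Master's Theses Full-text Database, Information Science and Technology, no. 9, 15 September 2015 (2015-09-15), pages 138 - 473 *
软件求生: "Distributed: Design Strategies for Distributed Systems", HTTPS://BLOG.CSDN.NET/EN_JOKER/ARTICLE/DETAILS/102922420, 5 November 2019 (2019-11-05), pages 1 - 3 *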

Similar Documents

Publication Publication Date Title
CN110413346B (en) Parameter updating method and device
CN102868736B (en) A kind of cloud computing Monitoring framework design basis ground motion method and cloud computing treatment facility
US8930521B2 (en) Method, apparatus, and computer program product for enabling monitoring of a resource
CN111966289B (en) Partition optimization method and system based on Kafka cluster
CN108427619B (en) Log management method and device, computing equipment and storage medium
CN112130996A (en) Data monitoring control system, method and device, electronic equipment and storage medium
CN102385536B (en) Method and system for realization of parallel computing
CN110677475A (en) Micro-service processing method, device, equipment and storage medium
US20100332582A1 (en) Method and System for Service Contract Discovery
CN113986534A (en) Task scheduling method and device, computer equipment and computer readable storage medium
CN109586970B (en) Resource allocation method, device and system
US10348814B1 (en) Efficient storage reclamation for system components managing storage
CN116204239A (en) Service processing method, device and computer readable storage medium
CN112698929A (en) Information acquisition method and device
CN106550002B (en) paas cloud hosting system and method
CN111104212A (en) Scheduling task execution method and device, electronic equipment and storage medium
CN114546979A (en) Distributed storage system and management method, device and equipment thereof
CN114610446B (en) Method, device and system for automatically injecting probe
Wei et al. An agent-based services framework with adaptive monitoring in cloud environments
CN113055493B (en) Data packet processing method, device, system, scheduling device and storage medium
CN111026598A (en) Data acquisition method and device
CN111831503A (en) Monitoring method based on monitoring agent and monitoring agent device
CN105760215A (en) Map-reduce model based job running method for distributed file system
CN112764837B (en) Data reporting method, device, storage medium and terminal
CN113360558A (en) Data processing method, data processing device, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination