CN104158843A - Storage unit invalidation detecting method and device for distributed file storage system - Google Patents


Info

Publication number
CN104158843A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410333913.5A
Other languages
Chinese (zh)
Other versions
CN104158843B (en)
Inventor
李璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Toyou Feiji Electronics Co., Ltd.
Original Assignee
SHENZHEN ZHONGBO KECHUANG INFORMATION TECHNOLOGY Co Ltd
Application filed by SHENZHEN ZHONGBO KECHUANG INFORMATION TECHNOLOGY Co Ltd
Priority to CN201410333913.5A
Publication of CN104158843A
Application granted
Publication of CN104158843B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a storage unit failure detection method for a distributed file storage system. The method comprises the following steps: obtaining, in turn, the operation identifier of each storage unit of each node; when obtaining the operation identifier of a storage unit fails, recording that storage unit as a failed storage unit, and continuing to obtain the operation identifiers of the other storage units of the node where that storage unit is located, or obtaining in turn the operation identifiers of the storage units of other nodes. The invention further discloses a storage unit failure detection device for the distributed file storage system. With this method and device, the operation identifiers of the storage units of the nodes are obtained in turn in order to identify and record failed storage units, so that failed storage units in the nodes of the distributed file storage system can be detected effectively, a user can maintain the failed storage units in time, and the reliability of the distributed file storage system is ensured.

Description

Storage unit failure detection method and device for a distributed file storage system
Technical field
The present invention relates to the field of failure detection in distributed file systems, and in particular to a storage unit failure detection method and device for a distributed file storage system.
Background
In recent years, networked distributed storage has become a new trend in the development of storage technology, and a distributed file system is an essential part of building a large-scale distributed storage system. Because data is distributed across the storage units of different storage nodes, even when several storage units fail and become unavailable, the data still exists in storage units of other nodes, so an access node can still access it normally; this gives the data high reliability. However, although the data is backed up in other storage units, as failed storage units accumulate, data may be lost, after which the data can no longer be accessed normally and the distributed file storage system fails and becomes unavailable.
Therefore, a scheme for detecting storage unit failure is urgently needed, so that failed storage units in a distributed file storage system are found in time and can be replaced promptly, ensuring the high reliability of the distributed file storage system.
Summary of the invention
The main purpose of the present invention is to solve the technical problem that a distributed file storage system cannot detect failed storage units.
To achieve the above purpose, the present invention provides a storage unit failure detection method for a distributed file storage system, comprising the following steps:
Obtaining, in turn, the operation identifier of each storage unit of each node;
When obtaining the operation identifier of a storage unit fails, recording that storage unit as a failed storage unit, and continuing to obtain the operation identifiers of the other storage units of the node where that storage unit is located, or obtaining in turn the operation identifiers of the storage units of other nodes.
Preferably, recording the storage unit as a failed storage unit when obtaining its operation identifier fails comprises:
When obtaining the operation identifier of a storage unit fails, restarting the storage unit;
If the storage unit fails to restart within a preset first time interval, recording the storage unit as a failed storage unit.
Preferably, after the step of recording the storage unit as a failed storage unit when obtaining its operation identifier fails, and continuing to obtain the operation identifiers of the other storage units of the same node or obtaining in turn the operation identifiers of the storage units of other nodes, the storage unit failure detection method further comprises:
Determining the number of failed storage units in the distributed file storage system;
When the number of failed storage units in the distributed file storage system is greater than a first threshold, determining that the distributed file storage system has failed.
Preferably, before the step of obtaining in turn the operation identifier of each storage unit of each node, the storage unit failure detection method further comprises:
Controlling the nodes in the distributed file storage system to send detection packets to one another;
Taking each node in the distributed file storage system in turn as the second node and the other nodes as first nodes, in order to determine the validity of the second node;
Determining the number of first nodes that do not receive a response packet within a preset first time interval, the response packet being fed back by the second node in response to the detection packet sent by the first node;
When the number of first nodes that do not receive a response packet is greater than a preset second threshold, recording the second node as a failed node and shielding the second node.
Preferably, after the step of recording the second node as a failed node and shielding it when the number of first nodes that do not receive a response packet is greater than the preset second threshold, the storage unit failure detection method further comprises:
Determining the number of failed nodes in the distributed file storage system;
When the number of failed nodes in the distributed file storage system is less than a preset third threshold, determining that the distributed file storage system is valid.
In addition, to achieve the above purpose, the present invention also provides a storage unit failure detection device for a distributed file storage system, comprising:
An acquisition module, configured to obtain in turn the operation identifier of each storage unit of each node and, when obtaining the operation identifier of a storage unit fails, to continue obtaining the operation identifiers of the other storage units of the node where that storage unit is located, or to obtain in turn the operation identifiers of the storage units of other nodes;
A recording module, configured to record a storage unit as a failed storage unit when obtaining its operation identifier fails.
Preferably, the recording module comprises:
A restart unit, configured to restart a storage unit when obtaining its operation identifier fails;
A recording unit, configured to record the storage unit as a failed storage unit if it fails to restart within the preset first time interval.
Preferably, the storage unit failure detection device further comprises:
A first determination module, configured to determine the number of failed storage units in the distributed file storage system;
A second determination module, configured to determine that the distributed file storage system has failed when the number of failed storage units in the distributed file storage system is greater than the first threshold.
Preferably, the storage unit failure detection device further comprises:
A control module, configured to control the nodes in the distributed file storage system to send detection packets to one another;
A node validity detection module, configured to take each node in the distributed file storage system in turn as the second node and the other nodes as first nodes, in order to determine the validity of the second node;
A third determination module, configured to determine the number of first nodes that do not receive a response packet within the preset first time interval, the response packet being fed back by the second node in response to the detection packet sent by the first node;
A shielding module, configured to record the second node as a failed node and shield the second node when the number of first nodes that do not receive a response packet is greater than the preset second threshold.
Preferably, the storage unit failure detection device further comprises:
A fourth determination module, configured to determine the number of failed nodes in the distributed file storage system;
A fifth determination module, configured to determine that the distributed file storage system is valid when the number of failed nodes in the distributed file storage system is less than the preset third threshold.
With the storage unit failure detection method and device of the present invention, the operation identifiers of the storage units in each node are obtained in turn in order to identify and record failed storage units. Failed storage units in the nodes of the distributed file storage system can thus be detected effectively, so that users can maintain them in time, which ensures the reliability of the distributed file storage system.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the first embodiment of the storage unit failure detection method of the present invention;
Fig. 2 is a schematic flowchart of the second embodiment of the storage unit failure detection method of the present invention;
Fig. 3 is a schematic flowchart of the third embodiment of the storage unit failure detection method of the present invention;
Fig. 4 is a schematic flowchart of the fourth embodiment of the storage unit failure detection method of the present invention;
Fig. 5 is a schematic flowchart of the fifth embodiment of the storage unit failure detection method of the present invention;
Fig. 6 is a functional block diagram of the first embodiment of the storage unit failure detection device of the present invention;
Fig. 7 is a functional block diagram of the second embodiment of the storage unit failure detection device of the present invention;
Fig. 8 is a functional block diagram of the third embodiment of the storage unit failure detection device of the present invention;
Fig. 9 is a functional block diagram of the fourth embodiment of the storage unit failure detection device of the present invention;
Fig. 10 is a functional block diagram of the fifth embodiment of the storage unit failure detection device of the present invention.
The realization of the purpose, the functional features and the advantages of the present invention are further described below in connection with the embodiments and with reference to the accompanying drawings.
Detailed description of the embodiments
It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.
The present invention provides a storage unit failure detection method for a distributed file storage system (hereinafter referred to as the storage unit failure detection method).
Referring to Fig. 1, Fig. 1 is a schematic flowchart of the first embodiment of the storage unit failure detection method of the present invention.
In the first embodiment, the storage unit failure detection method comprises:
Step S10: obtain, in turn, the operation identifier of each storage unit of each node;
In a distributed file storage system, each node contains multiple storage units, and each running storage unit has a unique operation identifier (for example, the process number of its running process). While the distributed file storage system is running, every node is working; if a storage unit in a node is running, its operation identifier can be obtained on that node, and if a storage unit is not running, its operation identifier cannot be obtained on that node. By obtaining in turn the operation identifier of each storage unit of each node, the storage units that are running and those that are not running, that is, the faulty storage units, can be determined.
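As an illustration of the process-number example above, the following is a minimal sketch of how an operation identifier might be obtained on a POSIX node. It assumes, purely for illustration, that each storage unit writes its process number to a PID file; the patent does not say how the identifier is stored, and the function name `get_operation_identifier` is hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def get_operation_identifier(pid_file: Path) -> Optional[int]:
    """Return the PID of a running storage-unit process, or None if it cannot be obtained.

    Assumption: the storage unit records its PID in `pid_file` (not stated in the patent).
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return None  # PID file missing, unreadable, or not a number
    if pid <= 0:
        return None  # guard: non-positive values address process groups in kill()
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists (POSIX)
    except ProcessLookupError:
        return None  # no process with this PID: the unit is not running
    except PermissionError:
        pass  # the process exists but belongs to another user
    return pid
```

A return value of `None` corresponds to the case in which the operation identifier cannot be obtained. The `os.kill(pid, 0)` check is POSIX-specific and should not be used this way on Windows.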
Step S20: when obtaining the operation identifier of a storage unit fails, record that storage unit as a failed storage unit, and continue to obtain the operation identifiers of the other storage units of the node where it is located, or obtain in turn the operation identifiers of the storage units of other nodes.
When obtaining the operation identifier of a storage unit fails, that is, when the operation identifier cannot be obtained, the storage unit is not running; it is faulty and cannot be used, so it is recorded as a failed storage unit. Acquisition then continues with the other storage units of the same node or with the storage units of other nodes. After a failed storage unit is recorded, a maintenance request can be sent to a maintenance terminal (such as a server or a terminal carried by maintenance personnel) as a timely reminder to repair or replace the failed storage unit, so as to ensure the reliability of the distributed file storage system.
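Steps S10 and S20 can be sketched as a simple nested loop. This is only an illustrative sketch: the cluster layout (a mapping from node name to unit names) and the `probe` callback that returns an operation identifier or `None` are assumptions, not structures defined by the patent.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical probe: returns the operation identifier of (node, unit), or None on failure.
Probe = Callable[[str, str], Optional[int]]


def detect_failed_units(cluster: Dict[str, List[str]], probe: Probe) -> List[Tuple[str, str]]:
    """Scan every storage unit of every node in turn and collect the failed ones."""
    failed: List[Tuple[str, str]] = []
    for node, units in cluster.items():      # step S10: visit the nodes in turn
        for unit in units:                   # visit each storage unit of the node in turn
            if probe(node, unit) is None:    # step S20: acquisition failed
                failed.append((node, unit))  # record the unit and keep scanning
    return failed


# Usage: unit u2 on node A has no operation identifier, so it is recorded as failed.
running = {("A", "u1"): 101, ("B", "u1"): 202}
cluster = {"A": ["u1", "u2"], "B": ["u1"]}
failed_units = detect_failed_units(cluster, lambda n, u: running.get((n, u)))
```

A failed acquisition does not stop the scan; it only adds an entry to the record, matching the "record and continue" behaviour of step S20.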
The storage unit failure detection method proposed in this embodiment obtains in turn the operation identifiers of the storage units in each node in order to identify and record failed storage units. Failed storage units in the nodes of the distributed file storage system can thus be detected effectively, so that users can maintain them in time, which ensures the reliability of the distributed file storage system.
Referring to Fig. 2, Fig. 2 is a schematic flowchart of the second embodiment of the storage unit failure detection method of the present invention.
The second embodiment builds on the first embodiment. In the second embodiment, in step S20 of the storage unit failure detection method, recording the storage unit as a failed storage unit when obtaining its operation identifier fails comprises:
Step S21: when obtaining the operation identifier of a storage unit fails, restart the storage unit;
Some faults that stop a storage unit from running can be resolved by a restart, which returns the unit to normal operation. Therefore, when obtaining the operation identifier of a storage unit fails, the storage unit is first restarted so that some storage units can resume normal operation immediately. This keeps as many storage units running as possible, maximizes the reliability of the distributed file storage system, and reduces the workload of maintenance personnel.
Step S22: if the storage unit fails to restart within a preset first time interval, record it as a failed storage unit.
A storage unit whose fault can be resolved by restarting can usually be restarted successfully within the specified time (the first time interval), whereas a storage unit whose fault cannot be resolved by restarting will not restart successfully within that time. The result of the restart operation is returned once the first time interval has elapsed after the restart operation was performed on the faulty storage unit. If the returned result is a restart failure (the storage unit failed to restart within the preset first time interval), the storage unit is recorded as a failed storage unit. If the returned result is a restart success (the storage unit restarted successfully within the preset first time interval), the storage unit has resumed normal operation; its operation identifier can now be obtained, the storage unit is judged to be valid, and acquisition continues with the other storage units of the same node or with the storage units of other nodes.
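The restart-then-record logic of steps S21 and S22 can be sketched as follows. The `restart` and `probe` callbacks and the polling period are hypothetical; the patent only specifies that the restart result is judged against a preset first time interval, so polling before the deadline is a design choice of this sketch.

```python
import time
from typing import Callable, Optional


def restart_and_check(
    restart: Callable[[], None],
    probe: Callable[[], Optional[int]],
    first_interval: float,
    poll: float = 0.05,
) -> Optional[int]:
    """Restart a storage unit, then wait up to `first_interval` seconds for its identifier.

    Returns the operation identifier if the unit comes back in time, otherwise None,
    in which case the caller records the unit as a failed storage unit.
    """
    restart()  # step S21: issue the restart
    deadline = time.monotonic() + first_interval
    while True:
        identifier = probe()
        if identifier is not None:
            return identifier  # restart succeeded within the interval
        if time.monotonic() >= deadline:
            return None  # step S22: restart failed, record as failed
        time.sleep(poll)
```

`time.monotonic()` is used for the deadline so that adjustments to the system clock do not shorten or lengthen the interval.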
In the storage unit failure detection method of this embodiment, a storage unit is restarted when obtaining its operation identifier fails, so that storage units whose faults can be resolved by restarting resume normal operation immediately, while storage units whose faults cannot be resolved by restarting are recorded as failed storage units. This keeps as many storage units running as possible, ensures the reliability of the distributed file storage system, and reduces the workload of maintenance personnel.
Referring to Fig. 3, Fig. 3 is a schematic flowchart of the third embodiment of the storage unit failure detection method of the present invention.
The third embodiment builds on the first or second embodiment. In the third embodiment, after step S20, the storage unit failure detection method further comprises:
Step S30: determine the number of failed storage units in the distributed file storage system;
After the operation identifiers of the storage units of every node in the distributed file storage system have been obtained in turn and all failed storage units have been recorded, the total number of recorded failed storage units is determined.
Step S40: when the number of failed storage units in the distributed file storage system is greater than a first threshold, determine that the distributed file storage system has failed.
The preset first threshold is preferably half of the total number of storage units on all available nodes (nodes that have not failed) in the distributed file storage system. When the number of failed storage units exceeds the first threshold, data transmission and access in the distributed file storage system are considered prone to anomalies (for example, data cannot be accessed, or the data accessed is incorrect) and the reliability of its data is low; the distributed file storage system is then determined to have failed and is taken out of service.
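A small worked example of the first-threshold check in step S40, using the preferred value of half the storage units on available nodes. The function name and the figures are illustrative only.

```python
def system_has_failed(failed_units: int, units_on_available_nodes: int) -> bool:
    """Step S40 with the preferred first threshold: half the units on available nodes."""
    first_threshold = units_on_available_nodes / 2
    return failed_units > first_threshold  # strictly greater, as the step is worded


# Example: 4 available nodes with 3 storage units each give a threshold of 6.
total_units = 4 * 3
six_failed = system_has_failed(6, total_units)    # exactly half: not yet failed
seven_failed = system_has_failed(7, total_units)  # more than half: system failed
```

Because the comparison is strict, a system with exactly half of its storage units failed is still treated as running under this reading.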
In the storage unit failure detection method of this embodiment, when the number of failed storage units in the distributed file storage system exceeds the first threshold, the system is determined to have failed and is stopped, which prevents the data loss and abnormal data access that would result from continuing to run it.
Referring to Fig. 4, Fig. 4 is a schematic flowchart of the fourth embodiment of the storage unit failure detection method of the present invention.
The fourth embodiment builds on any of the first to third embodiments. In the fourth embodiment, before step S10, the storage unit failure detection method further comprises:
Step S50: control the nodes in the distributed file storage system to send detection packets to one another;
In this embodiment, the nodes can be controlled to send detection packets to one another, so that the nodes in the distributed file storage system check one another's running status.
Step S60: take each node in the distributed file storage system in turn as the second node and the other nodes as first nodes, in order to determine the validity of the second node;
For example, suppose the distributed file storage system has four nodes A, B, C and D. With node B as the second node and nodes A, C and D as first nodes, it is judged whether node B is valid. Once that has been judged, node C can be checked next according to a preset order, and so on until all nodes have been checked.
Step S70: determine the number of first nodes that do not receive a response packet within a preset first time interval, the response packet being fed back by the second node in response to the detection packet sent by the first node;
In this embodiment, when the second node receives a packet, it parses the packet to determine its type and, if the packet is a detection packet, feeds back a response packet to the first node. A first node may fail to receive the response packet sent by the second node in several situations: (a) the communication link has failed; (b) the first node is faulty and did not send the detection packet; (c) the second node is faulty and did not send the response packet.
In this embodiment, the number of first nodes that do not receive a response packet fed back by the second node can be determined in either of the following ways. (a) When a first node does not receive a response packet within the preset first time interval, the second node is recorded as an untrusted node with respect to that first node, and the identifier of the first node (such as its name or code) is recorded; the number of recorded first-node identifiers is the number of first nodes that did not receive a response packet from the second node. (b) When a first node does not receive a response packet within the preset first time interval, the second node is recorded as an untrusted node. Recording an untrusted node can be implemented in several ways. For example, a trusted node database and an untrusted node database can be set up, and when the second node is recorded as an untrusted node, its identifier (such as its name or code) is added to the untrusted node database. Alternatively, when the second node is recorded as an untrusted node, an untrusted mark is added to the second node, and the number of times the second node has been recorded as an untrusted node is obtained; that number is the number of first nodes that did not receive a response packet from the second node.
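Option (a) above, recording the identifier of each first node that received no response, can be sketched with an in-memory registry. The class and method names are hypothetical, and a real system might use the trusted and untrusted node databases mentioned in the text instead.

```python
from collections import defaultdict
from typing import Dict, Set


class UntrustedRegistry:
    """Tracks, per second node, which first nodes received no response packet."""

    def __init__(self) -> None:
        self._reporters: Dict[str, Set[str]] = defaultdict(set)

    def record_timeout(self, first_node: str, second_node: str) -> None:
        # A set is used so that a first node reporting twice is counted only once.
        self._reporters[second_node].add(first_node)

    def unanswered_count(self, second_node: str) -> int:
        # Number of distinct first nodes that got no response from `second_node`.
        return len(self._reporters.get(second_node, ()))


# Usage: nodes A and C both time out waiting for B; A's report is repeated.
registry = UntrustedRegistry()
registry.record_timeout("A", "B")
registry.record_timeout("C", "B")
registry.record_timeout("A", "B")
```

Storing identifiers rather than a bare counter keeps the count equal to the number of distinct first nodes, which is the quantity step S70 asks for.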
Step S80: when the number of first nodes that do not receive a response packet is greater than a preset second threshold, record the second node as a failed node and shield the second node.
A failed node cannot be used, so failure detection on the storage units in a failed node is meaningless, and a node contains a large number of storage units. To improve the efficiency of the storage unit failure detection method, this embodiment therefore shields the failed nodes that are detected, so that the operation identifiers of their storage units are not obtained and meaningless detection is avoided. The second threshold can be set by the user; preferably it is half the number of first nodes, so that the second node is recorded as a failed node and shielded when most of the first nodes have not received a response packet fed back by the second node.
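Step S80 with the preferred second threshold can be sketched as below. The input format (a mapping from node name to the number of first nodes that received no response from it) and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Set


def find_failed_nodes(unanswered: Dict[str, int], nodes: List[str]) -> Set[str]:
    """Return the nodes whose unanswered count exceeds half the number of first nodes."""
    # Every node other than the second node acts as a first node.
    second_threshold = (len(nodes) - 1) / 2
    return {n for n in nodes if unanswered.get(n, 0) > second_threshold}


def shield(cluster: Dict[str, List[str]], failed_nodes: Set[str]) -> Dict[str, List[str]]:
    """Drop failed nodes so that their storage units are never scanned."""
    return {node: units for node, units in cluster.items() if node not in failed_nodes}


# Usage with the four-node example: B was unanswered for 2 of 3 first nodes, C for 1.
nodes = ["A", "B", "C", "D"]
failed_nodes = find_failed_nodes({"B": 2, "C": 1}, nodes)
scannable = shield({n: ["u1", "u2"] for n in nodes}, failed_nodes)
```

With three first nodes the threshold is 1.5, so B (2 unanswered) is shielded while C (1 unanswered) stays in the scan.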
The storage unit failure detection method proposed in this embodiment first detects and shields the failed nodes in the distributed file storage system before obtaining in turn the operation identifiers of the storage units of each node. The operation identifiers of the storage units of failed nodes are not obtained, and storage unit failure detection is not performed on failed nodes, which significantly improves the efficiency of storage unit failure detection.
Referring to Fig. 5, Fig. 5 is a schematic flowchart of the fifth embodiment of the storage unit failure detection method of the present invention.
The fifth embodiment builds on the fourth embodiment. In the fifth embodiment, after step S80 and before step S10, the storage unit failure detection method further comprises:
Step S90: determine the number of failed nodes in the distributed file storage system;
Step S100: when the number of failed nodes in the distributed file storage system is less than a preset third threshold, determine that the distributed file storage system is valid.
In this embodiment, the preset third threshold is preferably half of the number of nodes in the distributed file storage system. When most of the nodes in the distributed file storage system are unavailable (most nodes are failed nodes), the system is considered unable to transmit data and is determined to have failed; it is then unavailable, and failure detection on its storage units is no longer meaningful. Only when the number of failed nodes is less than the third threshold is the distributed file storage system determined to be valid, and only then is failure detection on its storage units meaningful. After failed nodes are recorded and the distributed file storage system is determined to have failed, a maintenance request can be sent to a maintenance terminal (such as a server or a terminal carried by maintenance personnel), so that the failed nodes and the distributed file storage system are restored to normal in time.
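The ordering described in this embodiment, checking system validity first and scanning storage units only if the system is valid, can be sketched as a small gate. The `scan` callback stands in for the storage unit scan of steps S10 and S20, and all names here are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, TypeVar

T = TypeVar("T")


def detect_if_valid(
    cluster: Dict[str, List[str]],
    failed_nodes: Set[str],
    scan: Callable[[Dict[str, List[str]]], T],
) -> Optional[T]:
    """Run the storage-unit scan only when fewer than half of the nodes have failed."""
    third_threshold = len(cluster) / 2  # preferred value: half the number of nodes
    if not len(failed_nodes) < third_threshold:
        return None  # system judged failed: scanning storage units would be pointless
    healthy = {n: u for n, u in cluster.items() if n not in failed_nodes}
    return scan(healthy)  # steps S10/S20 run on the non-shielded nodes only


# Usage: with one failed node out of four the scan runs; with two it is skipped.
cluster = {"A": ["u1"], "B": ["u1"], "C": ["u1"], "D": ["u1"]}
scanned = detect_if_valid(cluster, {"B"}, lambda c: sorted(c))
skipped = detect_if_valid(cluster, {"B", "C"}, lambda c: sorted(c))
```

Under a strict reading of "less than the third threshold", a system with exactly half of its nodes failed is not treated as valid, which is why the second call above skips the scan.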
The storage unit failure detection method of this embodiment first determines whether the distributed file storage system is available before obtaining in turn the operation identifiers of the storage units of each node. Failure detection is performed on the storage units of the nodes only when the system is available, which avoids meaningless storage unit failure detection on a failed distributed file storage system.
The present invention also provides a storage unit failure detection device for a distributed file storage system (hereinafter referred to as the storage unit failure detection device).
Referring to Fig. 6, Fig. 6 is a functional block diagram of the first embodiment of the storage unit failure detection device of the present invention.
In the first embodiment, the storage unit failure detection device comprises:
An acquisition module 10, configured to obtain in turn the operation identifier of each storage unit of each node and, when obtaining the operation identifier of a storage unit fails, to continue obtaining the operation identifiers of the other storage units of the node where that storage unit is located, or to obtain in turn the operation identifiers of the storage units of other nodes;
In a distributed file storage system, each node contains multiple storage units, and each running storage unit has a unique operation identifier (for example, the process number of its running process). While the distributed file storage system is running, every node is working; if a storage unit in a node is running, its operation identifier can be obtained on that node, and if a storage unit is not running, its operation identifier cannot be obtained on that node. The acquisition module 10 obtains in turn the operation identifier of each storage unit of each node, thereby determining which storage units are running and which are not running, that is, which storage units are faulty.
A recording module 20, configured to record a storage unit as a failed storage unit when obtaining its operation identifier fails.
When the acquisition module 10 fails to obtain the operation identifier of a storage unit, that is, when the acquisition module 10 cannot obtain the operation identifier, the storage unit is not running; it is faulty and cannot be used, so the recording module 20 records it as a failed storage unit. The acquisition module 10 then continues to obtain the operation identifiers of the other storage units of the same node or of the storage units of other nodes. After the recording module 20 records a failed storage unit, a maintenance request can be sent to a maintenance terminal (such as a server or a terminal carried by maintenance personnel) as a timely reminder to repair or replace the failed storage unit, so as to ensure the reliability of the distributed file storage system.
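The division of work between the acquisition module 10 and the recording module 20 might be sketched as two small classes. This is a hedged illustration rather than the patented device: the class interfaces, the `probe` callback, and the `notify` callback that stands in for the maintenance request are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple


class RecordingModule:
    """Records failed storage units and forwards a maintenance request for each."""

    def __init__(self, notify: Callable[[str, str], None]) -> None:
        self.failed: List[Tuple[str, str]] = []
        self._notify = notify  # hypothetical hook toward a maintenance terminal

    def record(self, node: str, unit: str) -> None:
        self.failed.append((node, unit))
        self._notify(node, unit)


class AcquisitionModule:
    """Obtains operation identifiers in turn and hands failures to the recorder."""

    def __init__(self, probe: Callable[[str, str], Optional[int]], recorder: RecordingModule) -> None:
        self._probe = probe
        self._recorder = recorder

    def run(self, cluster: Dict[str, List[str]]) -> None:
        for node, units in cluster.items():
            for unit in units:
                if self._probe(node, unit) is None:
                    self._recorder.record(node, unit)  # acquisition continues afterwards


# Usage: unit u2 on node A is down, so one maintenance request is issued.
requests: List[str] = []
recorder = RecordingModule(lambda n, u: requests.append(f"{n}/{u}"))
acquirer = AcquisitionModule(lambda n, u: None if (n, u) == ("A", "u2") else 1, recorder)
acquirer.run({"A": ["u1", "u2"], "B": ["u1"]})
```

Keeping notification inside the recording module means every recorded failure produces exactly one maintenance request, regardless of how the scan is driven.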
The storage-unit-failure checkout gear that the present embodiment proposes, the operation of obtaining successively memory cell in each node by acquisition module 10 identifies, to determine failed storage unit and to carry out record by logging modle 20, can effectively detect the failed storage unit in the node of distributed file storage system, for user, in time the memory cell losing efficacy is safeguarded, guaranteed the reliability of distributed file storage system.
With reference to Fig. 7, the high-level schematic functional block diagram of storage-unit-failure checkout gear the second embodiment that Fig. 7 is distributed file storage system of the present invention.
The scheme of the scheme of the second embodiment based on the first embodiment, in a second embodiment, the logging modle 20 of described storage-unit-failure checkout gear comprises:
Restart unit 21, while obtaining unsuccessfully for the operation sign in memory cell, restart this memory cell;
Some faults that stop a storage unit from running can be cleared by a restart, after which the unit resumes normal operation. Therefore, when acquisition module 10 fails to obtain a storage unit's operation mark, restart unit 21 restarts that unit so that recoverable units return to normal operation immediately. This keeps as many storage units as possible running, maximizes the reliability of the distributed file storage system, and reduces the workload of maintenance personnel.
Recording unit 22, configured to record the storage unit as a failed storage unit if the restart fails within a preset first time interval.
A storage unit whose fault can be cleared by restart unit 21 will usually restart successfully within a specified time (the first time interval), whereas a unit whose fault cannot be cleared this way will not. The result of the restart operation is returned once the first time interval has elapsed since the restart was issued to the faulty unit. If the result is a failure (the unit did not restart within the preset first time interval), recording unit 22 records the unit as a failed storage unit. If the result is a success (the unit restarted within the preset first time interval), the unit has resumed normal operation; acquisition module 10 can now obtain its operation mark and judges the unit to be valid, and then continues to obtain the operation marks of the other storage units on the same node, or goes on to obtain those of the storage units on other nodes in turn.
In the storage unit failure detection device of this embodiment, when the operation mark of a storage unit cannot be obtained, restart unit 21 restarts the unit, so that units whose faults can be cleared by a restart return to normal operation immediately, while recording unit 22 records as failed storage units those whose faults a restart cannot clear. This keeps as many storage units as possible running, ensures the reliability of the distributed file storage system, and reduces the workload of maintenance personnel.
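A minimal sketch of the restart-then-record logic follows. It assumes the operation mark is re-polled until the first time interval expires, which is one possible reading; the text only says that the restart result is returned after the interval. The names `check_unit_with_restart`, `get_mark`, `restart` and `FakeUnit` are hypothetical.

```python
import time
from typing import Callable, Optional


def check_unit_with_restart(get_mark: Callable[[], Optional[str]],
                            restart: Callable[[], None],
                            first_interval_s: float,
                            poll_s: float = 0.01) -> bool:
    """Return True if the unit is running (possibly after a restart),
    or False if it should be recorded as a failed storage unit."""
    if get_mark() is not None:
        return True                      # mark obtained, nothing to do
    restart()                            # role of restart unit 21
    deadline = time.monotonic() + first_interval_s
    while time.monotonic() < deadline:
        if get_mark() is not None:
            return True                  # restart succeeded within the interval
        time.sleep(poll_s)
    return False                         # recording unit 22 records a failure


class FakeUnit:
    """Stand-in for a storage unit; 'recovers' decides whether a restart helps."""

    def __init__(self, recovers: bool) -> None:
        self.recovers = recovers
        self.up = False

    def mark(self) -> Optional[str]:
        return "running" if self.up else None

    def restart(self) -> None:
        self.up = self.recovers


good, bad = FakeUnit(recovers=True), FakeUnit(recovers=False)
recovered = check_unit_with_restart(good.mark, good.restart, first_interval_s=0.05)
dead = check_unit_with_restart(bad.mark, bad.restart, first_interval_s=0.05)
print(recovered, dead)  # True False
```

Using a monotonic clock for the deadline avoids surprises if the wall clock is adjusted while the unit is restarting.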
Referring to Fig. 8, Fig. 8 is a functional block diagram of the third embodiment of the storage unit failure detection device of the distributed file storage system of the present invention.
The third embodiment builds on the first or second embodiment. In the third embodiment, the storage unit failure detection device further comprises:
First determination module 30, configured to obtain the number of failed storage units in the distributed file storage system;
After acquisition module 10 has obtained the operation marks of the storage units of each node in turn and recording module 20 has recorded all failed storage units, first determination module 30 determines the total number of failed storage units recorded.
Second determination module 40, configured to determine that the distributed file storage system has failed when the number of failed storage units in the system is greater than a first threshold.
The preset first threshold is preferably half of the total number of storage units on all available (non-failed) nodes in the distributed file storage system. When the number of failed storage units exceeds the first threshold, second determination module 40 considers that data transfer and access in the system are prone to anomalies (for example, data cannot be accessed, or the data accessed is incorrect) and that the reliability of the stored data is low. It therefore determines that the distributed file storage system has failed, and the system stops running.
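The first-threshold check is a simple comparison. The sketch below assumes the preferred threshold of half the storage units on available nodes; the function name `system_failed_by_units` and its parameters are hypothetical.

```python
def system_failed_by_units(failed_units: int, units_on_available_nodes: int) -> bool:
    """Return True when failed storage units exceed the first threshold,
    taken here as half of all storage units on available (non-failed) nodes."""
    first_threshold = units_on_available_nodes / 2
    return failed_units > first_threshold


# With 8 storage units, the threshold is 4: 5 failures trip it, 4 do not.
print(system_failed_by_units(5, 8))  # True
print(system_failed_by_units(4, 8))  # False
```

Note that the comparison is strict ("greater than"), so exactly half of the units failing does not yet mark the system as failed.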
Referring to Fig. 9, Fig. 9 is a functional block diagram of the fourth embodiment of the storage unit failure detection device of the distributed file storage system of the present invention.
The fourth embodiment builds on any one of the first to third embodiments. In the fourth embodiment, the storage unit failure detection device further comprises:
Control module 50, configured to control the nodes of the distributed file storage system to send detection data packets to one another;
In this embodiment, control module 50 controls the nodes to send detection data packets to one another, so that the nodes of the distributed file storage system can check one another's running state.
Node validity detection module 60, configured to take each node of the distributed file storage system in turn as the second node, with the other nodes as first nodes, in order to determine the validity of the second node;
For example, suppose the distributed file storage system has four nodes A, B, C and D. Node validity detection module 60 takes node B as the second node and nodes A, C and D as first nodes, and judges whether node B is valid. Once that judgment is made, the module goes on to judge node C in the preset order, and so on until all nodes have been checked.
Third determination module 70, configured to determine, within a preset first time interval, the number of first nodes that have not received a response data packet, the response data packet being fed back by the second node in reply to the detection data packet sent by the first node;
In this embodiment, when the second node receives a data packet, it parses the packet to determine its type, and if the packet is a detection data packet it feeds a response data packet back to the first node. Because communication links can fail, a first node may fail to receive the response data fed back by the second node in several situations: (a) the communication link has failed; (b) the first node has failed and did not send a detection data packet; (c) the second node has failed and did not send a response data packet.
In this embodiment, third determination module 70 can determine the number of first nodes that have not received a response data packet fed back by the second node in either of the following ways. (a) When a first node does not receive a response data packet within the preset first time interval, the second node is recorded as an untrusted node with respect to that first node, and the identifier of the first node (such as its name or code) is recorded; the number of recorded first-node identifiers is the number of first nodes that have not received a response data packet fed back by the second node. (b) When a first node does not receive a response data packet within the preset first time interval, the second node is recorded as an untrusted node. Recording an untrusted node can be done in several ways. For example, a trusted-node database and an untrusted-node database are set up, and when the second node is recorded as untrusted, its identifier (such as its name or code) is added to the untrusted-node database. Alternatively, an untrusted flag is added to the second node each time it is recorded as untrusted, and the number of times the second node has been recorded as untrusted is obtained; that count is the number of first nodes that have not received a response data packet fed back by the second node.
Shielding module 80, configured to record the second node as a failed node and shield it when the number of first nodes that have not received a response data packet is greater than a preset second threshold.
A failed node cannot be used, so detecting storage unit failures on it is pointless, and a node contains many storage units. To improve the efficiency of storage unit failure detection, shielding module 80 therefore shields each failed node that is detected, so that the operation marks of its storage units are not obtained and meaningless detection is avoided. The second threshold can be set by the user; preferably it is half the number of first nodes, so that shielding module 80 records the second node as a failed node and shields it when a majority of first nodes have not received the response data packet fed back by the second node.
In the storage unit failure detection device proposed in this embodiment, before acquisition module 10 obtains the operation marks of the storage units of each node in turn, the failed nodes in the distributed file storage system are first detected and shielded by shielding module 80. The operation marks of the storage units of failed nodes are not obtained and no storage unit failure detection is performed on them, which significantly improves the efficiency of storage unit failure detection.
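The node check of this embodiment amounts to a majority vote among the first nodes. The sketch below abstracts the network exchange into a hypothetical `got_response(first, second)` callback that reports whether `first` received a response data packet from `second` within the first time interval; that callback, the function name `find_failed_nodes`, and the choice of half the first nodes as the second threshold (the preferred value in the text) are assumptions for illustration.

```python
from typing import Callable, List, Set


def find_failed_nodes(nodes: List[str],
                      got_response: Callable[[str, str], bool]) -> Set[str]:
    """Take each node in turn as the second node and count the first nodes
    that did not receive its response data packet."""
    failed: Set[str] = set()
    for second in nodes:
        firsts = [n for n in nodes if n != second]
        silent = sum(1 for first in firsts if not got_response(first, second))
        second_threshold = len(firsts) / 2   # preferred value: half the first nodes
        if silent > second_threshold:
            failed.add(second)               # record as a failed node
    return failed


# Made-up scenario: node B is down, so no exchange involving B completes.
all_nodes = ["A", "B", "C", "D"]
failed_nodes = find_failed_nodes(all_nodes, lambda first, second: "B" not in (first, second))

# Shielding: failed nodes are skipped when storage units are scanned later.
scan_targets = [n for n in all_nodes if n not in failed_nodes]
print(failed_nodes, scan_targets)  # {'B'} ['A', 'C', 'D']
```

In this scenario each healthy node is missing only one response (from B), which is below the threshold of 1.5, so only B is shielded.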
Referring to Fig. 10, Fig. 10 is a functional block diagram of the fifth embodiment of the storage unit failure detection device of the distributed file storage system of the present invention.
The fifth embodiment builds on the fourth embodiment. In the fifth embodiment, the storage unit failure detection device further comprises:
Fourth determination module 90, configured to determine the number of failed nodes in the distributed file storage system;
Fifth determination module 100, configured to determine that the distributed file storage system is effective when the number of failed nodes in the system is less than a preset third threshold.
In this embodiment, the preset third threshold is preferably half the number of nodes in the distributed file storage system. When most of the nodes are unavailable (that is, most nodes are failed nodes), fifth determination module 100 considers that the system cannot transfer data and determines that it has failed; the system is then unavailable, and failure detection on its storage units is pointless. Only when the number of failed nodes is less than the third threshold does fifth determination module 100 determine that the system is effective, and only then is failure detection on its storage units meaningful. After failed nodes have been recorded and the system has been determined to have failed, a maintenance request can be sent to a maintenance terminal (for example, a server or a terminal carried by maintenance personnel) so that the failed nodes and the distributed file storage system are restored promptly.
In the storage unit failure detection device of this embodiment, before acquisition module 10 obtains the operation marks of the storage units of each node in turn, fifth determination module 100 first determines whether the distributed file storage system is available. Failure detection is performed on the storage units of its nodes only when the system is determined to be available, which avoids meaningless storage unit failure detection on a system that has already failed.
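The gate described in this embodiment can be sketched as a single comparison that decides whether the storage unit sweep should run at all. The function name `system_is_effective` is hypothetical, and the third threshold is taken as half the node count, the preferred value stated above.

```python
def system_is_effective(failed_nodes: int, total_nodes: int) -> bool:
    """Return True when the number of failed nodes is below the third
    threshold, taken here as half of all nodes in the system."""
    third_threshold = total_nodes / 2
    return failed_nodes < third_threshold


# With 4 nodes the threshold is 2: one failed node is acceptable, two are not.
print(system_is_effective(1, 4))  # True
print(system_is_effective(2, 4))  # False
```

Only when this check passes would the storage unit failure detection be worth running; otherwise a maintenance request is the more useful next step.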
The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A storage unit failure detection method for a distributed file storage system, characterized in that the method comprises the following steps:
Obtaining the operation marks of the storage units of each node in turn;
When the operation mark of a storage unit cannot be obtained, recording the storage unit as a failed storage unit, and continuing to obtain the operation marks of the other storage units on the node where that storage unit is located, or obtaining the operation marks of the storage units of other nodes in turn.
2. The storage unit failure detection method for a distributed file storage system as claimed in claim 1, characterized in that recording the storage unit as a failed storage unit when its operation mark cannot be obtained comprises:
When the operation mark of the storage unit cannot be obtained, restarting the storage unit;
If the storage unit fails to restart within a preset first time interval, recording the storage unit as a failed storage unit.
3. The storage unit failure detection method for a distributed file storage system as claimed in claim 1, characterized in that, after the step of recording the storage unit as a failed storage unit when its operation mark cannot be obtained, and continuing to obtain the operation marks of the other storage units in the node in turn, or obtaining the operation marks of the storage units of other nodes in turn, the method further comprises:
Determining the number of failed storage units in the distributed file storage system;
When the number of failed storage units in the distributed file storage system is greater than a first threshold, determining that the distributed file storage system has failed.
4. The storage unit failure detection method for a distributed file storage system as claimed in any one of claims 1-3, characterized in that, before the step of obtaining the operation marks of the storage units of each node in turn, the method further comprises:
Controlling the nodes in the distributed file storage system to send detection data packets to one another;
Taking each node in the distributed file storage system in turn as the second node, with the other nodes as first nodes, to determine the validity of the second node;
Within a preset first time interval, determining the number of first nodes that have not received a response data packet, the response data packet being fed back by the second node in reply to the detection data packet sent by the first node;
When the number of first nodes that have not received a response data packet is greater than a preset second threshold, recording the second node as a failed node and shielding the second node.
5. The storage unit failure detection method for a distributed file storage system as claimed in claim 4, characterized in that, after the step of recording the second node as a failed node and shielding the second node when the number of first nodes that have not received a response data packet is greater than the preset second threshold, the method further comprises:
Determining the number of failed nodes in the distributed file storage system;
When the number of failed nodes in the distributed file storage system is less than a preset third threshold, determining that the distributed file storage system is effective.
6. A storage unit failure detection device for a distributed file storage system, characterized in that the device comprises:
An acquisition module, configured to obtain the operation marks of the storage units of each node in turn, and, when the operation mark of a storage unit cannot be obtained, to continue to obtain the operation marks of the other storage units on the node where that storage unit is located, or to obtain the operation marks of the storage units of other nodes in turn;
A recording module, configured to record a storage unit as a failed storage unit when its operation mark cannot be obtained.
7. The storage unit failure detection device for a distributed file storage system as claimed in claim 6, characterized in that the recording module comprises:
A restart unit, configured to restart a storage unit when its operation mark cannot be obtained;
A recording unit, configured to record the storage unit as a failed storage unit if it fails to restart within a preset first time interval.
8. The storage unit failure detection device for a distributed file storage system as claimed in claim 6, characterized in that the device further comprises:
A first determination module, configured to obtain the number of failed storage units in the distributed file storage system;
A second determination module, configured to determine that the distributed file storage system has failed when the number of failed storage units in the system is greater than a first threshold.
9. The storage unit failure detection device for a distributed file storage system as claimed in any one of claims 6-8, characterized in that the device further comprises:
A control module, configured to control the nodes of the distributed file storage system to send detection data packets to one another;
A node validity detection module, configured to take each node of the distributed file storage system in turn as the second node, with the other nodes as first nodes, to determine the validity of the second node;
A third determination module, configured to determine, within a preset first time interval, the number of first nodes that have not received a response data packet, the response data packet being fed back by the second node in reply to the detection data packet sent by the first node;
A shielding module, configured to record the second node as a failed node and shield the second node when the number of first nodes that have not received a response data packet is greater than a preset second threshold.
10. The storage unit failure detection device for a distributed file storage system as claimed in claim 9, characterized in that the device further comprises:
A fourth determination module, configured to determine the number of failed nodes in the distributed file storage system;
A fifth determination module, configured to determine that the distributed file storage system is effective when the number of failed nodes in the system is less than a preset third threshold.
CN201410333913.5A 2014-07-14 2014-07-14 The storage-unit-failure detection method and device of distributed file storage system Active CN104158843B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410333913.5A CN104158843B (en) 2014-07-14 2014-07-14 The storage-unit-failure detection method and device of distributed file storage system

Publications (2)

Publication Number Publication Date
CN104158843A (en) 2014-11-19
CN104158843B (en) 2018-01-12

Family

ID=51884248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410333913.5A Active CN104158843B (en) 2014-07-14 2014-07-14 The storage-unit-failure detection method and device of distributed file storage system

Country Status (1)

Country Link
CN (1) CN104158843B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101465880A (en) * 2007-12-18 2009-06-24 卢森特技术有限公司 Reliable storage of data in a distributed storage system
CN102571845A (en) * 2010-12-20 2012-07-11 南京中兴新软件有限责任公司 Data storage method and device of distributed storage system
CN103455395A (en) * 2013-08-08 2013-12-18 华为技术有限公司 Method and device for detecting hard disk failures
CN103490919A (en) * 2013-09-02 2014-01-01 用友软件股份有限公司 Fault management system and fault management method
CN103500140A (en) * 2013-09-27 2014-01-08 浪潮电子信息产业股份有限公司 Method for rapidly learning invalidation of distributed cluster nodes

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105912446A (en) * 2016-04-29 2016-08-31 深圳市永兴元科技有限公司 Failure detection processing method and system for distributed data system
CN105975212A (en) * 2016-04-29 2016-09-28 深圳市永兴元科技有限公司 Failure detection processing method and device for distributed data system
CN106649555A (en) * 2016-11-08 2017-05-10 深圳市中博睿存科技有限公司 Memory unit state marking method and distributed memory system
CN109213637A (en) * 2018-11-09 2019-01-15 浪潮电子信息产业股份有限公司 Data reconstruction method, device and the medium of distributed file system clustered node
CN109213637B (en) * 2018-11-09 2022-03-04 浪潮电子信息产业股份有限公司 Data recovery method, device and medium for cluster nodes of distributed file system

Also Published As

Publication number Publication date
CN104158843B (en) 2018-01-12

Similar Documents

Publication Publication Date Title
CN108847982B (en) Distributed storage cluster and node fault switching method and device thereof
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
CN104731670B (en) A kind of rotation formula spaceborne computer tolerant system towards satellite
CN103458086B (en) A kind of smart mobile phone and fault detection method thereof
CN103491134B (en) A kind of method of monitoring of containers, device and proxy server
CN105659215A (en) Fault processing method, related device and computer
CN111478796B (en) Cluster capacity expansion exception handling method for AI platform
US7886181B2 (en) Failure recovery method in cluster system
CN104158843A (en) Storage unit invalidation detecting method and device for distributed file storage system
CN111901176B (en) Fault determination method, device, equipment and storage medium
CN103823708A (en) Virtual machine read-write request processing method and device
CN111538613B (en) Cluster system exception recovery processing method and device
CN110659147B (en) Self-repairing method and system based on module self-checking behavior
CN106874126A (en) Host process method for detecting abnormality in a kind of software development
WO2023240944A1 (en) Data recovery method and apparatus, electronic device, and storage medium
CN115766402B (en) Method and device for filtering server fault root cause, storage medium and electronic device
CN105224426A (en) Physical host fault detection method, device and empty machine management method, system
CN116737444A (en) Database server fault processing method and system
CN105988885B (en) Operating system failure self-recovery method based on compensation rollback
US10674337B2 (en) Method and device for processing operation for device peripheral
CN111104266A (en) Access resource allocation method and device, storage medium and electronic equipment
CN105843336A (en) Rack with a plurality of rack management modules and method for updating firmware thereof
CN101369238A (en) Exception monitoring and reset processing method for USB equipment
CN108897645B (en) Database cluster disaster tolerance method and system based on standby heartbeat disk
CN107590647A (en) The servo supervisory systems of ship-handling system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20141119

Assignee: Liu Yi

Assignor: Shenzhen Zhongbo Kechuang Information Technology Co., Ltd.

Contract record no.: 2014440020487

Denomination of invention: Storage unit invalidation detecting method and device for distributed file storage system

License type: Common License

Record date: 20141230

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EC01 Cancellation of recordation of patent licensing contract

Assignee: Liu Yi

Assignor: Shenzhen Zhongbo Kechuang Information Technology Co., Ltd.

Contract record no.: 2014440020487

Date of cancellation: 20161025

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190904

Address after: 100089 Floor 1-4, No. 2 Building, No. 9 Courtyard, Dijin Road, Haidian District, Beijing

Patentee after: Beijing Toyou Feiji Electronics Co., Ltd.

Address before: 518000 Room 1402, Feiyada Science and Technology Building, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen Zhongbo Kechuang Information Technology Co., Ltd.
