CN104158843B - Storage unit failure detection method and device for a distributed file storage system

Info

Publication number
CN104158843B
Authority
CN
China
Prior art keywords
node
memory cell
storage system
distributed file
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410333913.5A
Other languages
Chinese (zh)
Other versions
CN104158843A (en)
Inventor
李璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Toyou Feiji Electronics Co., Ltd.
Original Assignee
SHENZHEN ZHONGBO KECHUANG INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN ZHONGBO KECHUANG INFORMATION TECHNOLOGY Co Ltd
Priority to CN201410333913.5A
Publication of CN104158843A
Application granted
Publication of CN104158843B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a storage unit failure detection method for a distributed file storage system. The method includes: obtaining the running identifiers of the storage units of each node in turn; and, when obtaining the running identifier of a storage unit fails, recording that storage unit as a failed storage unit and continuing to obtain the running identifiers of the other storage units of the node where that storage unit is located, or obtaining in turn the running identifiers of the storage units of other nodes. The invention also discloses a storage unit failure detection device for a distributed file storage system. By obtaining the running identifiers of the storage units in each node in turn to determine and record failed storage units, the method and device can effectively detect failed storage units in the nodes of the distributed file storage system, so that users can maintain the failed storage units in time, ensuring the reliability of the distributed file storage system.

Description

Storage unit failure detection method and device for a distributed file storage system
Technical field
The present invention relates to the field of failure detection for distributed file systems, and in particular to a storage unit failure detection method and device for a distributed file storage system.
Background art
In recent years, networked distributed storage has become a new trend in the development of storage technology, and the distributed file system is an essential part of building a large-scale distributed storage system. Because data is distributed across the storage units of different storage nodes, even if several storage units fail and become unavailable, the data still exists on storage units of other nodes, so an accessing node can still access the data normally; this gives the data high reliability. However, although the data is backed up on other storage units, data may be lost when the number of failed storage units keeps accumulating, so that the data can no longer be accessed normally and the distributed file storage system fails and becomes unavailable.
Therefore, a scheme for detecting storage unit failure is urgently needed, so that failed storage units in the distributed file storage system can be found in time and replaced promptly, ensuring the high reliability of the distributed file storage system.
Summary of the invention
The main object of the present invention is to solve the technical problem that a distributed file storage system cannot detect failed storage units.
To achieve the above object, the present invention provides a storage unit failure detection method for a distributed file storage system, which comprises the following steps:
obtaining the running identifiers of the storage units of each node in turn;
when obtaining the running identifier of a storage unit fails, recording the storage unit as a failed storage unit, and continuing to obtain the running identifiers of the other storage units of the node where the storage unit is located, or obtaining in turn the running identifiers of the storage units of other nodes.
Preferably, recording the storage unit as a failed storage unit when obtaining its running identifier fails includes:
when obtaining the running identifier of the storage unit fails, restarting the storage unit;
if the storage unit fails to restart within a preset first time interval, recording the storage unit as a failed storage unit.
Preferably, after the step of recording the storage unit as a failed storage unit when obtaining its running identifier fails, and continuing to obtain in turn the running identifiers of the other storage units in the node, or obtaining in turn the running identifiers of the storage units of other nodes, the storage unit failure detection method further includes:
determining the number of failed storage units in the distributed file storage system;
when the number of failed storage units in the distributed file storage system is greater than a first threshold, determining that the distributed file storage system has failed.
Preferably, before the step of obtaining the running identifiers of the storage units of each node in turn, the storage unit failure detection method further includes:
controlling the nodes in the distributed file storage system to send detection packets to one another;
taking each node in the distributed file storage system in turn as a second node, with the other nodes as first nodes, to determine the validity of the second node;
determining the number of first nodes that have not received a response packet within the preset first time interval, the response packet being fed back by the second node in reply to the detection packet sent by the first node;
when the number of first nodes that have not received a response packet is greater than a preset second threshold, recording the second node as a failed node and shielding the second node.
Preferably, after the step of recording the second node as a failed node and shielding the second node when the number of first nodes that have not received a response packet is greater than the preset second threshold, the storage unit failure detection method further includes:
determining the number of failed nodes in the distributed file storage system;
when the number of failed nodes in the distributed file storage system is less than a preset third threshold, determining that the distributed file storage system is valid.
In addition, to achieve the above object, the present invention also provides a storage unit failure detection device for a distributed file storage system, which includes:
an acquisition module, configured to obtain the running identifiers of the storage units of each node in turn and, when obtaining the running identifier of a storage unit fails, to continue obtaining the running identifiers of the other storage units of the node where the storage unit is located, or to obtain in turn the running identifiers of the storage units of other nodes;
a recording module, configured to record the storage unit as a failed storage unit when obtaining its running identifier fails.
Preferably, the recording module includes:
a restart unit, configured to restart the storage unit when obtaining its running identifier fails;
a recording unit, configured to record the storage unit as a failed storage unit if the storage unit fails to restart within the preset first time interval.
Preferably, the storage unit failure detection device further includes:
a first determining module, configured to obtain the number of failed storage units in the distributed file storage system;
a second determining module, configured to determine that the distributed file storage system has failed when the number of failed storage units in the distributed file storage system is greater than the first threshold.
Preferably, the storage unit failure detection device further includes:
a control module, configured to control the nodes in the distributed file storage system to send detection packets to one another;
a node validity detection module, configured to take each node in the distributed file storage system in turn as a second node, with the other nodes as first nodes, to determine the validity of the second node;
a third determining module, configured to determine the number of first nodes that have not received a response packet within the preset first time interval, the response packet being fed back by the second node in reply to the detection packet sent by the first node;
a shielding module, configured to record the second node as a failed node and shield the second node when the number of first nodes that have not received a response packet is greater than the preset second threshold.
Preferably, the storage unit failure detection device further includes:
a fourth determining module, configured to determine the number of failed nodes in the distributed file storage system;
a fifth determining module, configured to determine that the distributed file storage system is valid when the number of failed nodes in the distributed file storage system is less than the preset third threshold.
With the storage unit failure detection method and device of the present invention, the running identifiers of the storage units in each node are obtained in turn to determine and record failed storage units. Failed storage units in the nodes of the distributed file storage system can thus be detected effectively, so that users can maintain them in time, ensuring the reliability of the distributed file storage system.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the first embodiment of the storage unit failure detection method for a distributed file storage system according to the present invention;
Fig. 2 is a schematic flowchart of the second embodiment of the storage unit failure detection method for a distributed file storage system according to the present invention;
Fig. 3 is a schematic flowchart of the third embodiment of the storage unit failure detection method for a distributed file storage system according to the present invention;
Fig. 4 is a schematic flowchart of the fourth embodiment of the storage unit failure detection method for a distributed file storage system according to the present invention;
Fig. 5 is a schematic flowchart of the fifth embodiment of the storage unit failure detection method for a distributed file storage system according to the present invention;
Fig. 6 is a functional module diagram of the first embodiment of the storage unit failure detection device for a distributed file storage system according to the present invention;
Fig. 7 is a functional module diagram of the second embodiment of the storage unit failure detection device for a distributed file storage system according to the present invention;
Fig. 8 is a functional module diagram of the third embodiment of the storage unit failure detection device for a distributed file storage system according to the present invention;
Fig. 9 is a functional module diagram of the fourth embodiment of the storage unit failure detection device for a distributed file storage system according to the present invention;
Fig. 10 is a functional module diagram of the fifth embodiment of the storage unit failure detection device for a distributed file storage system according to the present invention.
The realization of the object, the functional characteristics, and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description of the embodiments
It should be understood that the specific embodiments described herein are merely intended to explain the present invention and are not intended to limit it.
The present invention provides a storage unit failure detection method for a distributed file storage system (hereinafter referred to as the storage unit failure detection method).
Referring to Fig. 1, Fig. 1 is a schematic flowchart of the first embodiment of the storage unit failure detection method for a distributed file storage system according to the present invention.
In the first embodiment, the storage unit failure detection method includes:
Step S10: obtain the running identifiers of the storage units of each node in turn.
Each node in the distributed file storage system includes multiple storage units, and each storage unit, while running, has a unique corresponding running identifier (such as the process number of its running process). When the distributed file storage system is running, each node works and the storage units in the node run, so the running identifier corresponding to each storage unit in the node can be obtained; if a storage unit in the node is not running, its running identifier cannot be obtained from the node. By obtaining the running identifiers of the storage units of each node in turn, the storage units that are running in the node can be determined, as can those that are not running, that is, the storage units that have failed.
Step S20: when obtaining the running identifier of a storage unit fails, record the storage unit as a failed storage unit, and continue to obtain the running identifiers of the other storage units of the node where the storage unit is located, or obtain in turn the running identifiers of the storage units of other nodes.
When obtaining the running identifier of a storage unit fails, that is, when the running identifier of the storage unit cannot be obtained, the storage unit is not running, has failed, and cannot be used, so it is recorded as a failed storage unit. The method then continues to obtain the running identifiers of the other storage units of the node where the storage unit is located, or the running identifiers of the storage units of other nodes. After a failed storage unit is recorded, a maintenance request can be sent to a maintenance terminal (such as a server or a terminal carried by maintenance personnel) as a prompt to repair or replace the failed storage unit in time, to ensure the reliability of the distributed file storage system.
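Steps S10 and S20 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the cluster layout, the function name `scan_storage_units`, and the `get_run_id` lookup are all hypothetical, and the lookup is assumed to return `None` when the running identifier of a storage unit cannot be obtained.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical lookup: returns the running identifier (for example a process
# number) of a storage unit, or None when it cannot be obtained.
RunIdLookup = Callable[[str, str], Optional[int]]


def scan_storage_units(
    cluster: Dict[str, List[str]], get_run_id: RunIdLookup
) -> List[Tuple[str, str]]:
    """Visit every storage unit of every node in turn (S10) and record the
    units whose running identifier cannot be obtained (S20)."""
    failed: List[Tuple[str, str]] = []
    for node, units in cluster.items():
        for unit in units:
            if get_run_id(node, unit) is None:
                # Record the failed unit, then keep scanning the remaining
                # units of this node and of the other nodes.
                failed.append((node, unit))
    return failed


# Illustrative data only: unit "u2" on "node-a" has no running identifier.
cluster = {"node-a": ["u1", "u2"], "node-b": ["u3"]}
running = {("node-a", "u1"): 4101, ("node-b", "u3"): 4310}
failed_units = scan_storage_units(cluster, lambda n, u: running.get((n, u)))
print(failed_units)  # [('node-a', 'u2')]
```

The scan does not stop at the first failure, which mirrors the requirement in step S20 that acquisition continues after a failed storage unit is recorded.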
The storage unit failure detection method proposed in this embodiment obtains the running identifiers of the storage units in each node in turn to determine and record failed storage units. Failed storage units in the nodes of the distributed file storage system can thus be detected effectively, so that users can maintain them in time, ensuring the reliability of the distributed file storage system.
Referring to Fig. 2, Fig. 2 is a schematic flowchart of the second embodiment of the storage unit failure detection method for a distributed file storage system according to the present invention.
The second embodiment is based on the first embodiment. In the second embodiment, in step S20 of the storage unit failure detection method, recording the storage unit as a failed storage unit when obtaining its running identifier fails includes:
Step S21: when obtaining the running identifier of a storage unit fails, restart the storage unit.
Because the failure of some storage units to run can be resolved by a restart that returns them to normal operation, the storage unit is first restarted when obtaining its running identifier fails. Some storage units can thus recover normal operation immediately, so that the distributed file storage system keeps as many storage units running as possible, which maximizes the reliability of the running system and reduces the maintenance workload of maintenance personnel.
Step S22: if the storage unit fails to restart within a preset first time interval, record the storage unit as a failed storage unit.
Generally, a storage unit whose failure can be resolved by restarting will restart successfully within the specified time (the first time interval), while one whose failure cannot be resolved by restarting will not restart successfully within that time. The result of the restart operation is returned once the first time interval has elapsed after the restart operation is performed on the failed storage unit. If the returned result is a restart failure (that is, the storage unit failed to restart within the preset first time interval), the storage unit is recorded as a failed storage unit. If the returned result is a restart success (that is, the storage unit restarted successfully within the preset first time interval), the storage unit has recovered normal operation and its running identifier can now be obtained, so the storage unit is judged to be valid; the method then continues to obtain in turn the running identifiers of the other storage units of the node where the storage unit is located, or obtains in turn the running identifiers of the storage units of other nodes.
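The restart-then-record logic of steps S21 and S22 might look like the sketch below. It assumes hypothetical `restart` and `get_run_id` callables, and it polls for the running identifier until the first time interval expires; the polling loop is an assumption of this sketch, since the text only says that the restart result is returned after the interval.

```python
import time
from typing import Callable, Dict, Optional


def restart_or_record(
    unit: str,
    restart: Callable[[str], None],
    get_run_id: Callable[[str], Optional[int]],
    first_interval_s: float,
    poll_s: float = 0.01,
) -> bool:
    """Restart the unit (S21) and report whether it recovered within the
    first time interval. False means the caller should record the unit as a
    failed storage unit (S22)."""
    restart(unit)
    deadline = time.monotonic() + first_interval_s
    while time.monotonic() < deadline:
        if get_run_id(unit) is not None:
            return True  # restart succeeded; the unit is valid again
        time.sleep(poll_s)
    # Final check once the interval has elapsed.
    return get_run_id(unit) is not None


# Illustrative stand-ins: only "u1" comes back after a restart.
state: Dict[str, Optional[int]] = {"u1": None, "u2": None}


def fake_restart(unit: str) -> None:
    if unit == "u1":
        state[unit] = 5001  # new running identifier after a successful restart


recovered_u1 = restart_or_record("u1", fake_restart, state.get, 0.05)
recovered_u2 = restart_or_record("u2", fake_restart, state.get, 0.05)
print(recovered_u1, recovered_u2)  # True False
```

A unit for which the function returns `False` would be recorded as failed, while a recovered unit is treated as valid and scanning simply continues.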
In the storage unit failure detection method of this embodiment, a storage unit is restarted when obtaining its running identifier fails, so that a storage unit whose failure can be resolved by restarting recovers normal operation immediately, while a storage unit whose failure cannot be resolved by restarting is recorded as a failed storage unit. This keeps as many storage units of the distributed file storage system running as possible, ensures the reliability of the system, and reduces the maintenance workload of maintenance personnel.
Referring to Fig. 3, Fig. 3 is a schematic flowchart of the third embodiment of the storage unit failure detection method for a distributed file storage system according to the present invention.
The third embodiment is based on the first or second embodiment. In the third embodiment, after step S20, the storage unit failure detection method further includes:
Step S30: determine the number of failed storage units in the distributed file storage system.
After the running identifiers of the storage units of each node in the distributed file storage system have been obtained in turn and all failed storage units have been recorded, the total number of recorded failed storage units is determined.
Step S40: when the number of failed storage units in the distributed file storage system is greater than a first threshold, determine that the distributed file storage system has failed.
The preset first threshold is preferably half of the total number of storage units of all available nodes (nodes that have not failed) in the distributed file storage system. When the number of failed storage units in the distributed file storage system is greater than the first threshold, data transfer and access in the system are considered prone to anomalies (data cannot be accessed, the accessed data is incorrect, and so on) and the reliability of the system's data is low; the distributed file storage system is then determined to have failed and stops running.
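As a small worked example of steps S30 and S40, the check below uses the preferred first threshold, half of the total number of storage units on the available nodes. The function name and the per-node unit counts are hypothetical.

```python
from typing import Dict


def system_failed(failed_unit_count: int, units_per_available_node: Dict[str, int]) -> bool:
    """Return True when the failed storage units exceed the first threshold,
    taken here as half of all storage units on the available nodes."""
    total_units = sum(units_per_available_node.values())
    first_threshold = total_units / 2
    return failed_unit_count > first_threshold


# Three available nodes with 4 storage units each: the threshold is 6.
units = {"node-a": 4, "node-b": 4, "node-c": 4}
print(system_failed(6, units))  # False: 6 is not greater than 6
print(system_failed(7, units))  # True: 7 exceeds the threshold
```

The comparison is strict, following the wording "greater than the first threshold".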
In the storage unit failure detection method of this embodiment, when the number of failed storage units in the distributed file storage system exceeds the first threshold, the system is determined to have failed and is stopped, which prevents data loss and abnormal data access caused by the system continuing to run.
Referring to Fig. 4, Fig. 4 is a schematic flowchart of the fourth embodiment of the storage unit failure detection method for a distributed file storage system according to the present invention.
The fourth embodiment is based on any one of the first to third embodiments. In the fourth embodiment, before step S10, the storage unit failure detection method further includes:
Step S50: control the nodes in the distributed file storage system to send detection packets to one another.
In this embodiment, the nodes can be controlled to send detection packets to one another, so that the nodes in the distributed file storage system detect one another's running status.
Step S60: take each node in the distributed file storage system in turn as a second node, with the other nodes as first nodes, to determine the validity of the second node.
For example, suppose the distributed file storage system has four nodes A, B, C, and D. With node B as the second node, nodes A, C, and D are the first nodes, and it is judged whether node B is valid. After that, the method continues in a preset order to judge whether node C is valid, and so on until all nodes have been checked.
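The role rotation in step S60 can be illustrated with a short generator. The name `role_assignments` is hypothetical, and the preset order is assumed here to be simply the order of the input list.

```python
from typing import Iterator, List, Tuple


def role_assignments(nodes: List[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (second_node, first_nodes) pairs so that each node is checked
    once as the second node while all other nodes act as first nodes."""
    for second in nodes:
        firsts = [n for n in nodes if n != second]
        yield second, firsts


rounds = list(role_assignments(["A", "B", "C", "D"]))
print(rounds[1])  # ('B', ['A', 'C', 'D'])
```

With four nodes this produces four rounds, and the second round matches the example above in which B is the second node and A, C, and D are the first nodes.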
Step S70: determine the number of first nodes that have not received a response packet within the preset first time interval, the response packet being fed back by the second node in reply to the detection packet sent by the first node.
In this embodiment, when the second node receives a packet, it parses the received packet to determine its type, and when the received packet is a detection packet it feeds back a response packet to the first node. A first node may fail to receive the response packet sent by the second node in several situations: a. the communication link has failed; b. the first node has failed and did not send a detection packet; c. the second node has failed and did not send a response packet.
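The packet handling on the second node could be sketched as follows. The patent does not specify a packet format, so the JSON encoding, the `type`, `from`, and `to` fields, and the type tags are all assumptions made purely for illustration.

```python
import json
from typing import Optional

# Hypothetical packet type tags; the source does not define a wire format.
DETECTION = "detection"
RESPONSE = "response"


def handle_packet(raw: bytes, self_id: str) -> Optional[bytes]:
    """Parse a received packet, determine its type, and build a response
    packet for the sender when the packet is a detection packet."""
    packet = json.loads(raw.decode("utf-8"))
    if packet.get("type") == DETECTION:
        reply = {"type": RESPONSE, "from": self_id, "to": packet["from"]}
        return json.dumps(reply).encode("utf-8")
    return None  # other packet types are not answered in this sketch


probe = json.dumps({"type": DETECTION, "from": "A"}).encode("utf-8")
reply = json.loads(handle_packet(probe, "B"))
print(reply)  # {'type': 'response', 'from': 'B', 'to': 'A'}
```

In this sketch, any of the three situations listed above (link failure, first-node failure, second-node failure) would show up on the first node simply as a missing reply within the first time interval.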
In this embodiment, the step of determining the number of first nodes that have not received a response packet fed back by the second node can be implemented by either of the following schemes. a. When a first node does not receive a response packet within the preset first time interval, the second node is recorded as an untrusted node relative to that first node, and the identifier of the first node (such as its name or code) is recorded; the number of recorded first-node identifiers is then the number of first nodes that have not received the response packet fed back by the second node. b. When a first node does not receive a response packet within the preset first time interval, the second node is recorded as an untrusted node. Recording an untrusted node can be done in several ways. For example, a trusted node database and an untrusted node database can be established, and when the second node is recorded as an untrusted node its identifier (such as its name or code) is added to the untrusted node database. Alternatively, when the second node is recorded as an untrusted node, an untrusted mark is added to the second node; the number of times the second node has been recorded as an untrusted node is then obtained, and that number is the number of first nodes that have not received the response packet fed back by the second node.
Step S80: when the number of first nodes that have not received a response packet is greater than a preset second threshold, record the second node as a failed node and shield the second node.
A failed node cannot be used, so detecting storage unit failures in it is meaningless, and a node contains many storage units. To improve the efficiency of the storage unit failure detection method, this embodiment therefore shields the detected failed nodes so that the running identifiers of their storage units are not obtained, avoiding meaningless detection. The second threshold can be set by the user; it is preferably half the number of first nodes, so that the second node is recorded as a failed node and shielded when most of the first nodes have not received the response packet fed back by the second node.
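Steps S70 and S80 can be sketched as below, using scheme a (counting the first nodes that did not receive a response) and the preferred second threshold of half the number of first nodes. The `responses` mapping, which lists for each second node the first nodes that did receive its response within the first time interval, is a hypothetical input for illustration.

```python
from typing import Dict, List, Set


def find_failed_nodes(nodes: List[str], responses: Dict[str, Set[str]]) -> Set[str]:
    """Record a second node as failed when more than half of its first nodes
    received no response packet from it (S70 and S80)."""
    failed: Set[str] = set()
    for second in nodes:
        firsts = [n for n in nodes if n != second]
        answered = responses.get(second, set())
        unanswered = [f for f in firsts if f not in answered]
        second_threshold = len(firsts) / 2  # preferred value from the text
        if len(unanswered) > second_threshold:
            failed.add(second)
    return failed


def nodes_to_scan(nodes: List[str], failed: Set[str]) -> List[str]:
    """Shield failed nodes: their storage units are skipped by the scan."""
    return [n for n in nodes if n not in failed]


# Illustrative data: nobody received a response from B.
nodes = ["A", "B", "C", "D"]
responses = {"A": {"C", "D"}, "C": {"A", "D"}, "D": {"A", "C"}}
failed_nodes = find_failed_nodes(nodes, responses)
print(failed_nodes)                        # {'B'}
print(nodes_to_scan(nodes, failed_nodes))  # ['A', 'C', 'D']
```

Nodes A, C, and D each have one unanswered first node (B itself), which stays below the threshold of 1.5, so only B is shielded.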
The storage unit failure detection method proposed in this embodiment first detects and shields the failed nodes in the distributed file storage system before obtaining in turn the running identifiers of the storage units of each node. The running identifiers of the storage units of failed nodes are not obtained, that is, storage unit failure detection is not performed on failed nodes, which greatly improves the efficiency of storage unit failure detection.
Referring to Fig. 5, Fig. 5 is a schematic flowchart of the fifth embodiment of the storage unit failure detection method for a distributed file storage system according to the present invention.
The fifth embodiment is based on the fourth embodiment. In the fifth embodiment, after step S80 and before step S10, the storage unit failure detection method further includes:
Step S90: determine the number of failed nodes in the distributed file storage system.
Step S100: when the number of failed nodes in the distributed file storage system is less than a preset third threshold, determine that the distributed file storage system is valid.
In this embodiment, the preset third threshold is preferably half the number of nodes in the distributed file storage system. When most of the nodes in the system are unavailable (that is, most nodes are failed nodes), the system is considered no longer capable of data transfer and is determined to have failed; the system is then unavailable, and failure detection on its storage units is meaningless. Only when the number of failed nodes in the distributed file storage system is less than the third threshold is the system determined to be valid, and only then is failure detection on its storage units meaningful. After failed nodes are recorded and the distributed file storage system is determined to have failed, a maintenance request can be sent to a maintenance terminal (such as a server or a terminal carried by maintenance personnel) so that the failed nodes and the distributed file storage system are restored to normal in time.
In the storage unit failure detection method of this embodiment, before the running identifiers of the storage units of each node are obtained in turn, it is first determined whether the distributed file storage system is available. Failure detection is performed on the storage units of the nodes only when the system is available, which avoids meaningless storage unit failure detection on a distributed file storage system that has already failed.
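The ordering described in the fifth embodiment (check system validity first, then scan only the nodes that are not shielded) might be combined as in the sketch below. The function `run_detection` and the per-node `scan_node` callable are hypothetical, and the third threshold uses the preferred value of half the node count.

```python
from typing import Callable, List, Optional, Set


def run_detection(
    nodes: List[str],
    failed_nodes: Set[str],
    scan_node: Callable[[str], List[str]],
) -> Optional[List[str]]:
    """Return the failed storage units found on valid nodes, or None when
    the system itself is treated as failed and the scan is skipped."""
    third_threshold = len(nodes) / 2
    if not len(failed_nodes) < third_threshold:
        return None  # S100 not satisfied: unit-level detection is pointless
    failed_units: List[str] = []
    for node in nodes:
        if node in failed_nodes:
            continue  # shielded node: its storage units are not checked
        failed_units.extend(scan_node(node))
    return failed_units


# Illustrative per-node scan results.
per_node = {"A": ["A/u2"], "C": [], "D": []}
scan = lambda node: per_node.get(node, [])
result_ok = run_detection(["A", "B", "C", "D"], {"B"}, scan)
result_down = run_detection(["A", "B", "C", "D"], {"B", "C"}, scan)
print(result_ok)    # ['A/u2']
print(result_down)  # None
```

With two of four nodes failed, the count is not less than the threshold of 2, so the system is not determined to be valid and no storage unit scan takes place.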
The present invention also provides a storage unit failure detection device for a distributed file storage system (hereinafter referred to as the storage unit failure detection device).
Referring to Fig. 6, Fig. 6 is a functional module diagram of the first embodiment of the storage unit failure detection device for a distributed file storage system according to the present invention.
In the first embodiment, the storage unit failure detection device includes:
Acquisition module 10, configured to obtain the running identifiers of the storage units of each node in turn and, when obtaining the running identifier of a storage unit fails, to continue obtaining the running identifiers of the other storage units of the node where the storage unit is located, or to obtain in turn the running identifiers of the storage units of other nodes.
Each node in the distributed file storage system includes multiple storage units, and each storage unit, while running, has a unique corresponding running identifier (such as the process number of its running process). When the distributed file storage system is running, each node works and the storage units in the node run, so the running identifier corresponding to each storage unit in the node can be obtained; if a storage unit in the node is not running, its running identifier cannot be obtained from the node. The acquisition module 10 obtains the running identifiers of the storage units of each node in turn to determine the storage units that are running in the node, as well as those that are not running, that is, the storage units that have failed.
Recording module 20, configured to record the storage unit as a failed storage unit when obtaining its running identifier fails.
When the acquisition module 10 fails to obtain the running identifier of a storage unit, that is, when the acquisition module 10 cannot obtain the running identifier of the storage unit, the storage unit is not running, has failed, and cannot be used, so the recording module 20 records it as a failed storage unit. The acquisition module 10 then continues to obtain the running identifiers of the other storage units of the node where the storage unit is located, or the running identifiers of the storage units of other nodes. After the recording module 20 records a failed storage unit, a maintenance request can be sent to a maintenance terminal (such as a server or a terminal carried by maintenance personnel) as a prompt to repair or replace the failed storage unit in time, to ensure the reliability of the distributed file storage system.
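One way to picture the division of work between acquisition module 10 and recording module 20 is the class sketch below. The class names, the `notify` hook standing in for the maintenance request, and the `get_run_id` lookup are illustrative assumptions, not structures defined by the patent.

```python
from typing import Callable, Dict, List, Optional, Tuple


class RecordingModule:
    """Sketch of recording module 20: keeps the failed storage units and
    optionally raises a maintenance request through a hypothetical hook."""

    def __init__(self, notify: Optional[Callable[[str, str], None]] = None) -> None:
        self.failed_units: List[Tuple[str, str]] = []
        self._notify = notify

    def record(self, node: str, unit: str) -> None:
        self.failed_units.append((node, unit))
        if self._notify is not None:
            self._notify(node, unit)  # e.g. message a maintenance terminal


class AcquisitionModule:
    """Sketch of acquisition module 10: obtains running identifiers in turn
    and hands units without one to the recording module."""

    def __init__(
        self,
        get_run_id: Callable[[str, str], Optional[int]],
        recorder: RecordingModule,
    ) -> None:
        self._get_run_id = get_run_id
        self._recorder = recorder

    def scan(self, cluster: Dict[str, List[str]]) -> None:
        for node, units in cluster.items():
            for unit in units:
                if self._get_run_id(node, unit) is None:
                    self._recorder.record(node, unit)


requests: List[str] = []
recorder = RecordingModule(notify=lambda n, u: requests.append(f"maintain {n}/{u}"))
acquirer = AcquisitionModule(lambda n, u: None if u == "u2" else 4101, recorder)
acquirer.scan({"node-a": ["u1", "u2"]})
print(recorder.failed_units)  # [('node-a', 'u2')]
print(requests)               # ['maintain node-a/u2']
```

Keeping the recording logic separate from acquisition leaves room for the restart unit and recording unit of the second device embodiment to be added inside the recording module without changing the scan loop.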
The storage-unit-failure detection means that the present embodiment proposes, is obtained in each node and deposited successively by acquisition module 10 The operation mark of storage unit, to determine failed storage unit and be recorded by logging modle 20, can effectively be detected Failed storage unit in the node of distributed file storage system, so that user ties up to the memory cell of failure in time Shield, ensure that the reliability of distributed file storage system.
Reference picture 7, Fig. 7 are the storage-unit-failure detection means second embodiment of distributed file storage system of the present invention High-level schematic functional block diagram.
Scheme of the scheme of second embodiment based on first embodiment, in a second embodiment, the storage-unit-failure The logging modle 20 of detection means includes:
Unit 21 is restarted, for when the operation mark of memory cell obtains failure, restarting the memory cell;
Because some memory cell can not run the failure problems of work, can be solved by restarting, it is recovered normal fortune Row work, therefore when the operation that acquisition module 10 obtains memory cell identifies failure, restart unit 21 and restart the memory cell, So that some memory cell can recover normal operation immediately, distributed file storage system is set to keep more memory cell of trying one's best Work is run, ensures the maximum reliability of distributed file storage system operation, and reduces the maintenance workload of attendant.
a recording unit 22, configured to record the storage unit as a failed storage unit if the restart of the storage unit fails within a preset first time interval.
A storage unit whose failure can be resolved by the restart unit 21 can usually restart successfully within the specified time (the first time interval), whereas a storage unit whose failure cannot be resolved by a restart cannot restart successfully within that time. The result of the restart operation is returned once the first time interval has elapsed after the restart operation was performed on the failed storage unit. If the returned result is a restart failure (that is, the storage unit failed to restart within the preset first time interval), the recording unit 22 records the storage unit as a failed storage unit. If the returned result is a restart success (that is, the storage unit restarted successfully within the preset first time interval), the storage unit has recovered normal operation; the acquisition module 10 can now obtain its running identifier and therefore judges the storage unit to be valid. The acquisition module 10 then continues to obtain the running identifiers of the other storage units of the node where that storage unit is located, or obtains the running identifiers of the storage units of other nodes in turn.
In the storage unit failure detection device of this embodiment, when obtaining the running identifier of a storage unit fails, the restart unit 21 restarts the storage unit, so that a storage unit whose failure can be resolved by a restart recovers normal operation immediately, and the recording unit 22 records as failed storage units those whose failure a restart cannot resolve. This keeps as many storage units as possible running in the distributed file storage system, ensures its reliability, and reduces the maintenance workload of maintenance personnel.
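The restart-then-record logic can be sketched as follows. This is only an illustration under stated assumptions: the `FakeUnit` class, the polling loop, and the interval values are hypothetical, and the patent does not say how a restart is triggered or how its result is reported.

```python
import time
from typing import Callable, Optional


def check_with_restart(
    get_identifier: Callable[[], Optional[str]],
    restart: Callable[[], None],
    first_interval_s: float,
    poll_s: float = 0.01,
) -> bool:
    """Return True if the unit runs (possibly after a restart), False if it must be recorded as failed."""
    if get_identifier() is not None:
        return True
    restart()
    deadline = time.monotonic() + first_interval_s
    while True:
        if get_identifier() is not None:
            return True  # restart succeeded within the first time interval
        if time.monotonic() >= deadline:
            return False  # restart failed: caller records a failed storage unit
        time.sleep(poll_s)


class FakeUnit:
    """Hypothetical storage unit used only to exercise the sketch."""

    def __init__(self, recoverable: bool) -> None:
        self.recoverable = recoverable
        self.running = False

    def identifier(self) -> Optional[str]:
        return "run-1" if self.running else None

    def restart(self) -> None:
        self.running = self.recoverable


good, bad = FakeUnit(recoverable=True), FakeUnit(recoverable=False)
recovered = check_with_restart(good.identifier, good.restart, first_interval_s=0.05)
still_failed = check_with_restart(bad.identifier, bad.restart, first_interval_s=0.05)
print(recovered, still_failed)
```

Only units for which the function returns `False` would be recorded as failed; units that recover are treated as valid and the scan simply continues.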
Referring to Fig. 8, Fig. 8 is a functional block diagram of the third embodiment of the storage unit failure detection device of the distributed file storage system of the present invention.
The third embodiment is based on the first or second embodiment. In the third embodiment, the storage unit failure detection device further includes:
a first determining module 30, configured to obtain the number of failed storage units in the distributed file storage system;
After the acquisition module 10 has obtained, in turn, the running identifiers of the storage units of every node in the distributed file storage system, and the recording module 20 has recorded all failed storage units, the first determining module 30 determines the total number of recorded failed storage units.
a second determining module 40, configured to determine that the distributed file storage system has failed when the number of failed storage units in the distributed file storage system is greater than a first threshold.
The preset first threshold is preferably half of the total number of storage units of all valid (non-failed) nodes in the distributed file storage system. When the number of failed storage units in the distributed file storage system exceeds the first threshold, the second determining module 40 considers that data transmission and access in the distributed file storage system are prone to anomalies (for example, data cannot be accessed or the accessed data is incorrect) and that the reliability of the system's data is low. It therefore determines that the distributed file storage system has failed, and the system stops running.
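As a small worked example of the first-threshold rule, the sketch below uses the preferred value from the text (half of the storage units on valid nodes). The function name and the sample counts are illustrative assumptions.

```python
def system_failed_by_units(failed_units: int, units_on_valid_nodes: int) -> bool:
    """Apply the first threshold: more than half of the units on valid nodes have failed."""
    first_threshold = units_on_valid_nodes / 2
    return failed_units > first_threshold


# With 12 storage units on valid nodes the threshold is 6.
print(system_failed_by_units(7, 12))
print(system_failed_by_units(6, 12))
```

The comparison is strict, matching the wording "greater than the first threshold": exactly half of the units failing does not yet mark the system as failed.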
Referring to Fig. 9, Fig. 9 is a functional block diagram of the fourth embodiment of the storage unit failure detection device of the distributed file storage system of the present invention.
The fourth embodiment is based on any one of the first to third embodiments. In the fourth embodiment, the storage unit failure detection device further includes:
a control module 50, configured to control the nodes in the distributed file storage system to send detection packets to one another;
In this embodiment, the control module 50 can control the nodes to send detection packets to one another, so that the nodes in the distributed file storage system check one another's running status.
a node validity detection module 60, configured to take each node in the distributed file storage system in turn as the second node, with the other nodes as first nodes, to determine the validity of the second node;
For example, suppose the distributed file storage system has four nodes A, B, C and D. When the node validity detection module 60 takes node B as the second node, nodes A, C and D are the first nodes, and the module judges whether node B is valid. After that, it can go on to judge whether node C is valid according to a preset order, and so on until all nodes have been checked.
a third determining module 70, configured to determine, within a preset first time interval, the number of first nodes that have not received a response data packet, the response data packet being fed back by the second node in response to the detection packet sent by the first node;
In this embodiment, when the second node receives a data packet, it parses the packet to determine its type, and when the received packet is a detection packet, it feeds back a response data packet to the first node. Because communication links can fail, there are several situations in which a first node does not receive the response data packet from the second node: a. the communication link fails; b. the first node fails and does not send the detection packet; c. the second node fails and does not send the response data packet.
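The packet handling on the second node can be sketched as below. The JSON encoding, the field names `type`, `from` and `to`, and the type strings are assumptions made for illustration; the patent does not define a wire format.

```python
import json
from typing import Optional

# Assumed packet types; the source does not define a wire format.
DETECT = "detect"
RESPONSE = "response"


def handle_packet(raw: bytes, my_id: str) -> Optional[bytes]:
    """Parse a received packet and answer detection packets with a response packet."""
    packet = json.loads(raw.decode("utf-8"))
    if packet.get("type") == DETECT:
        reply = {"type": RESPONSE, "from": my_id, "to": packet["from"]}
        return json.dumps(reply).encode("utf-8")
    # Any other packet type is outside this sketch and gets no response.
    return None


probe = json.dumps({"type": DETECT, "from": "A"}).encode("utf-8")
reply = handle_packet(probe, my_id="B")
other = handle_packet(json.dumps({"type": "data", "from": "A"}).encode("utf-8"), my_id="B")
print(reply, other)
```

From the first node's point of view, all three failure situations listed above look the same: no response packet arrives before the first time interval expires.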
In this embodiment, the third determining module 70 can determine the number of first nodes that have not received the response data packet fed back by the second node in either of the following ways. a. When a first node does not receive the response data packet within the preset first time interval, the second node is recorded as an untrusted node relative to that first node, and the identifier of the first node (such as its name or code) is recorded; the number of recorded first-node identifiers is then the number of first nodes that have not received the response data packet fed back by the second node. b. When a first node does not receive the response data packet within the preset first time interval, the second node is recorded as an untrusted node. Recording an untrusted node can be implemented in several ways. For example, a trusted node database and an untrusted node database may be established, and when the second node is recorded as an untrusted node, its identifier (such as its name or code) is added to the untrusted node database. Alternatively, when the second node is recorded as an untrusted node, an untrusted mark is added to the second node. The number of times the second node has been recorded as an untrusted node is then obtained and taken as the number of first nodes that have not received the response data packet fed back by the second node.
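Scheme a can be sketched in a few lines. The `received` mapping is an assumed bookkeeping structure (for each first node, the set of nodes whose response arrived in time); it is not part of the patent text.

```python
from typing import Dict, List, Set


def first_nodes_without_response(
    second: str, nodes: List[str], received: Dict[str, Set[str]]
) -> Set[str]:
    """Return the identifiers of first nodes that got no response from `second`.

    `received[first]` is the set of nodes whose response packet reached `first`
    before the first time interval expired (an assumed bookkeeping structure).
    """
    return {
        first
        for first in nodes
        if first != second and second not in received.get(first, set())
    }


nodes = ["A", "B", "C", "D"]
received = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "D"},  # C never heard back from B
    "D": {"A", "C"},  # D never heard back from B
}
missing_for_b = first_nodes_without_response("B", nodes, received)
print(sorted(missing_for_b), len(missing_for_b))
```

The size of the returned set is the count that is later compared with the second threshold; scheme b would keep only that count rather than the identifiers.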
a shielding module 80, configured to record the second node as a failed node and shield the second node when the number of first nodes that have not received the response data packet is greater than a preset second threshold.
A failed node cannot be used, so detecting storage unit failures in a failed node is pointless, and a node contains many storage units. To improve the efficiency of storage unit failure detection, the shielding module 80 shields each failed node that is detected, so that the running identifiers of its storage units are not obtained and pointless detection is avoided. The second threshold can be set by the user and is preferably half of the number of first nodes, so that the shielding module 80 records the second node as a failed node and shields it when the majority of first nodes have not received the response data packet fed back by the second node.
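The shielding decision can be illustrated as follows, using the preferred second threshold from the text (half of the number of first nodes). The input dictionary of per-node missing-response counts and the function name are hypothetical.

```python
from typing import Dict, List


def shield_failed_nodes(missing_counts: Dict[str, int]) -> List[str]:
    """Return the nodes whose missing-response count exceeds the second threshold."""
    # Each node is judged by all the others, so there are len - 1 first nodes.
    second_threshold = (len(missing_counts) - 1) / 2
    return [node for node, missing in missing_counts.items() if missing > second_threshold]


# Number of first nodes that got no response from each node acting as the second node.
missing_counts = {"A": 0, "B": 3, "C": 1, "D": 0}
shielded = shield_failed_nodes(missing_counts)
to_scan = [node for node in missing_counts if node not in shielded]
print(shielded, to_scan)
```

Only the nodes left in `to_scan` would have the running identifiers of their storage units queried, which is where the efficiency gain comes from.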
With the storage unit failure detection device proposed in this embodiment, before the acquisition module 10 obtains the running identifiers of the storage units of each node in turn, the failed nodes in the distributed file storage system are detected first and shielded by the shielding module 80. The running identifiers of the storage units of failed nodes are not obtained, that is, no storage unit failure detection is performed on failed nodes, which greatly improves the efficiency of storage unit failure detection.
Referring to Fig. 10, Fig. 10 is a functional block diagram of the fifth embodiment of the storage unit failure detection device of the distributed file storage system of the present invention.
The fifth embodiment is based on the fourth embodiment. In the fifth embodiment, the storage unit failure detection device further includes:
a fourth determining module 90, configured to determine the number of failed nodes in the distributed file storage system;
a fifth determining module 100, configured to determine that the distributed file storage system is valid when the number of failed nodes in the distributed file storage system is less than a preset third threshold.
In this embodiment, the preset third threshold is preferably half of the number of nodes in the distributed file storage system. When most nodes in the distributed file storage system are unavailable (that is, most nodes are failed nodes), the fifth determining module 100 considers that the distributed file storage system can no longer transmit data and determines that it has failed. The distributed file storage system is then unavailable, and detecting failures of its storage units is pointless. Only when the number of failed nodes in the distributed file storage system is less than the third threshold does the fifth determining module 100 determine that the distributed file storage system is valid, and only then is failure detection of its storage units meaningful. After failed nodes are recorded and the distributed file storage system is determined to have failed, a maintenance request may be sent to a maintenance terminal (such as a server or a terminal carried by maintenance personnel) so that the failed nodes and the distributed file storage system are restored to normal in time.
In the storage unit failure detection device of this embodiment, before the acquisition module 10 obtains the running identifiers of the storage units of each node in turn, the fifth determining module 100 first determines whether the distributed file storage system is available. Failure detection is performed on the storage units of the nodes only when the distributed file storage system is determined to be available, which avoids pointless storage unit failure detection on a distributed file storage system that has already failed.
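The gating step of this embodiment can be sketched as follows, using the preferred third threshold (half of the node count). The function names and the `scan` callback are illustrative assumptions, not the patent's implementation.

```python
from typing import Callable, List, Optional


def system_is_valid(failed_nodes: int, total_nodes: int) -> bool:
    """Apply the third threshold: fewer than half of the nodes have failed."""
    third_threshold = total_nodes / 2
    return failed_nodes < third_threshold


def run_detection(
    failed_nodes: int, total_nodes: int, scan: Callable[[], List[str]]
) -> Optional[List[str]]:
    """Run the storage unit scan only when the system as a whole is still valid."""
    if not system_is_valid(failed_nodes, total_nodes):
        return None  # system considered failed: skip the pointless unit scan
    return scan()


print(run_detection(1, 4, lambda: ["A/d2"]))
print(run_detection(2, 4, lambda: ["A/d2"]))
```

With four nodes the threshold is two, so one failed node still allows the scan, while two or more failed nodes cause it to be skipped.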
The above are only preferred embodiments of the present invention and do not limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims (8)

  1. A storage unit failure detection method for a distributed file storage system, characterized in that the storage unit failure detection method of the distributed file storage system comprises the following steps:
    obtaining the running identifiers of the storage units of each node in turn;
    when obtaining the running identifier of a storage unit fails, recording the storage unit as a failed storage unit, and continuing to obtain the running identifiers of the other storage units of the node where the storage unit is located, or obtaining the running identifiers of the storage units of other nodes in turn;
    wherein, before the step of obtaining the running identifiers of the storage units of each node in turn, the storage unit failure detection method of the distributed file storage system further comprises:
    controlling the nodes in the distributed file storage system to send detection packets to one another;
    taking each node in the distributed file storage system in turn as a second node, with the other nodes as first nodes, to determine the validity of the second node;
    determining, within a preset first time interval, the number of first nodes that have not received a response data packet, the response data packet being fed back by the second node in response to the detection packet sent by the first node;
    when the number of first nodes that have not received the response data packet is greater than a preset second threshold, recording the second node as a failed node and shielding the second node.
  2. The storage unit failure detection method of the distributed file storage system according to claim 1, characterized in that recording the storage unit as a failed storage unit when obtaining the running identifier of the storage unit fails comprises:
    when obtaining the running identifier of the storage unit fails, restarting the storage unit;
    if the restart of the storage unit fails within a preset first time interval, recording the storage unit as a failed storage unit.
  3. The storage unit failure detection method of the distributed file storage system according to claim 1, characterized in that, after the step of recording the storage unit as a failed storage unit when obtaining the running identifier of the storage unit fails, and continuing to obtain in turn the running identifiers of the other storage units in the node, or obtaining the running identifiers of the storage units of other nodes in turn, the storage unit failure detection method of the distributed file storage system further comprises:
    determining the number of failed storage units in the distributed file storage system;
    when the number of failed storage units in the distributed file storage system is greater than a first threshold, determining that the distributed file storage system has failed.
  4. The storage unit failure detection method of the distributed file storage system according to claim 1, characterized in that, after the step of recording the second node as a failed node and shielding the second node when the number of first nodes that have not received the response data packet is greater than the preset second threshold, the storage unit failure detection method of the distributed file storage system further comprises:
    determining the number of failed nodes in the distributed file storage system;
    when the number of failed nodes in the distributed file storage system is less than a preset third threshold, determining that the distributed file storage system is valid.
  5. A storage unit failure detection device for a distributed file storage system, characterized in that the storage unit failure detection device of the distributed file storage system comprises:
    an acquisition module, configured to obtain the running identifiers of the storage units of each node in turn, and, when obtaining the running identifier of a storage unit fails, to continue to obtain the running identifiers of the other storage units of the node where the storage unit is located, or to obtain the running identifiers of the storage units of other nodes in turn;
    a recording module, configured to record the storage unit as a failed storage unit when obtaining the running identifier of the storage unit fails;
    wherein the storage unit failure detection device of the distributed file storage system further comprises:
    a control module, configured to control the nodes in the distributed file storage system to send detection packets to one another;
    a node validity detection module, configured to take each node in the distributed file storage system in turn as a second node, with the other nodes as first nodes, to determine the validity of the second node;
    a third determining module, configured to determine, within a preset first time interval, the number of first nodes that have not received a response data packet, the response data packet being fed back by the second node in response to the detection packet sent by the first node;
    a shielding module, configured to record the second node as a failed node and shield the second node when the number of first nodes that have not received the response data packet is greater than a preset second threshold.
  6. The storage unit failure detection device of the distributed file storage system according to claim 5, characterized in that the recording module comprises:
    a restart unit, configured to restart the storage unit when obtaining the running identifier of the storage unit fails;
    a recording unit, configured to record the storage unit as a failed storage unit if the restart of the storage unit fails within a preset first time interval.
  7. The storage unit failure detection device of the distributed file storage system according to claim 5, characterized in that the storage unit failure detection device of the distributed file storage system further comprises:
    a first determining module, configured to obtain the number of failed storage units in the distributed file storage system;
    a second determining module, configured to determine that the distributed file storage system has failed when the number of failed storage units in the distributed file storage system is greater than a first threshold.
  8. The storage unit failure detection device of the distributed file storage system according to claim 5, characterized in that the storage unit failure detection device of the distributed file storage system further comprises:
    a fourth determining module, configured to determine the number of failed nodes in the distributed file storage system;
    a fifth determining module, configured to determine that the distributed file storage system is valid when the number of failed nodes in the distributed file storage system is less than a preset third threshold.
CN201410333913.5A 2014-07-14 2014-07-14 The storage-unit-failure detection method and device of distributed file storage system Active CN104158843B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410333913.5A CN104158843B (en) 2014-07-14 2014-07-14 The storage-unit-failure detection method and device of distributed file storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410333913.5A CN104158843B (en) 2014-07-14 2014-07-14 The storage-unit-failure detection method and device of distributed file storage system

Publications (2)

Publication Number Publication Date
CN104158843A CN104158843A (en) 2014-11-19
CN104158843B true CN104158843B (en) 2018-01-12

Family

ID=51884248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410333913.5A Active CN104158843B (en) 2014-07-14 2014-07-14 The storage-unit-failure detection method and device of distributed file storage system

Country Status (1)

Country Link
CN (1) CN104158843B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105912446A (en) * 2016-04-29 2016-08-31 深圳市永兴元科技有限公司 Failure detection processing method and system for distributed data system
CN105975212A (en) * 2016-04-29 2016-09-28 深圳市永兴元科技有限公司 Failure detection processing method and device for distributed data system
CN106649555A (en) * 2016-11-08 2017-05-10 深圳市中博睿存科技有限公司 Memory unit state marking method and distributed memory system
CN109213637B (en) * 2018-11-09 2022-03-04 浪潮电子信息产业股份有限公司 Data recovery method, device and medium for cluster nodes of distributed file system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101465880A (en) * 2007-12-18 2009-06-24 卢森特技术有限公司 Reliable storage of data in a distributed storage system
CN102571845A (en) * 2010-12-20 2012-07-11 南京中兴新软件有限责任公司 Data storage method and device of distributed storage system
CN103455395A (en) * 2013-08-08 2013-12-18 华为技术有限公司 Method and device for detecting hard disk failures
CN103490919A (en) * 2013-09-02 2014-01-01 用友软件股份有限公司 Fault management system and fault management method
CN103500140A (en) * 2013-09-27 2014-01-08 浪潮电子信息产业股份有限公司 Method for rapidly learning invalidation of distributed cluster nodes

Also Published As

Publication number Publication date
CN104158843A (en) 2014-11-19

Similar Documents

Publication Publication Date Title
CN110807064B (en) Data recovery device in RAC distributed database cluster system
CN104158843B (en) The storage-unit-failure detection method and device of distributed file storage system
CN102902615B (en) A kind of Lustre parallel file system false alarm method and system thereof
CN108429629A (en) Equipment fault restoration methods and device
CN105095008B (en) A kind of distributed task scheduling fault redundance method suitable for group system
CN106933693A (en) A kind of data-base cluster node failure self-repairing method and system
CN111327685A (en) Data processing method, device and equipment of distributed storage system and storage medium
CN111478796A (en) Cluster capacity expansion exception handling method for AI platform
CN105915426A (en) Failure recovery method and device of ring network
CN103995901B (en) A kind of method for determining back end failure
CN105812161A (en) Controller fault backup method and system
CN106776251A (en) A kind of monitoring data processing unit and method
CN114490565A (en) Database fault processing method and device
CN107656847A (en) Node administration method, system, device and storage medium based on distributed type assemblies
CN107483238A (en) A kind of blog management method, cluster management node and system
US20130090760A1 (en) Apparatus and method for managing robot components
CN106557380A (en) For the method that keeps server stable and its system
CN106682040A (en) Data management method and device
CN111309515B (en) Disaster recovery control method, device and system
JP2009025971A (en) Information processor and log data collection system
CN116737444A (en) Database server fault processing method and system
JP2010009127A (en) Management program and management device
JP2015176168A (en) Administration server, fault restoration method, and computer program
KR20130042438A (en) Method and apparatus for managing rfid resource
US9348701B2 (en) Method and apparatus for failure recovery in a machine-to-machine network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20141119

Assignee: Liu Yi

Assignor: Shenzhen Zhongbo Kechuang Information Technology Co., Ltd.

Contract record no.: 2014440020487

Denomination of invention: Storage unit invalidation detecting method and device for distributed file storage system

License type: Common License

Record date: 20141230

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EC01 Cancellation of recordation of patent licensing contract

Assignee: Liu Yi

Assignor: Shenzhen Zhongbo Kechuang Information Technology Co., Ltd.

Contract record no.: 2014440020487

Date of cancellation: 20161025

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190904

Address after: 100089 Floor 1-4, No. 2 Building, No. 9 Courtyard, Dijin Road, Haidian District, Beijing

Patentee after: Beijing Toyou Feiji Electronics Co., Ltd.

Address before: 518000 Room 1402, Feiyada Science and Technology Building, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen Zhongbo Kechuang Information Technology Co., Ltd.