CN107273231A - Hard disk hang fault detection and handling methods and devices for a distributed storage system - Google Patents

Hard disk hang fault detection and handling methods and devices for a distributed storage system

Info

Publication number
CN107273231A
Authority
CN
China
Prior art keywords
hard disk
target hard
target
failure
tangled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610212740.0A
Other languages
Chinese (zh)
Inventor
王勇
赵树起
朱家稷
董乘宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610212740.0A
Priority to TW106107797A (TW201737111A)
Priority to PCT/CN2017/077995 (WO2017173927A1)
Publication of CN107273231A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a hard disk hang fault detection method, a handling method, and corresponding devices for a distributed storage system. Whether a target hard disk has a hang fault is determined from the execution time of the access requests directed at that disk, so the fault can be found promptly. Once a hang fault is found, the faulty disk is marked with a hang fault state so that it is not accessed again, and the system resources occupied on behalf of the faulty disk are cleaned up so that other processes can be allocated those resources. This reduces the adverse effects of the hang fault and limits the damage. The detection and handling scheme requires neither a detection tool supplied by the disk vendor, nor new hardware on the disk, nor human intervention; it is simple to implement and does not affect the production or usage cost of the hard disk.

Description

Hard disk hang fault detection and handling methods and devices for a distributed storage system
Technical field
The present invention relates to the field of computer technology, and in particular to a hard disk hang fault detection method, a handling method, and corresponding devices for a distributed storage system.
Background
A distributed storage system is a storage system built on top of local file systems that spreads data across multiple hard disks. In such a system, a fault may occur anywhere on the link from the local file system to the inside of each hard disk. One such fault is the hard disk hang (hang up) fault: the hard disk cannot respond to normal operations, and none of the input/output operations on that disk can terminate, because the link as a whole never replies. If a hung hard disk is not handled properly, the entire access process may stop responding, so that none of the data managed by that process can be accessed, front-end request latency rises, system load increases, and data availability decreases. Detecting hard disk hang faults promptly and reducing their impact is therefore a key issue in guaranteeing the performance of a distributed storage system.
Existing methods for handling a hard disk hang fault mainly include the following four: (1) using a tool supplied by the disk vendor to send an offline command to the hard disk, which stops working after receiving the command, so that accesses to the disk can return and the hang state ends; (2) stopping the hard disk with a hardware switch, typically a component added to an existing hard disk that pulls the disk voltage down directly and powers the disk off, which ends the hang state; (3) restarting the machine, which resets the disk state but only possibly clears the hang state; (4) restarting the process directly, so that the new process can avoid using the hung hard disk.
However, all of these methods have drawbacks, including dependence on additional auxiliary tools and reduced availability of system resources. Specifically, method (1) depends on a tool supplied by the disk vendor and does not work when the hard disk cannot receive the offline command, so its success rate in practice is low. Method (2) requires new hardware on the hard disk (the hardware switch), which raises the cost of developing and maintaining the disk and limits the method's range of application. Method (3) introduces human intervention; while the machine restarts, the availability of both the machine and the storage system is reduced, and the restart itself may fail. Even if the restart succeeds, the storage system must still be able to avoid using the hung hard disk, which places higher demands on the storage system. In method (4), the original process cannot release its memory because one of its threads is hung, so system memory usage stays high and the resources available to the system are reduced even after the restart. A hard disk hang fault handling method with a high success rate, a wide range of application, and little impact on system availability is therefore urgently needed.
Summary of the invention
The first technical problem to be solved by the present application is to detect hard disk hang faults in a distributed storage system automatically, without relying on auxiliary tools. To this end, the application provides a hard disk hang fault detection method and device for a distributed storage system.
A first aspect of the application provides a hard disk hang fault detection method for a distributed storage system, comprising:
detecting the execution time of each access request corresponding to a target hard disk;
judging whether there is a lagging request, that is, a request whose execution time exceeds the corresponding preset threshold;
if a lagging request exists, determining that the target hard disk has a hang fault.
With reference to the first aspect, in a first possible implementation of the first aspect, the fault detection method further comprises:
creating an IO thread group corresponding to the target hard disk;
reading and processing, by the IO thread group, each access request corresponding to the target hard disk, so as to complete the read and write operations on the target hard disk.
With reference to the first aspect or its first possible implementation, in a second possible implementation of the first aspect, detecting the execution time of each access request corresponding to the target hard disk comprises:
detecting the execution time of the access request at the head of the input queue of the target hard disk.
A second aspect of the application provides a hard disk hang fault detection device for a distributed storage system, comprising:
a detection unit, configured to detect the execution time of each access request corresponding to a target hard disk;
a comparing unit, configured to judge whether there is a lagging request whose execution time exceeds the corresponding preset threshold and, if a lagging request exists, to determine that the target hard disk has a hang fault.
With reference to the second aspect, in a first possible implementation of the second aspect, the fault detection device further comprises:
a process management unit, configured to create an IO thread group corresponding to the target hard disk, and to read and process, by the IO thread group, each access request corresponding to the target hard disk, so as to complete the read and write operations on the target hard disk.
With reference to the second aspect or its first possible implementation, in a second possible implementation of the second aspect, to detect the execution time of each access request corresponding to the target hard disk, the detection unit is specifically configured to:
detect the execution time of the access request at the head of the input queue of the target hard disk.
As the above technical solution shows, the embodiment of the present application judges whether a target hard disk has a hang fault by detecting the execution time of the access requests corresponding to that disk, so the fault can be found promptly. This detection approach requires neither a detection tool supplied by the disk vendor, nor new hardware on the disk, nor human intervention; it is simple to implement and does not affect the production or usage cost of the hard disk.
The second technical problem to be solved by the present application is to handle hard disk hang faults in a distributed storage system automatically, without relying on auxiliary tools. To this end, the application provides a hard disk hang fault handling method and device for a distributed storage system.
A third aspect of the application provides a hard disk hang fault handling method for a distributed storage system, comprising:
when a target hard disk has a hang fault, marking the state of the target hard disk as the hang fault state;
cleaning up the system resources occupied by the hung management process corresponding to the target hard disk, so that a new management process for managing the target hard disk can be started.
With reference to the third aspect, in a first possible implementation of the third aspect, cleaning up the system resources occupied by the hung management process corresponding to the target hard disk comprises:
allocating new memory, and performing the following two steps in the new memory to release the memory resources occupied by the hung management process:
finding all memory segments occupied by the hung process;
releasing the memory mapping corresponding to each memory segment.
With reference to the third aspect or its first possible implementation, in a second possible implementation of the third aspect, the fault handling method further comprises:
before cleaning up the system resources occupied by the hung management process corresponding to the target hard disk, ejecting each access request cached in the input queue of the target hard disk, and returning the fault information of the target hard disk.
With reference to the third aspect or its first possible implementation, in a third possible implementation of the third aspect, the fault handling method further comprises:
each time the management process of the target hard disk is started, determining the state of the target hard disk;
if the state of the target hard disk is the hang fault state, forbidding access to the target hard disk.
With reference to the third aspect or its first possible implementation, in a fourth possible implementation of the third aspect, the fault handling method further comprises:
saving the hang fault state of the target hard disk to a normal hard disk.
A fourth aspect of the application provides a hard disk hang fault handling device for a distributed storage system, comprising:
a state management unit, configured to mark the state of a target hard disk as the hang fault state when the target hard disk has a hang fault;
a resource cleanup unit, configured to clean up the system resources occupied by the hung management process corresponding to the target hard disk, so that a new management process for managing the target hard disk can be started.
With reference to the fourth aspect, in a first possible implementation of the fourth aspect, to clean up the system resources occupied by the hung management process of the target hard disk, the resource cleanup unit is specifically configured to allocate new memory and to perform the following two steps in the new memory to release the memory resources occupied by the hung management process: find all memory segments occupied by the hung process, and release the memory mapping corresponding to each memory segment.
With reference to the fourth aspect or its first possible implementation, in a second possible implementation of the fourth aspect, the fault handling device further comprises:
a request cleanup unit, configured to eject each access request cached in the input queue of the target hard disk, and to return the fault information of the target hard disk.
With reference to the fourth aspect or its first possible implementation, in a third possible implementation of the fourth aspect, the fault handling device further comprises:
an availability supervision unit, configured to determine the state of the target hard disk each time the management process of the target hard disk is started, and to forbid access to the target hard disk when its state is the hang fault state.
With reference to the fourth aspect or its first possible implementation, in a fourth possible implementation of the fourth aspect, the state management unit is further configured to save the hang fault state of the target hard disk to a normal hard disk.
As the above technical solution shows, after a target hard disk is found to have a hang fault, the embodiment of the present application marks the disk's state so that the faulty disk is not accessed again, and cleans up the system resources occupied on behalf of the faulty disk so that other processes can be allocated those resources. This reduces the adverse effects of the hang fault and limits the damage. The hang fault handling scheme requires neither a detection tool supplied by the disk vendor, nor new hardware on the disk, nor human intervention; it is simple to implement and does not affect the production or usage cost of the hard disk.
It should be understood that the general description above and the detailed description below are only exemplary and explanatory and do not limit the application.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. A person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a hard disk hang fault detection method for a distributed storage system according to an exemplary embodiment of the present application.
Fig. 2 is a schematic diagram of the access request processing flow in a data storage node of a distributed storage system according to an exemplary embodiment of the present application.
Fig. 3 is a flowchart of a hard disk hang fault handling method for a distributed storage system according to an exemplary embodiment of the present application.
Fig. 4 is a flowchart of another hard disk hang fault handling method for a distributed storage system according to an exemplary embodiment of the present application.
Fig. 5 is a sequence diagram of the hard disk hang fault detection and handling methods for a distributed storage system according to an exemplary embodiment of the present application.
Fig. 6 is a structural block diagram of a hard disk hang fault detection device for a distributed storage system according to an exemplary embodiment of the present application.
Fig. 7 is a structural block diagram of a hard disk hang fault handling device for a distributed storage system according to an exemplary embodiment of the present application.
Fig. 8 is a structural block diagram of another hard disk hang fault handling device for a distributed storage system according to an exemplary embodiment of the present application.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the application. They are only examples of devices and methods consistent with some aspects of the application, as detailed in the appended claims.
To give a thorough understanding of the application, the following detailed description mentions numerous specific details, but those skilled in the art will understand that the application can be practiced without them. In other embodiments, well-known methods, processes, components, and circuits are not described in detail, so as not to obscure the embodiments unnecessarily.
Fig. 1 is a schematic flowchart of a hard disk hang fault detection method for a distributed storage system according to an exemplary embodiment of the present application. As shown in Fig. 1, the detection method comprises:
S101: detect the execution time of each access request corresponding to the target hard disk.
The access requests may include read requests (Output), which read data from the target hard disk, and write requests (Input), which write or modify data on the target hard disk. They are scheduled and executed uniformly by the management process of the target hard disk.
S102: judge whether there is a lagging request, that is, a request whose execution time exceeds the corresponding preset threshold.
S103: if a lagging request exists, determine that the target hard disk has a hang fault.
In practice, whatever causes a hard disk to hang (for example hardware damage or read/write overload), the direct symptoms at least include that the access requests currently executing against the disk do not finish for a long time. The embodiment of the present application therefore judges whether the target hard disk has a hang fault by detecting the execution time of its access requests, so the fault can be found and handled promptly. Moreover, the detection method can be configured to run automatically inside the management process of the target hard disk. It requires neither a detection tool supplied by the disk vendor, nor new hardware on the disk, nor human intervention; it is simple to implement and does not affect the production or usage cost of the hard disk.
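Steps S101 to S103 can be illustrated with a minimal Python sketch. It assumes that the management process keeps a record of every request that has started but not yet finished; the names `InFlightRequest`, `find_lagging`, and `disk_is_hung` are hypothetical and do not come from the patent.

```python
import time
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class InFlightRequest:
    """A request that has started executing but has not finished (hypothetical type)."""
    request_id: int
    started_at: float   # monotonic timestamp taken when execution began
    threshold_s: float  # preset threshold that applies to this request


def find_lagging(requests: Iterable[InFlightRequest],
                 now: Optional[float] = None) -> List[InFlightRequest]:
    """S101/S102: measure each execution time and keep those above the threshold."""
    now = time.monotonic() if now is None else now
    return [r for r in requests if now - r.started_at > r.threshold_s]


def disk_is_hung(requests: Iterable[InFlightRequest],
                 now: Optional[float] = None) -> bool:
    """S103: the disk is treated as hung if at least one lagging request exists."""
    return bool(find_lagging(requests, now))
```

Passing `now` explicitly makes the check deterministic for testing; in a running system the default monotonic clock would be used.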
In one possible embodiment of the application, the above detection method may further include the following steps:
S104: set up a corresponding IO thread group for the target hard disk.
S105: read and process, by the IO thread group, each access request corresponding to the target hard disk, so as to complete the read and write operations on the target hard disk.
To make the hard disk easier to manage and to provide read/write (I/O) service for it, the embodiment sets up a dedicated IO thread group for the target hard disk. That is, a group of IO threads is created in the management process of the target hard disk; the threads share the system resources of that management process and are used only to process access requests for that disk. In the prior art, a single group of IO threads serves the read and write operations of all hard disks at once, or user threads serve them directly. By setting up a separate IO thread group for each hard disk, the embodiment prevents a non-disk fault, or a fault of one particular disk, from hanging the IO threads and thereby affecting reads and writes on all disks. In addition, the threads in an IO thread group execute different access requests in parallel. This improves request processing efficiency, and thus the read/write speed of the disk, and it also means that when one thread hangs while executing a request, the other threads are unaffected and can continue processing other requests.
Accordingly, with such an IO thread group, detecting the execution time of each access request in step S101 may specifically be detecting the time the IO thread group takes to execute each access request.
Further, in another possible embodiment of the application, the detection method may also include setting up a corresponding input queue for each target disk.
Fig. 2 is a schematic flowchart of access request (I/O request) processing in any data storage node Y of the embodiment. For a disk X in data storage node Y, a group of IO threads is set up, numbered T1 to Tn in Fig. 2 for ease of distinction. Each IO thread has its own input queue, so the n IO queues numbered Q1 to Qn in Fig. 2 correspond one-to-one with the IO threads. After data storage node Y receives an I/O request from a client, it first processes the request to determine which part of the data on which disk the request accesses, and it puts different I/O requests for the same data on the same disk into the same IO queue. Access to the same data is thus serialized, which prevents two I/O requests from accessing the same data at the same time. Data storage node Y may also have an IO thread (T0) that is not bound to any disk, with a corresponding IO queue (Q0), for operations that concern node Y as a whole. A complete distributed storage system may include many data storage nodes in parallel with node Y, and each of them can process I/O requests with the flow shown in Fig. 2.
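The per-disk thread group and queue layout of Fig. 2 can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the class name `DiskIOGroup`, the `data_key` string that identifies a piece of data, and the use of a CRC32 hash to pick a queue are all hypothetical choices. The patent only requires that requests for the same data on the same disk land in the same queue.

```python
import queue
import threading
import zlib


class DiskIOGroup:
    """One group of IO threads (T1..Tn) for a single disk, each with its own queue (Q1..Qn)."""

    def __init__(self, disk_id, n_threads, handler):
        self.disk_id = disk_id
        self.handler = handler  # callable(disk_id, data_key, request), supplied by the caller
        self.queues = [queue.Queue() for _ in range(n_threads)]
        self.threads = [
            threading.Thread(target=self._run, args=(q,), daemon=True)
            for q in self.queues
        ]
        for t in self.threads:
            t.start()

    def queue_index(self, data_key):
        # A stable hash sends every request for the same data to the same queue,
        # so accesses to that data are serialized by a single thread.
        return zlib.crc32(data_key.encode("utf-8")) % len(self.queues)

    def submit(self, data_key, request):
        self.queues[self.queue_index(data_key)].put((data_key, request))

    def _run(self, q):
        while True:
            data_key, request = q.get()
            try:
                self.handler(self.disk_id, data_key, request)
            finally:
                q.task_done()

    def join(self):
        """Block until every queued request has been handled."""
        for q in self.queues:
            q.join()
```

Because each queue is drained by exactly one thread, two requests for the same `data_key` never run concurrently, while requests for different data can proceed in parallel on other threads of the group.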
Accordingly, detecting the execution time of each access request corresponding to the target hard disk in step S101 may specifically be: detecting the execution time of the access request at the head of the input queue of the target hard disk.
The input queue can be a first-in, first-out (First Input First Output, FIFO) queue. Taking IO queue Q1 in Fig. 2 as an example, access requests are stored into Q1 in order of arrival, so the earlier a request enters Q1, the closer it is to the head of Q1. The corresponding IO thread T1 reads one access request from the head of Q1 at a time and executes it, completing the corresponding disk operation H1. Each time T1 reads the request at the head of the queue, it starts timing the execution of that request and continues until the request finishes. If the timer reaches the preset threshold and the request still has not finished, the execution time of the request has exceeded the threshold. The request can then be judged to be a lagging request and the IO thread T1 to be hung, from which it can be determined that disk X has a hang fault.
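The head-of-queue timing described for Q1 and T1 can be sketched as below. The sketch assumes a separate watchdog periodically calls `lagging_request`; the class name `HeadTimer` and the injectable `clock` parameter are hypothetical and are used here only so that the behaviour can be checked without waiting in real time.

```python
import collections
import threading
import time


class HeadTimer:
    """FIFO input queue that times the request currently being executed."""

    def __init__(self, threshold_s, clock=time.monotonic):
        self.threshold_s = threshold_s
        self.clock = clock
        self.fifo = collections.deque()
        self._lock = threading.Lock()
        self._current = None  # (request, start time) of the in-flight request

    def put(self, request):
        with self._lock:
            self.fifo.append(request)

    def take_head(self):
        """Called by the IO thread: pop the head request and start its timer."""
        with self._lock:
            request = self.fifo.popleft()
            self._current = (request, self.clock())
            return request

    def done(self):
        """Called by the IO thread when the request has finished executing."""
        with self._lock:
            self._current = None

    def lagging_request(self):
        """Called by a watchdog: return the in-flight request once its timer
        has reached the threshold, otherwise None."""
        with self._lock:
            if self._current is None:
                return None
            request, started = self._current
            if self.clock() - started >= self.threshold_s:
                return request
            return None
```

A non-None result from `lagging_request` corresponds to the situation in which T1 is judged to be hung and disk X is judged to have a hang fault.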
As can be seen, the embodiment executes the access requests of the target hard disk in the order in which they leave its input queue and times each execution. It can therefore obtain the execution time of each access request accurately, find lagging requests promptly, and determine that the disk has a hang fault, which lays the foundation for handling the hung disk in time.
The embodiment of the present application also provides a hard disk hang fault handling method for a distributed storage system. Fig. 3 shows a flowchart of this handling method.
As shown in Fig. 3, the hang fault handling method comprises the following steps:
S201: when the target hard disk has a hang fault, mark the state of the target hard disk as the hang fault state.
S203: clean up the system resources occupied by the hung management process corresponding to the target hard disk.
When a hard disk is judged to have a hang fault, whether by the detection method above or by another feasible detection method, the handling method of this embodiment can then be executed. Step S201 is in effect a disk state management operation that marks the faulty disk with the hang fault state. Once a hard disk has been marked with the hang fault state, it is not allowed to be marked as normal again, which prevents the hang fault from recurring. Step S203 is in effect a resource cleanup operation for the hung disk. As described in the detection embodiment, when the target hard disk has a hang fault there must be at least one lagging request, which means that the management process of the target hard disk is hung. Step S203 cleans up the system resources occupied by this hung management process, for example by closing the file handles it has opened.
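The one-way nature of the state marking in step S201 can be shown with a small sketch. The names `DiskState` and `DiskStatus` are hypothetical; the only behaviour taken from the text is that a disk marked as hung may not be marked as normal again.

```python
from enum import Enum


class DiskState(Enum):
    NORMAL = "normal"
    HUNG = "hung"


class DiskStatus:
    """Tracks the state of one disk; the hang fault state is sticky."""

    def __init__(self, disk_id):
        self.disk_id = disk_id
        self._state = DiskState.NORMAL

    @property
    def state(self):
        return self._state

    def mark_hung(self):
        # S201: mark the disk with the hang fault state.
        self._state = DiskState.HUNG

    def mark_normal(self):
        # A disk that has been marked as hung must never become normal again.
        if self._state is DiskState.HUNG:
            raise RuntimeError(f"disk {self.disk_id} is hung and cannot be re-marked as normal")
        self._state = DiskState.NORMAL
```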
Cleaning up system resources in step S203 has two effects. First, the system resources occupied by the hung management process can be reallocated to other processes that request them. Second, once the resources occupied by the hung process have been cleaned up, the hung management process exits automatically, which releases its hung state, and a new management process can then be created and started to manage the target hard disk.
As can be seen, the hard disk hang fault handling method of the embodiment marks the disk's state so that the faulty disk is not accessed again, and cleans up the system resources occupied on behalf of the faulty disk so that other processes can be allocated those resources. This reduces the adverse effects of the hang fault and limits the damage. Moreover, the handling method requires neither a detection tool supplied by the disk vendor, nor new hardware on the disk, nor human intervention; it is simple to implement and does not affect the production or usage cost of the hard disk.
In one possible embodiment of the application, after the state of the target hard disk has been marked as the hang fault state in step S201, the following step may also be performed: save the hang fault state to a normal hard disk.
The normal hard disk may be the system disk of the distributed storage system as a whole, or another hard disk that has a communication connection with the target hard disk. Saving the state in this way synchronizes it directly between hard disks and ensures that a hung disk is not used again even if it temporarily becomes available, which prevents the hang fault from recurring.
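One way to save the hang fault state to a normal hard disk is to keep a small state file there. The sketch below assumes a JSON file holding the identifiers of hung disks; the file format, the function names, and the atomic-rename write strategy are illustrative assumptions rather than details from the patent.

```python
import json
import os
import tempfile


def load_hung_disks(state_path):
    """Return the set of disk ids recorded as hung (empty if no state file exists)."""
    try:
        with open(state_path, "r", encoding="utf-8") as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()


def save_hung_state(state_path, disk_id):
    """Record disk_id as hung in a state file that lives on a healthy disk."""
    hung = load_hung_disks(state_path)
    hung.add(disk_id)
    # Write to a temporary file in the same directory, then rename it into
    # place, so a crash cannot leave a half-written state file behind.
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(sorted(hung), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, state_path)
```

The sketch deliberately offers no function for removing a disk from the file, which mirrors the rule that a hung disk is never re-marked as normal.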
Further, in one possible embodiment of the application, cleaning up the system resources occupied by the hung management process in step S203 may specifically include the following steps:
S2031: allocate new memory, and perform the following steps S2032 and S2033 in the new memory.
The application performs the specific cleanup steps in the newly allocated memory, rather than in the memory originally allocated to the management process of the target hard disk.
S2032: find all memory segments occupied by the hung process.
The memory space allocated to the management process of the target hard disk usually consists of multiple memory segments, and a complete cleanup requires finding all of them. On the Linux operating system, the memory segments can be obtained from the file /proc/self/smaps.
S2033: release the memory mapping corresponding to each memory segment.
When system resources are allocated to a management process, a mapping is typically established by an mmap operation between a system resource (such as a file) and a memory segment. Accordingly, when the memory usage is cleaned up, the memory mapping of each memory segment can be released by munmap.
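Step S2032 can be illustrated by parsing text in the smaps format. The sketch below only enumerates segments and does not unmap anything; in a real implementation each (start, length) pair would be passed to the munmap system call, typically from C. The function name `parse_segments` and the sample text are hypothetical, and the sample is abbreviated compared with a real /proc/self/smaps file.

```python
import re

# A mapping header line looks like:
#   00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
_HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+) ([rwxsp-]{4}) ")


def parse_segments(smaps_text):
    """Return (start_address, length_in_bytes, permissions) for every mapping."""
    segments = []
    for line in smaps_text.splitlines():
        match = _HEADER.match(line)
        if match:
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            segments.append((start, end - start, match.group(3)))
    return segments


SAMPLE_SMAPS = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
Size:                328 kB
Rss:                 228 kB
VmFlags: rd ex mr mw me dw
7f2c0a000000-7f2c0a021000 rw-p 00000000 00:00 0
Size:                132 kB
"""

for start, length, perms in parse_segments(SAMPLE_SMAPS):
    print(f"segment at {start:#x}, {length} bytes, {perms}")
```

On a Linux machine the same function can be applied to the contents of /proc/self/smaps to list the segments of the current process.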
It can be seen that, above-mentioned steps S2032 and S2033 actual is to perform cleaning by new internal memory to be tangled shared by managing process Memory source operation, the implementation procedure of the operation also without extra hardware toolses and human intervention, it is simple easily OK;And relative to the operation is directly performed in the internal memory that the managing process of the target hard disk was allocated originally, can avoid The thread for performing the cleanup step is also tangled with managing process.
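As a rough illustration of steps S2032 and S2033, the following Python sketch lists the memory segments found in smaps-style text and shows one mmap/munmap round trip. The function names (`parse_segments`, `read_own_segments`, `map_and_release`) are hypothetical and do not come from the application. The sketch only demonstrates enumerating segments and releasing a single mapping; it does not reproduce the cleanup itself, which would unmap the segments of the hung process from code running in newly requested memory.

```python
import mmap
import re

# Mapping header lines in /proc/<pid>/smaps look like:
#   7f2c1a000000-7f2c1a021000 rw-p 00000000 00:00 0    [heap]
# Detail lines such as "Size:  132 kB" do not match this pattern.
_HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S{4})\s")


def parse_segments(smaps_text):
    """Return (start, end, perms) for every mapping header in smaps text."""
    segments = []
    for line in smaps_text.splitlines():
        match = _HEADER.match(line)
        if match:
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            segments.append((start, end, match.group(3)))
    return segments


def read_own_segments(path="/proc/self/smaps"):
    """Linux only: list the memory segments of the current process."""
    with open(path) as smaps:
        return parse_segments(smaps.read())


def map_and_release(size=mmap.PAGESIZE):
    """Create an anonymous mapping (mmap) and release it again (munmap)."""
    region = mmap.mmap(-1, size)
    region[:5] = b"hello"
    region.close()  # closing the object releases the mapping
    return region.closed
```

On Linux, `read_own_segments()` returns one tuple per segment of the calling process; a real cleanup routine would walk such a list and release each mapping in turn.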
Referring to Fig. 4, in another feasible embodiment of the present application, the above hang-fault handling method for a hard disk of a distributed storage system further includes the following step:
S202: eject each access request cached in the input queue of the target hard disk, and return fault information of the target hard disk.
Because the management process of the target hard disk is hung, the requests cached in the input queue of the target hard disk (i.e. requests that have not yet been processed) cannot be processed either. This embodiment ejects these access requests and returns fault information of the target hard disk, such as "hard disk error", to the user, so that the user does not keep waiting for responses to unprocessed requests and does not send further access requests to the target hard disk.
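Step S202 can be pictured with the following minimal Python sketch, which assumes the input queue is an in-memory FIFO and that each request can be answered through a callback. The names `eject_pending` and `reply` are hypothetical; the application does not specify how the fault information is delivered.

```python
from collections import deque


def eject_pending(input_queue, reply, message="hard disk error"):
    """Pop every cached request in FIFO order and answer it with an error.

    input_queue: deque of pending requests (assumed representation).
    reply: callable(request, message) that delivers the fault information.
    Returns the number of requests that were ejected.
    """
    ejected = 0
    while input_queue:
        request = input_queue.popleft()
        reply(request, message)
        ejected += 1
    return ejected
```

For example, calling `eject_pending(deque(["r1", "r2"]), print)` would print each request together with the fault message and leave the queue empty.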
Still referring to Fig. 4, in another feasible embodiment of the present application, the above handling method further includes:
S204: each time the management process of the target hard disk is started, determine the state of the target hard disk, and forbid access to the target hard disk when its state is the hang-fault state.
Step S204 above supervises the availability of the target hard disk. It may be performed when a new hard disk is enabled, to start the hang-fault detection steps for the target hard disk and thereby supervise its availability in real time. It may also be performed after step S203 above, i.e. after the management process of the faulty hard disk has been restarted: because the faulty hard disk was marked as being in the hang-fault state in step S201, step S204 can reject all access requests directed at the faulty hard disk, which prevents it from being accessed again and hanging the process again.
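The interplay between the persisted hang-fault state and the check in step S204 can be sketched as follows. This is only an illustration under stated assumptions: the state is kept in a JSON file located on a normal hard disk, and the names `mark_hung`, `load_states`, `usable_disks` and the value `"hang"` are hypothetical rather than taken from the application.

```python
import json
import os

HANG = "hang"  # assumed label for the hang-fault state


def load_states(state_path):
    """Read the persisted disk states; no file means no disk is marked."""
    if not os.path.exists(state_path):
        return {}
    with open(state_path) as state_file:
        return json.load(state_file)


def mark_hung(state_path, disk_id):
    """Record on a normal disk that disk_id is in the hang-fault state."""
    states = load_states(state_path)
    states[disk_id] = HANG
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as state_file:
        json.dump(states, state_file)
    os.replace(tmp_path, state_path)  # swap in the new file atomically


def usable_disks(state_path, disk_ids):
    """At management-process start, keep only disks not marked as hung."""
    states = load_states(state_path)
    return [disk for disk in disk_ids if states.get(disk) != HANG]
```

Because the mark lives outside the hung disk and outside the process memory, it survives the restart of the management process, which is what lets the new process refuse access to the faulty disk.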
In addition, Fig. 5 illustrates, in the form of a sequence diagram, the hang-fault detection and handling flow for a hard disk of a distributed storage system described in the embodiments of the present application. Referring to Fig. 5, after the management process of a data storage node starts, it creates and starts a hung-disk detection thread, which periodically checks whether any disk in the data storage node has a hang fault (a Hang disk). The detection performed by this thread is as follows: for each IO thread of each disk, check whether there is a request that has not returned an execution result for a long time (i.e. a lagging request). If some IO thread of disk X has a lagging request, disk X is hung; a hung-disk cleanup thread is then started, which clears the resources occupied by the management process of the whole data storage node, such as memory and functional dependencies (Functional Dependency, FD), and records on the system disk that the state of disk X is the hang-fault state (Hang state). The current management process is then restarted to obtain a new management process. After the new management process starts, it first identifies the state of each disk in the storage node, so as to disable (ignore) any disk marked as being in the Hang state.
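One pass of the hung-disk detection thread described for Fig. 5 might look like the sketch below. It assumes that each IO thread exposes the start time of the request it is currently executing (or `None` when idle); the function name `find_hung_disks` and the shape of the `inflight` mapping are hypothetical.

```python
def find_hung_disks(inflight, now, threshold):
    """Return the ids of disks that have at least one lagging request.

    inflight: {disk id: {IO thread id: start time of the request being
              executed, or None when the thread is idle}} (assumed shape).
    now: current time, in the same unit as the start times.
    threshold: longest execution time that is still considered normal.
    """
    hung = []
    for disk_id, io_threads in inflight.items():
        for started_at in io_threads.values():
            if started_at is not None and now - started_at > threshold:
                hung.append(disk_id)
                break  # one lagging request is enough to flag the disk
    return hung
```

A detection thread would call such a function at a fixed interval and, for each disk returned, trigger the cleanup, state recording and restart described above.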
From the description of the method embodiments above, those skilled in the art will clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present application, or the part that contributes over the prior art, can be embodied in the form of a software product that is stored in a storage medium and includes several instructions that cause a distributed storage system to perform all or some of the steps of the methods described in the embodiments of the present application. The storage medium includes any medium that can store data and program code, such as read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
Fig. 6 is a structural block diagram of a hang-fault detection apparatus for a hard disk of a distributed storage system according to an exemplary embodiment of the present application. As shown in Fig. 6, the detection apparatus includes a detection unit 101 and a comparing unit 102.
The detection unit 101 is configured to detect the execution time of each access request corresponding to the target hard disk.
The comparing unit 102 is configured to judge whether there is a lagging request whose execution time exceeds the corresponding preset threshold and, if there is such a lagging request, to determine that the target hard disk has a hang fault.
As can be seen from the above technical solution, the embodiment of the present application judges whether the target hard disk has a hang fault by detecting the execution time of the access requests corresponding to the target hard disk, so that a hang fault of the target hard disk can be discovered and handled in time. Moreover, the embodiment does not depend on detection tools provided by the hard disk vendor, requires no new hardware on the hard disk and no human intervention, is simple to carry out, and does not affect the production or usage cost of the hard disk.
In one feasible embodiment of the present application, the above detection apparatus may further include a process management unit. The process management unit is configured to create an IO thread group corresponding to the target hard disk, and to read and process each access request corresponding to the target hard disk through the IO thread group, so as to complete read and write operations on the target hard disk.
In another feasible embodiment of the present application, the detection unit 101 in the above detection apparatus may be specifically configured to detect the execution time of the access request at the head of the input queue of the target hard disk.
That is, when access requests are cached in the input queue of the target hard disk under the FIFO rule, the management process of the target hard disk (more specifically, the IO thread group described above) reads access requests only from the head of the input queue and starts executing them. Therefore, when the access request at the head of the queue is read, timing of its execution starts and continues until the access request finishes executing. If the timer reaches the preset threshold and the access request has still not finished, its execution time exceeds the preset threshold; the access request can then be judged to be a lagging request, and the corresponding target hard disk can be judged to have a hang fault.
It can be seen that, by executing access requests in the order in which they leave the input queue of the target hard disk and timing their execution, the embodiment of the present application can accurately obtain the execution time of each access request, discover lagging requests in time, and determine hard disk hang faults, which lays the foundation for promptly handling a hard disk that has a hang fault.
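The head-of-queue timing described above can be sketched with a small Python class. The class name `TimedInputQueue`, its methods, and the injectable `clock` parameter are hypothetical choices made for illustration; the application only specifies that timing starts when the request at the head of the FIFO queue is read and stops when it finishes.

```python
import time
from collections import deque


class TimedInputQueue:
    """FIFO input queue that times the request currently being executed."""

    def __init__(self, threshold, clock=time.monotonic):
        self._queue = deque()
        self._threshold = threshold  # preset threshold for execution time
        self._clock = clock          # injectable so the timer can be tested
        self._started_at = None      # None while no request is executing

    def put(self, request):
        self._queue.append(request)

    def start_next(self):
        """Read the request at the head of the queue and start its timer."""
        request = self._queue.popleft()
        self._started_at = self._clock()
        return request

    def finish(self):
        """Stop the timer once the current request has finished executing."""
        self._started_at = None

    def is_stalled(self):
        """True if the executing request has run longer than the threshold."""
        if self._started_at is None:
            return False
        return self._clock() - self._started_at > self._threshold
```

An IO thread would call `start_next()` and `finish()` around each request, while a separate detection thread polls `is_stalled()`; a `True` result corresponds to a lagging request and therefore to a hang fault on the disk.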
Fig. 7 is a structural block diagram of a hang-fault handling apparatus for a hard disk of a distributed storage system according to an exemplary embodiment of the present application. As shown in Fig. 7, the handling apparatus includes a state management unit 201 and a resource cleanup unit 203.
The state management unit 201 is configured to, when a hard disk has a hang fault, mark the state of the target hard disk that has the hang fault as the hang-fault state.
The resource cleanup unit 203 is configured to clear the system resources occupied by the hung management process corresponding to the target hard disk, so that a new management process for managing the target hard disk can be started.
As can be seen from the above technical solution, the hang-fault handling apparatus provided by the embodiment of the present application on the one hand prevents the faulty hard disk from being accessed again by marking its state, and on the other hand clears the system resources occupied on behalf of the faulty hard disk so that they can be reallocated to other processes, which reduces the adverse effects a hard disk hang fault may cause and limits the loss. Moreover, the handling apparatus does not depend on detection tools provided by the hard disk vendor, requires no new hardware on the hard disk and no human intervention, is simple to carry out, and does not affect the production or usage cost of the hard disk.
In one feasible embodiment of the present application, after marking the state of the target hard disk as the hang-fault state, the above state management unit 201 may also save the hang-fault state to a normal hard disk.
By synchronizing the state directly between different hard disks, this embodiment ensures that a hung hard disk cannot be used again even if it temporarily becomes available, which avoids a recurrence of the hang fault.
In one feasible embodiment of the present application, to clear the system resources occupied by the hung management process of the target hard disk, the resource cleanup unit 203 is specifically configured to apply for new memory and to perform the following two operations through the new memory, so as to clear the memory resources occupied by the hung management process: look up all memory segments occupied by the hung process, and release the memory mapping corresponding to each memory segment.
Referring to Fig. 8, in another feasible embodiment of the present application, the above fault handling apparatus may further include a request cleanup unit 202. The request cleanup unit 202 is configured to eject each access request cached in the input queue of the target hard disk and to return fault information of the target hard disk.
Still referring to Fig. 8, the above fault handling apparatus may further include an availability supervision unit 204. The availability supervision unit 204 is configured to determine the state of the target hard disk each time the management process of the target hard disk is started, and to forbid access to the target hard disk when its state is the hang-fault state.
It can be seen that the above availability supervision unit makes it possible to supervise the availability of the target hard disk in real time and, when the target hard disk has a hang fault, to reject all access requests directed at the faulty hard disk, which prevents it from being accessed again and hanging the process again.
The embodiments in this specification are described progressively; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus or system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, see the corresponding description of the method embodiments.
The above are only specific embodiments of the present application, which enable those skilled in the art to understand or implement the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A hang-fault detection method for a hard disk of a distributed storage system, characterized in that it includes:
detecting the execution time of each access request corresponding to a target hard disk;
judging whether there is a lagging request whose execution time exceeds the corresponding preset threshold;
if there is a lagging request, determining that the target hard disk has a hang fault.
2. The detection method according to claim 1, characterized in that it further includes:
creating an IO thread group corresponding to the target hard disk;
reading and processing each access request corresponding to the target hard disk through the IO thread group, so as to complete read and write operations on the target hard disk.
3. The detection method according to claim 1 or 2, characterized in that detecting the execution time of each access request corresponding to the target hard disk includes:
detecting the execution time of the access request at the head of the input queue of the target hard disk.
4. A hang-fault handling method for a hard disk of a distributed storage system, characterized in that it includes:
when a target hard disk has a hang fault, marking the state of the target hard disk as the hang-fault state;
clearing the system resources occupied by the hung management process corresponding to the target hard disk, so as to start a new management process for managing the target hard disk.
5. The fault handling method according to claim 4, characterized in that clearing the system resources occupied by the hung management process corresponding to the target hard disk includes:
applying for new memory, and performing the following two operations through the new memory, so as to clear the memory resources occupied by the hung management process:
looking up all memory segments occupied by the hung process;
releasing the memory mapping corresponding to each memory segment.
6. The fault handling method according to claim 4 or 5, characterized in that, before clearing the system resources occupied by the hung management process corresponding to the target hard disk, it further includes:
ejecting each access request cached in the input queue of the target hard disk, and returning fault information of the target hard disk.
7. The fault handling method according to claim 4 or 5, characterized in that it further includes:
each time the management process of the target hard disk is started, determining the state of the target hard disk;
if the state of the target hard disk is the hang-fault state, forbidding access to the target hard disk.
8. The fault handling method according to claim 4 or 5, characterized in that it further includes:
saving the hang-fault state of the target hard disk to a normal hard disk.
9. A hang-fault detection apparatus for a hard disk of a distributed storage system, characterized in that it includes:
a detection unit, configured to detect the execution time of each access request corresponding to a target hard disk;
a comparing unit, configured to judge whether there is a lagging request whose execution time exceeds the corresponding preset threshold and, if there is such a lagging request, to determine that the target hard disk has a hang fault.
10. The fault detection apparatus according to claim 9, characterized in that it further includes:
a process management unit, configured to create an IO thread group corresponding to the target hard disk, and to read and process each access request corresponding to the target hard disk through the IO thread group, so as to complete read and write operations on the target hard disk.
11. The fault detection apparatus according to claim 9 or 10, characterized in that, to detect the execution time of each access request corresponding to the target hard disk, the detection unit is specifically configured to:
detect the execution time of the access request at the head of the input queue of the target hard disk.
12. A hang-fault handling apparatus for a hard disk of a distributed storage system, characterized in that it includes:
a state management unit, configured to mark the state of a target hard disk as the hang-fault state when the target hard disk has a hang fault;
a resource cleanup unit, configured to clear the system resources occupied by the hung management process corresponding to the target hard disk, so as to start a new management process for managing the target hard disk.
13. The fault handling apparatus according to claim 12, characterized in that, to clear the system resources occupied by the hung management process of the target hard disk, the resource cleanup unit is specifically configured to:
apply for new memory, and perform the following two operations through the new memory, so as to clear the memory resources occupied by the hung management process: look up all memory segments occupied by the hung process, and release the memory mapping corresponding to each memory segment.
14. The fault handling apparatus according to claim 12 or 13, characterized in that it further includes:
a request cleanup unit, configured to eject each access request cached in the input queue of the target hard disk and to return fault information of the target hard disk.
15. The fault handling apparatus according to claim 12 or 13, characterized in that it further includes:
an availability supervision unit, configured to determine the state of the target hard disk each time the management process of the target hard disk is started, and to forbid access to the target hard disk when its state is the hang-fault state.
16. The fault handling apparatus according to claim 12 or 13, characterized in that the state management unit is further configured to:
save the hang-fault state of the target hard disk to a normal hard disk.
CN201610212740.0A 2016-04-07 2016-04-07 Distributed memory system hard disk tangles fault detect, processing method and processing device Pending CN107273231A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201610212740.0A CN107273231A (en) 2016-04-07 2016-04-07 Distributed memory system hard disk tangles fault detect, processing method and processing device
TW106107797A TW201737111A (en) 2016-04-07 2017-03-09 Method and device for detecting and processing hard disk hanging fault in distributed storage system
PCT/CN2017/077995 WO2017173927A1 (en) 2016-04-07 2017-03-24 Method and device for detecting and processing hard disk hanging fault in distributed storage system

Publications (1)

Publication Number Publication Date
CN107273231A true CN107273231A (en) 2017-10-20

Family

ID=60000846

Country Status (3)

Country Link
CN (1) CN107273231A (en)
TW (1) TW201737111A (en)
WO (1) WO2017173927A1 (en)

Also Published As

Publication number Publication date
WO2017173927A1 (en) 2017-10-12
TW201737111A (en) 2017-10-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 1245441; Country of ref document: HK)
RJ01 Rejection of invention patent application after publication (Application publication date: 20171020)