CN101127233A - Hard disk error detection and fault-tolerance method for streaming media applications - Google Patents

Publication number
CN101127233A
Authority
CN
China
Prior art keywords
hard disk
streaming media
block
timeout
media service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007101612128A
Other languages
Chinese (zh)
Other versions
CN100595839C (en)
Inventor
陈俊楷
谢主中
皮佩文
曾文涛
陶宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ud Network Co ltd
Original Assignee
UTStarcom Telecom Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by UTStarcom Telecom Co Ltd
Priority to CN200710161212A
Publication of CN101127233A
Application granted
Publication of CN100595839C
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757 Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0727 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a storage system, e.g. in a DASD or network based storage system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 Performance evaluation by statistical analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a hard disk fault detection and fault-tolerance method based on streaming media service applications, comprising: monitoring each I/O operation of the streaming media service application; compiling statistics on the monitoring results of multiple I/O operations within a certain duration; determining, from the monitoring result of each I/O operation or from the statistics over multiple I/O operations, whether a hard disk fault has occurred, namely hard disk damage, a stored-content error, a bad block, or a transient fault; and performing the corresponding fault-tolerant processing in combination with the streaming media service application. The method can quickly and accurately detect problems that seriously affect streaming media service applications but are not easily found by traditional hard disk detection techniques such as S.M.A.R.T. It can find the various faults of a hard disk accurately and quickly and take the corresponding fault-tolerance measures, avoiding or reducing the impact on the streaming media service system and thus improving the reliability and stability of the system.

Description

Hard disk error detection and fault-tolerance method in streaming media applications
Technical field
The present invention relates to storage technology for streaming media and IPTV (Internet Protocol television), and in particular to a method for detecting and tolerating hard disk errors in streaming media and IPTV storage.
Background technology
In high-bit-rate streaming media applications, the real-time nature of the service places very high demands on the performance, stability, and reliability of the storage system. Under the sustained, heavy load of a streaming media service, the hard disks used as storage media frequently suffer transient errors, bad blocks, performance degradation, and content errors, all of which seriously affect the service.
A transient error of a hard disk means that, because of the surrounding environment or the disk itself, an application spends a short period, for example several seconds to tens of seconds, waiting for the result of a read or write I/O operation. Those skilled in the art will appreciate that for most general-purpose services the integrity and reliability of data matter most, and an occasional wait while reading data is not a problem. For a streaming media service, however, waiting for data (that is, the state in which the application waits for the result of a read or write I/O operation) interrupts or stalls the stream and gives the user a very poor experience, so waiting for data is a serious problem in streaming media applications. Waiting on writes is similar: because the data volume of a streaming media application is very large, a long write wait can cause the memory buffer to overflow.
A bad block of a hard disk is a block for which read or write I/O operations regularly time out or directly return failure. Regular read or write timeouts on a block cause recurring transient errors and seriously affect the streaming media service. Likewise, regular read failures seriously affect any streaming media service that plays the affected content repeatedly. Regular write failures on a block must also be found and avoided early, because once a write to a block has failed, subsequent reads of that block are likely to go wrong.
Performance degradation of a hard disk means that its average read or write performance can no longer meet the needs of the streaming media service. The read/write I/O capability of the disks usually limits the capacity of the streaming media service, so that capacity must be adjusted in real time according to the disks' I/O capability. When the I/O capability of a disk falls below a certain level, the disk must be taken out of service.
A content error means that the content read from the hard disk is wrong. Apart from causes in the disk itself, wrong content is often caused by a problem in a software module. Whatever the cause, a content error likewise seriously affects the streaming media service.
On the other hand, S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology), the industry-standard hard disk detection technology, can usually detect only some physically damaged disks. Practical experience and experimental data show that S.M.A.R.T. often fails to detect the problems described above, namely transient errors, bad blocks, performance degradation, and content errors, or does not regard them as errors. Such problems therefore cannot be found in time, and the corresponding fault-tolerance measures cannot be taken.
Summary of the invention
To solve the above problems in the prior art, the present invention provides a hard disk error detection and fault-tolerance method based on streaming media service applications. The method comprises: monitoring each I/O operation of the streaming media service application, where the monitoring covers the waiting time of the I/O operation, the result that the system returns for it, and, for a read operation, the result of a quick check of the data; compiling statistics on the monitoring results of multiple I/O operations within a predetermined duration; determining, from the monitoring result of each I/O operation or from the statistics over multiple I/O operations, whether a hard disk error has occurred, namely a transient error, a bad block, a stored-content error, or hard disk damage; and performing the corresponding fault-tolerant processing in combination with the streaming media service application.
The method can quickly and accurately detect problems that seriously affect streaming media service applications yet are not easily found by traditional hard disk detection techniques such as S.M.A.R.T. With the method, the various errors of a hard disk can be found quickly and accurately and targeted fault-tolerance measures can be taken, avoiding or reducing the impact on the streaming media service system and thereby improving the system's reliability and stability.
Description of drawings
These and other features, advantages, and benefits of the present invention will become clearer from the following description of preferred embodiments, given by way of non-limiting example, and from the accompanying drawings, in which:
Fig. 1 is a schematic flow chart of the hard disk error detection and fault-tolerance method based on streaming media service applications according to the present invention;
Fig. 2 is a schematic flow chart of hard disk transient error handling according to the present invention;
Fig. 3 is a schematic flow chart of hard disk bad block handling according to the present invention;
Fig. 4 is a schematic flow chart of hard disk stored-content error handling according to the present invention; and
Fig. 5 is a schematic flow chart of hard disk damage handling according to the present invention.
Embodiment
Preferred embodiments of the present invention are described below with reference to the drawings. It should be understood that the preferred embodiments described here are not limiting; those skilled in the art can, following the principles of the invention, make various modifications and improvements without departing from the scope of protection defined by the appended claims.
The present invention provides a hard disk error detection and fault-tolerance method based on streaming media service applications. The method can quickly and accurately detect problems, including hard disk transient errors, bad blocks, performance degradation, and stored-content errors, that seriously affect the streaming media service yet are not easily found by traditional hard disk detection techniques such as S.M.A.R.T. The invention also performs the corresponding fault-tolerant processing in combination with the streaming media service application, thereby avoiding or reducing the impact on the streaming media service system and improving the system's reliability and stability.
Those skilled in the art will recognize that a server (also called a node) providing a streaming media application may have multiple hard disks. For a given streaming media service application, a backup of its content may be stored on the same server (the local node) or on other servers (other nodes).
Those skilled in the art will also appreciate that each hard disk is divided into multiple blocks, and that every read or write I/O operation issued while running the streaming media application is performed on a block basis.
According to the present invention, a hard disk error detection and fault-tolerance method based on streaming media service applications is provided. The method comprises: monitoring each I/O operation of the streaming media service application, where the monitoring covers the waiting time of the I/O operation, the result that the system returns for it, and, for a read operation, the result of a quick check of the data; compiling statistics on the monitoring results of multiple I/O operations within a predetermined duration; determining, from the monitoring result of each I/O operation or from the statistics over multiple I/O operations, whether a hard disk error has occurred, namely a transient error, a bad block, a stored-content error, or hard disk damage; and performing the corresponding fault-tolerant processing in combination with the streaming media service application.
To detect transient errors, the method preferably judges from the result and the duration of each disk read or write operation. Specifically, it determines whether the waiting time of each I/O operation has timed out or whether the operation has failed; the waiting time is compared with a preset threshold to decide whether the operation has timed out. If the waiting time has timed out, or the result the system returns for the operation is failure, a transient error is determined to have occurred.
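The timeout-or-failure test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names IoRecord, monitor_io, and is_transient_error and the 2-second threshold are hypothetical and do not come from the patent, which says only that the waiting time is compared with a preset threshold.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Hypothetical value; the patent only requires some preset threshold.
WAIT_THRESHOLD_S = 2.0


@dataclass
class IoRecord:
    op: str        # "read" or "write"
    block: int     # index of the block the operation touched
    wait_s: float  # measured waiting time of the operation
    ok: bool       # result the system returned for the operation


def monitor_io(op: str, block: int, do_io: Callable[[], bool]) -> IoRecord:
    """Time one I/O call and record what the system returned."""
    start = time.monotonic()
    try:
        ok = bool(do_io())
    except OSError:
        ok = False  # assumption: an I/O exception counts as a returned failure
    return IoRecord(op, block, time.monotonic() - start, ok)


def timed_out(rec: IoRecord, threshold_s: float = WAIT_THRESHOLD_S) -> bool:
    """Compare the waiting time with the preset threshold."""
    return rec.wait_s > threshold_s


def is_transient_error(rec: IoRecord, threshold_s: float = WAIT_THRESHOLD_S) -> bool:
    """A transient error is a timeout or a returned failure."""
    return timed_out(rec, threshold_s) or not rec.ok
```

Note that this sketch measures the wait only after the call returns; a real service would presumably need an asynchronous timer to react while the operation is still pending.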
According to one embodiment of the present invention, when the waiting time of an I/O operation on a hard disk is found to exceed the preset threshold, the corresponding fault-tolerant processing for a transient error is performed. Specifically, when the waiting time of a read operation exceeds the preset threshold, the streaming media service application is notified and all streaming media services that would read this disk are switched to other nodes that store the relevant content, so that reads from this disk do not stall or block the services. When a read operation does not time out but directly returns failure, the application is notified and only the one stream associated with that read is switched to another node. When the waiting time of a write operation exceeds the preset threshold, all data in the buffer waiting to be written to this disk is written to another hard disk in the node, and all subsequent writes destined for this disk are likewise redirected to other disks in the node, so that a long wait does not overflow the write buffer. When a write operation does not time out but directly returns failure, only the block being written is bypassed, and a new block is requested on the same disk for the write.
When a transient error is detected while the streaming media application is running, the method further determines whether the read/write operation on the block has failed, or whether the cumulative count of read/write timeouts on the block exceeds a preset threshold. If the cumulative count of failures or timeouts for the block exceeds the threshold, the block is determined to be a bad block, and the corresponding fault-tolerant processing for the bad block is performed.
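A minimal sketch of the per-block accounting follows. The class name BadBlockTracker, its methods, and the default threshold of 3 are assumptions made for illustration; the patent specifies only that a cumulative count is compared with a preset threshold.

```python
from collections import Counter


class BadBlockTracker:
    """Counts failed or timed-out I/O operations per block (hypothetical helper)."""

    def __init__(self, threshold: int = 3):  # the threshold value is an assumption
        self.threshold = threshold
        self.errors = Counter()  # block index -> cumulative error count
        self.bad = set()         # blocks already marked bad

    def record_error(self, block: int) -> bool:
        """Record one failure or timeout; return True once the block is bad."""
        self.errors[block] += 1
        if self.errors[block] > self.threshold:
            self.bad.add(block)
        return block in self.bad

    def disk_unusable(self, max_bad_blocks: int) -> bool:
        """True when the number of bad blocks on the disk exceeds its own threshold."""
        return len(self.bad) > max_bad_blocks
```

Keeping the counter per block rather than per disk is what separates a recurring fault on one block from an isolated transient error elsewhere on the disk.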
According to one embodiment of the present invention, when a bad block is determined to have occurred, the corresponding fault-tolerant processing is performed. A bad block that was caused by write failures or timeouts and holds no content (media file) only needs to be marked as bad and never used again. For a bad block caused by read failures or timeouts, the content in the block must be recovered. According to one embodiment, recovery allocates a new block on the same disk, copies the backup of the content held in the bad block, and stores the copy in the newly allocated block. According to one embodiment, if the content of the bad block has a backup on another hard disk within the node, the content is copied from that backup. Otherwise, another node that stores the content is located by lookup, and the content is copied from that node and stored in the newly allocated block. If the number of bad blocks on a hard disk exceeds a threshold, the disk is no longer fit for use and the hard disk damage handling flow is entered.
For the problem of wrong content read from the disk, a general-purpose checksum algorithm is clearly unsuitable, for reasons of efficiency and performance, in a streaming media application under very heavy I/O load. The invention instead works together with the streaming media application module: special check information is inserted at a specific position in each block of the disk, and a quick check is performed after every read. Those skilled in the art may use any of various existing check methods; the inventive point is that, in cooperation with the streaming media application module, special check information is inserted at a specific position in each disk block. Specifically, while the streaming media service application is running, special check information is inserted at a specific position in each block, and the content read is checked against the inserted check information.
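The patent does not disclose what the check information contains or where it is placed, so the following is only one possible sketch. It assumes an 8-byte tag (a marker value plus the block index) stored at offset 0 of each block; MAGIC, TAG_OFFSET, and all function names are hypothetical.

```python
import struct

MAGIC = 0x53544D31          # hypothetical marker value
TAG_OFFSET = 0              # hypothetical "specific position" inside a block
TAG = struct.Struct("<II")  # marker, block index (both unsigned 32-bit)


def stamp_block(payload: bytes, block_index: int) -> bytes:
    """Return the block image with the check tag inserted at TAG_OFFSET."""
    tag = TAG.pack(MAGIC, block_index)
    return payload[:TAG_OFFSET] + tag + payload[TAG_OFFSET:]


def quick_check(block: bytes, block_index: int) -> bool:
    """Verify only the small tag instead of checksumming the whole payload."""
    if len(block) < TAG_OFFSET + TAG.size:
        return False
    marker, index = TAG.unpack_from(block, TAG_OFFSET)
    return marker == MAGIC and index == block_index


def strip_tag(block: bytes) -> bytes:
    """Remove the tag and return the original media payload."""
    return block[:TAG_OFFSET] + block[TAG_OFFSET + TAG.size:]
```

The cost of this check is constant per read regardless of block size, which is the property the text asks for. The trade-off in this particular sketch is that it catches overwritten or misplaced blocks but not bit errors inside the payload.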
When a stored-content error is detected, the main handling is to notify the streaming media service application, switch the stream associated with the read operation to another node, and perform data recovery. According to one embodiment of the present invention, the data recovery process further comprises: looking up the node that stores another backup of the media file; sending a file copy recovery request to that node; and copying the corresponding data and writing it back to the block.
Detection of hard disk I/O performance degradation is likewise based on the duration of each I/O read or write operation: the average read and write performance of the disk over a period of time is calculated, and, while it remains within the permitted range, the I/O performance is fed back periodically to the streaming media service application module. Preferably, detection of I/O performance degradation may also draw on the detection of stored-content errors and bad blocks. Alternatively, when the total number of bad blocks on the disk exceeds a preset threshold, the disk can be judged directly to be unable to serve normally. When the I/O performance of the disk is found to be below its threshold, or the number of bad blocks exceeds its threshold, so that the disk cannot continue to serve, the hard disk damage handling flow is entered.
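One way to compute such a rolling average is a sliding time window, sketched below. The window length, the minimum throughput, and the choice of bytes per second of busy time as the performance measure are all assumptions; the patent does not define the metric.

```python
from collections import deque


class IoPerformanceWindow:
    """Average disk throughput over the most recent window (hypothetical helper)."""

    def __init__(self, window_s: float = 10.0, min_bytes_per_s: float = 1e6):
        self.window_s = window_s
        self.min_bytes_per_s = min_bytes_per_s
        self.samples = deque()  # (timestamp, bytes transferred, wait in seconds)

    def add(self, now: float, nbytes: int, wait_s: float) -> None:
        """Add one completed I/O operation and drop samples older than the window."""
        self.samples.append((now, nbytes, wait_s))
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()

    def average_bytes_per_s(self) -> float:
        """Bytes moved per second of I/O waiting time within the window."""
        busy = sum(w for _, _, w in self.samples)
        total = sum(n for _, n, _ in self.samples)
        return total / busy if busy > 0 else 0.0

    def too_slow(self) -> bool:
        """True when the windowed average falls below the configured minimum."""
        return bool(self.samples) and self.average_bytes_per_s() < self.min_bytes_per_s
```

The value returned by average_bytes_per_s is what would be fed back periodically to the streaming media application module, and too_slow corresponds to the threshold comparison that leads to the hard disk damage flow.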
Hard disk damage handling mainly comprises notifying the streaming media service application to switch all related streaming media services, notifying the upper-layer storage management module to stop I/O operations on the disk, isolating the disk in the node's storage management system, and raising a user alarm requesting that the disk be replaced.
The specific embodiments of the present invention are described in further detail below with reference to the drawings.
Fig. 1 is a schematic diagram of the hard disk error detection and fault-tolerance method based on streaming media service applications according to the present invention. With reference to Fig. 1, the steps are as follows:
In step S101, each I/O operation of the streaming media service application is monitored; the monitoring covers the waiting time and the returned result of the I/O operation.
The process then enters decision step S102, which determines whether the waiting time of the I/O operation has timed out or whether the operation has failed. If the waiting time has timed out, or the result the system returns for the operation is failure, the process enters step S103; otherwise it enters step S107.
In step S103, the transient error caused by the I/O operation is handled so as to avoid affecting the streaming media service. The fault-tolerant processing for transient errors is described in detail with reference to Fig. 2.
Once a transient error has been found, the process enters step S104. In step S104, it is further determined whether the read/write operation on the block has failed or whether the cumulative count of read/write timeouts on the block exceeds the preset threshold. If the cumulative count exceeds the preset threshold, the process enters step S105; otherwise it enters step S111.
In step S105, the block is marked as a bad block and the related bad block handling is performed, so as to reduce or avoid the impact of the bad block on the streaming media service. The fault-tolerant processing for bad blocks is described in detail with reference to Fig. 3.
In step S106, it is determined whether the cumulative count of read/write timeouts on each block exceeds the preset threshold. If it does, the process enters step S112; otherwise it enters step S111.
On the other hand, if step S102 determines that the waiting time of the I/O operation has not timed out and that the result returned by the system is not failure, the process enters step S107, which determines whether the I/O operation is a read or a write. If it is a read, the process enters step S108; otherwise it enters step S111.
In step S108, a quick check is performed on the media data that was read.
The process then enters step S109, which determines whether the check found the data to be wrong. Preferably, according to one embodiment of the present invention, while the streaming media service application is running, special check information is inserted, in cooperation with the streaming media application module, at a specific position in each block; the content read is then checked against the inserted check information to determine whether the stored content is wrong. If the stored content is determined to be wrong, the process enters step S110; otherwise it enters step S111.
In step S110, the data with the stored-content error is handled so as to avoid affecting the streaming media service. The handling of stored-content errors is described in detail with reference to Fig. 4.
The method according to the invention further comprises updating the disk's average read or write performance over a period of time and feeding it back periodically to the streaming media application module, so that the disk's recent I/O read/write performance serves as one of the references for the capacity of the streaming media service, as shown in step S111.
In step S112, it is further determined whether the average read or write performance of the disk is below a threshold. If it is, the process enters step S113; otherwise the process ends and returns to step S101.
When the disk's performance is too low or it has too many bad blocks, the disk cannot continue to serve normally, and the related handling is performed to reduce or avoid the impact on the streaming media service, as shown in step S113. The handling of hard disk damage is described in detail with reference to Fig. 5.
Fig. 2 is a schematic diagram of hard disk transient error handling according to the present invention. Fig. 2 further illustrates the transient error handling of step S103 in Fig. 1. The transient error handling flow is as follows:
When the waiting time of an I/O operation times out or the I/O operation fails, the transient error handling flow is entered, as shown in step S201.
In step S202, it is determined whether the waiting time of the I/O operation timed out or the I/O operation failed. If the operation failed, the process enters step S203; if instead the waiting time timed out, the process enters step S206.
In step S203, it is determined whether the I/O operation is a read or a write. If it is a read, the process enters step S205; if it is a write, the process enters step S204.
In step S204, a new block is requested on the same disk and the write is performed there.
In step S205, the streaming media service application is notified, and only the stream associated with this read operation is switched to another node.
In step S206, it is determined whether the I/O operation is a read or a write. If it is a read, the process enters step S207; if it is a write, the process enters step S208.
In step S207, the streaming media service application is notified, and all streaming media services that would read this disk are switched to other nodes that store the relevant content.
In step S208, all data in the buffer waiting to be written to this disk is written to another hard disk in the node, and the flow ends.
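The four branches of Fig. 2 reduce to a two-by-two decision on the operation type and the kind of error. The sketch below encodes that decision; the Action enum and the function name are illustrative choices, while the step labels are those used in the text.

```python
from enum import Enum


class Action(Enum):
    ALLOCATE_NEW_BLOCK = "S204"             # write failed: retry on a new block
    SWITCH_THIS_STREAM = "S205"             # read failed: move one stream
    SWITCH_ALL_STREAMS_ON_DISK = "S207"     # read timed out: move every stream
    REDIRECT_WRITES_TO_OTHER_DISK = "S208"  # write timed out: use another disk


def transient_action(op: str, timed_out: bool) -> Action:
    """Pick the Fig. 2 branch for a failed or timed-out I/O operation."""
    if op not in ("read", "write"):
        raise ValueError(f"unknown operation: {op!r}")
    if timed_out:  # S206: timeout branch
        return (Action.SWITCH_ALL_STREAMS_ON_DISK if op == "read"
                else Action.REDIRECT_WRITES_TO_OTHER_DISK)
    # S203: the operation returned failure without timing out
    return Action.SWITCH_THIS_STREAM if op == "read" else Action.ALLOCATE_NEW_BLOCK
```

A timeout is treated as affecting the whole disk, so every stream or every pending write is moved, whereas a direct failure is treated as local to one stream or one block.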
Fig. 3 is a schematic diagram of hard disk bad block handling according to the present invention. Fig. 3 further illustrates step S105 in Fig. 1. The bad block handling flow is as follows:
In step S301, when the count of I/O operation failures or timeouts reaches the bad block threshold, the bad block handling flow is entered.
In step S302, the block is marked as a bad block and the bad block information, of which two copies are kept on the hard disk, is updated; the bad block will never be used again.
In step S303, it is determined whether the bad block holds stored content that needs to be recovered. A bad block caused by read operations requires data recovery, whereas one caused by write operations does not. If data recovery is needed, the process enters step S304; otherwise the flow ends.
For a bad block that needs data recovery, the process enters step S304, in which a new block is allocated on the same disk.
In step S305, it is determined whether the data in the bad block has a backup on another hard disk within the node. If it does, the process enters step S308; otherwise it enters step S306.
In step S306, the node that stores another backup of the media file is looked up. Because every media file in the streaming media service system is kept in at least two copies, this lookup always succeeds. Those skilled in the art will know that various lookup methods can be used.
In step S307, a file copy recovery request is sent to the node that stores the media file.
In step S308, the corresponding data is copied into the newly allocated block.
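The recovery path of steps S304 to S308 can be modelled with plain dictionaries, as in the sketch below. The data model (a content identifier, a list of free blocks, dictionaries standing in for local backups and remote nodes) is entirely an assumption for illustration; the patent leaves the lookup and copy mechanisms open.

```python
from typing import Dict, List, Optional


def recover_bad_block(
    content_id: str,
    bad_block: int,
    free_blocks: List[int],
    local_backups: Dict[str, bytes],
    remote_nodes: Dict[str, Dict[str, bytes]],
    disk: Dict[int, bytes],
) -> Optional[int]:
    """Copy the content of a bad block into a newly allocated block.

    Returns the new block index, or None if no backup could be found.
    """
    new_block = free_blocks.pop(0)            # S304: allocate a new block
    data = local_backups.get(content_id)      # S305: backup on another local disk?
    if data is None:
        for contents in remote_nodes.values():  # S306: locate a node with a backup
            if content_id in contents:
                data = contents[content_id]   # S307: request the copy
                break
    if data is None:
        free_blocks.insert(0, new_block)      # nothing to copy: return the block
        return None
    disk[new_block] = data                    # S308: store the copy in the new block
    disk.pop(bad_block, None)                 # the bad block is never used again
    return new_block
```

The caller is expected to invoke this only for bad blocks caused by read errors (step S303); a block that went bad on write holds no content to recover. The None return covers a case the text treats as impossible, since every media file is said to have at least two copies.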
Fig. 4 is a schematic diagram of the handling of hard disk stored-content errors according to the present invention. With reference to Fig. 4, step S110 in Fig. 1 is described in further detail. The flow for handling a stored-content error is as follows:
When an error is found by the fast verification while I/O data is being read, the stored-content error handling flow is entered, as shown in step S401.
At step S402, the streaming media service application is notified, and the stream associated with this read operation is switched to another node.
At step S403, the node locations of the other backups of the media file are looked up.
At step S404, a file copy recovery request is sent to a node that stores the media file.
At step S405, the corresponding data is copied, and the copied data is written back to the same block.
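As a non-limiting illustration, the flow of steps S401 to S405 can be sketched in Python as follows. The description does not specify the verification algorithm or where the check information is placed, so this sketch assumes a CRC32 (computed with the standard `zlib.crc32`) appended to the end of each block. The function names `pack_block`, `verify_block`, and `read_with_repair`, and the two callables `switch_stream` and `fetch_backup`, are hypothetical.

```python
import zlib


def pack_block(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 as check information (assumed layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify_block(raw: bytes) -> bool:
    """Fast verification: recompute the CRC32 and compare with the stored one."""
    if len(raw) < 4:
        return False
    payload, stored = raw[:-4], raw[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def read_with_repair(blocks, block_id, switch_stream, fetch_backup):
    """Read one block; on a content error follow steps S401 to S405."""
    raw = blocks[block_id]
    if verify_block(raw):
        return raw[:-4]                      # content is intact

    # S401: the fast verification found an error.
    switch_stream(block_id)                  # S402: move the affected stream away
    payload = fetch_backup(block_id)         # S403/S404: locate and request a backup
    blocks[block_id] = pack_block(payload)   # S405: write the data back to the same block
    return payload
```

Unlike the bad-block flow, the repaired data is written back to the original block, since a content error does not by itself mean the block is unusable.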
Fig. 5 is a schematic diagram of hard disk damage handling according to the present invention. With reference to Fig. 5, the hard disk damage handling shown in step S113 of Fig. 1 is described in further detail. The flow for handling hard disk damage is as follows:
When the performance of the hard disk is too low or it has too many bad blocks, so that the hard disk can no longer provide normal service, the hard disk damage handling flow is entered, as shown in step S501.
At step S502, the streaming media service application is notified, and all streaming media services associated with this hard disk are switched to other nodes.
At step S503, the upper-layer storage management module is notified to stop the I/O operations associated with this hard disk.
At step S504, this hard disk is isolated in the storage management system of this node.
At step S505, an alarm is sent to the user, requesting replacement of the hard disk.
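As a non-limiting illustration, the entry condition of step S501 and the ordered actions of steps S502 to S505 can be sketched in Python as follows. The function names, the `io_rate` measure, and the threshold parameters are hypothetical; the description does not fix concrete units or threshold values, and the `actions` list merely records the notifications that a real system would send.

```python
def is_disk_damaged(io_rate, bad_block_count, min_io_rate, max_bad_blocks):
    """S501 entry condition: performance too low or too many bad blocks."""
    return io_rate < min_io_rate or bad_block_count > max_bad_blocks


def handle_disk_damage(disk_name, active_disks, actions):
    """Carry out steps S502 to S505 in order for a damaged hard disk."""
    actions.append(("switch_streams", disk_name))       # S502: move all streams to other nodes
    actions.append(("stop_io", disk_name))              # S503: stop I/O on this disk
    active_disks.discard(disk_name)                     # S504: isolate the disk in this node
    actions.append(("alarm", "replace " + disk_name))   # S505: ask the user to replace it
    return actions
```

The order matters in this sketch: the streams are switched away before I/O is stopped and the disk is isolated, so that service is not interrupted by the isolation itself.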
Those skilled in the art will recognize that the above description of the method of the present invention is given with reference to hard disks. The present invention is, however, not limited to any particular kind of hard disk. For example, the present invention can be applied to various types of hard disks, such as SCSI (Small Computer System Interface) hard disks, SATA (Serial ATA, where ATA stands for Advanced Technology Attachment) hard disks, IDE (Integrated Drive Electronics) hard disks, optical fiber (Fibre Channel) hard disks, and so on.
Specific embodiments of the present invention have been described above with reference to the accompanying drawings. Those skilled in the art will understand, however, that various changes and substitutions can be made to these embodiments without departing from the spirit and scope of the present invention. All such changes and substitutions fall within the scope defined by the claims of the present invention.

Claims (14)

1. A hard disk error detection and fault-tolerance method based on a streaming media service application, characterized in that:
Each I/O operation of the streaming media service application is monitored, wherein the monitoring covers the waiting time of the I/O operation, the result the system returns for the I/O operation, and, for read operations, the result of a fast verification of the data;
The monitoring results of multiple I/O operations within a predetermined duration are aggregated into statistics;
Based on the monitoring result of each I/O operation or on the statistics of the monitoring results of multiple I/O operations, it is determined whether any of the following hard disk errors has occurred in the hard disk: a transient hard disk error, a hard disk bad block, a hard disk stored-content error, or hard disk damage; and
Corresponding fault-tolerant processing is performed in conjunction with the streaming media service application.
2. the method for claim 1 is characterized in that,
The stand-by period of judging each I/O operation, whether whether overtime or each I/O operation failed; Whether wherein, stand-by period and predefined first threshold that each I/O is operated compare, overtime to judge this I/O operation;
When the stand-by period of this time I/O operation, return results overtime or that system operates this time I/O was failure, determine to occur transient error.
3. The method of claim 1 or 2, characterized in that, when it is determined that a transient error has occurred, corresponding fault-tolerant processing is performed for the transient error, the processing comprising:
When the waiting time of a read operation exceeds the preset first threshold, the streaming media service application is notified, and all streaming media services that would read this hard disk are switched to other nodes that store the relevant content;
When a read operation has not timed out but directly returns a failure, the streaming media service application is notified, and only the one stream associated with this read operation is switched to another node;
When the waiting time of a write operation exceeds the preset first threshold, then for the streaming media service associated with the write operation, all data in the buffer waiting to be written to this hard disk is written to another hard disk in this node; and
When a write operation has not timed out but directly returns a failure, a new block on this hard disk is requested and the write operation is performed on it.
4. the method for claim 1 is characterized in that,
Based on when the operation Streaming Media is used, detect instantaneous operating mistake, further whether judgement is failed to the read/write operation of this piece or whether the overtime aggregate-value of the read/write operation of this piece is surpassed second threshold value that sets in advance; And
When described read/write operation failure or overtime aggregate-value are surpassed second threshold value that sets in advance, determine that this piece is a bad piece.
5. The method of claim 1 or 4, characterized in that, when the block is determined to be a bad block, corresponding fault-tolerant processing is performed for the bad block, the processing comprising:
For a bad block that was caused by a write operation failure or timeout and that holds no content, the block is marked as a bad block and is not used again;
For a bad block that was caused by a read operation failure or timeout, the content corresponding to the bad block is recovered.
6. The method of claim 5, characterized in that, for a bad block caused by a read operation failure or timeout, the recovery of the content corresponding to the bad block further comprises:
Allocating a new block on this hard disk; and
Copying a backup of the content corresponding to the bad block, and storing the copied backup in the newly allocated block.
7. The method of claim 6, characterized in that copying a backup of the content corresponding to the bad block further comprises:
If the content of the bad block has a backup on another hard disk within this node, copying the corresponding content from that hard disk;
Otherwise, if the content of the bad block has no backup on another hard disk within this node, determining by lookup another node that stores the content, and copying the corresponding content from that other node.
8. the method for claim 1 is characterized in that, further comprises:
If the stand-by period of this I/O operation is not when having result overtime and that system returns this I/O operation not to be failure;
When the operation streaming media service is used, combine with the Streaming Media application module, the ad-hoc location in each piece of hard disk inserts special check information; And
Based on the check information that is inserted, the content that is read is carried out verification, whether wrong to determine rigid disk storage contents.
9. The method of claim 1 or 8, characterized in that, when a hard disk stored-content error is detected, the streaming media service application is notified, the stream associated with this read operation is switched to another node, and a data recovery process is carried out.
10. The method of claim 9, characterized in that the data recovery process further comprises:
Looking up the node locations that store the other backups of the media file;
Sending a file copy recovery request to a node that stores the media file; and
Copying the corresponding data, and writing the copied data back to the same block.
11. the method for claim 1 is characterized in that, further comprises:
In described predetermined lasting time, whether the I/O performance of judging hard disk crosses low or bad piece is too much;
When the I/O performance that detects hard disk is lower than predefined the 3rd threshold value and can't continue to serve the time, perhaps when the quantity of the bad piece of hard disk surpasses predefined the 4th threshold value, determine hard disk corruptions.
12. The method of claim 1 or 11, characterized in that it further comprises:
Based on each I/O operation, calculating the average read performance of the hard disk over a period of time; and
Periodically feeding the I/O performance back, within the permitted range, to the streaming media service application module.
13. The method of claim 1 or 11, characterized in that, when the hard disk is determined to be damaged:
The streaming media service application is notified to switch all associated streaming media services, the upper-layer storage management module is notified to stop the I/O operations associated with this hard disk, this hard disk is isolated in the storage management system of this node, and an alarm is sent to the user, requesting replacement of the hard disk.
14. A hard disk used with the method of claim 1, the hard disk comprising any one of a SCSI hard disk, a SATA hard disk, an IDE hard disk, or an optical fiber (Fibre Channel) hard disk.
CN200710161212A 2007-09-25 2007-09-25 Hard disc error detection and fault-tolerant method in stream media uses Expired - Fee Related CN100595839C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200710161212A CN100595839C (en) 2007-09-25 2007-09-25 Hard disc error detection and fault-tolerant method in stream media uses

Publications (2)

Publication Number Publication Date
CN101127233A true CN101127233A (en) 2008-02-20
CN100595839C CN100595839C (en) 2010-03-24

Family

ID=39095236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200710161212A Expired - Fee Related CN100595839C (en) 2007-09-25 2007-09-25 Hard disc error detection and fault-tolerant method in stream media uses

Country Status (1)

Country Link
CN (1) CN100595839C (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2601452Y (en) * 2002-09-06 2004-01-28 北京曙光天演信息技术有限公司 Warm plug-in hard disk cartridge with monitoring setup
CN2658829Y (en) * 2003-11-07 2004-11-24 联想(北京)有限公司 Movable storage device
CN100349143C (en) * 2004-05-26 2007-11-14 仁宝电脑工业股份有限公司 Method for making multiple main partitions in IDE hand disks
CN100498961C (en) * 2004-07-01 2009-06-10 华为技术有限公司 Hard disc detecting device and method

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902608A (en) * 2011-07-25 2013-01-30 技嘉科技股份有限公司 Method and system of detection and data transfer for disk arrays
CN102750213A (en) * 2012-06-18 2012-10-24 深圳市锐明视讯技术有限公司 Disc detecting and processing method and detecting and processing system
CN102750213B (en) * 2012-06-18 2016-04-06 深圳市锐明技术股份有限公司 Disk detects, disposal route and detection, disposal system
CN103631721A (en) * 2012-08-23 2014-03-12 华为技术有限公司 Method and system for isolating bad blocks in internal storage
CN103942112B (en) * 2013-01-22 2018-06-15 深圳市腾讯计算机系统有限公司 Disk tolerance method, apparatus and system
CN103942112A (en) * 2013-01-22 2014-07-23 深圳市腾讯计算机系统有限公司 Magnetic disk fault-tolerance method, device and system
CN105556460A (en) * 2013-08-29 2016-05-04 慧与发展有限责任合伙企业 Error display module
CN105528315B (en) * 2014-09-28 2018-08-14 华为数字技术(成都)有限公司 A kind of hard disk IO manufacture timeout control methods and device
CN105528315A (en) * 2014-09-28 2016-04-27 华为数字技术(成都)有限公司 Hard disk IO timeout control method and apparatus
CN105468484A (en) * 2014-09-30 2016-04-06 伊姆西公司 Method and apparatus for determining fault location in storage system
US10346238B2 (en) 2014-09-30 2019-07-09 EMC IP Holding Company LLC Determining failure location in a storage system
CN105468484B (en) * 2014-09-30 2020-07-28 伊姆西Ip控股有限责任公司 Method and apparatus for locating a fault in a storage system
CN105578125B (en) * 2014-11-11 2019-10-18 华为数字技术(成都)有限公司 A kind of video monitoring method and device
CN105578125A (en) * 2014-11-11 2016-05-11 华为数字技术(成都)有限公司 Video monitoring method and device
CN104572353B (en) * 2015-01-21 2018-01-09 浪潮(北京)电子信息产业有限公司 A kind of standby fusion management method and system of calamity
CN104572353A (en) * 2015-01-21 2015-04-29 浪潮(北京)电子信息产业有限公司 Disaster recovery fusion management method and system
CN104866411A (en) * 2015-06-08 2015-08-26 北京奇虎科技有限公司 Monitoring and analyzing method and device for solid state disks
CN106970851A (en) * 2016-01-14 2017-07-21 阿里巴巴集团控股有限公司 Method and apparatus for disk detection process in distributed file system
CN105808162B (en) * 2016-02-26 2019-04-23 四川效率源信息安全技术股份有限公司 A kind of method that intelligent control defect storing data is read
CN105808162A (en) * 2016-02-26 2016-07-27 四川效率源信息安全技术股份有限公司 Method for intelligently controlling defective storage data reading
CN107273231A (en) * 2016-04-07 2017-10-20 阿里巴巴集团控股有限公司 Distributed memory system hard disk tangles fault detect, processing method and processing device
CN106227467B (en) * 2016-07-18 2019-07-26 烟台蓝洋电子科技有限责任公司 The read-write drive system and method for storage equipment based on ATA
CN106227467A (en) * 2016-07-18 2016-12-14 烟台蓝洋电子科技有限责任公司 The read-write drive system of storage device based on ATA and method
WO2018040115A1 (en) 2016-09-05 2018-03-08 Telefonaktiebolaget Lm Ericsson (Publ) Determination of faulty state of storage device
EP3507695A4 (en) * 2016-09-05 2020-05-06 Telefonaktiebolaget LM Ericsson (publ) Determination of faulty state of storage device
US11977434B2 (en) 2016-09-05 2024-05-07 Telefonaktiebolaget Lm Ericsson (Publ) Determination of faulty state of storage device
CN108108131A (en) * 2017-12-29 2018-06-01 北京联想核芯科技有限公司 A kind of data processing method and device of SSD hard disks
CN109445707A (en) * 2018-11-01 2019-03-08 新疆凯力智慧电子科技有限公司 Processing method and system when a kind of hard disk write operation fails
CN113741815A (en) * 2021-08-25 2021-12-03 苏州浪潮智能科技有限公司 Control method, device and equipment of storage system and readable storage medium
CN113741815B (en) * 2021-08-25 2023-06-13 苏州浪潮智能科技有限公司 Storage system management and control method, device and equipment and readable storage medium

Also Published As

Publication number Publication date
CN100595839C (en) 2010-03-24

Similar Documents

Publication Publication Date Title
CN100595839C (en) Hard disc error detection and fault-tolerant method in stream media uses
CN100504795C (en) Computer RAID array early-warning system and method
US7877632B2 (en) Storage control apparatus and failure recovery method for storage control apparatus
CN101887351B (en) Fault-tolerance method and system for redundant array of independent disk
CN100530125C (en) Safety storage method for data
US7210071B2 (en) Fault tracing in systems with virtualization layers
CN110750213A (en) Hard disk management method and device
CN1746854A (en) The device, method and the program that are used for control store
JP2005322399A (en) Maintenance method of track data integrity in magnetic disk storage device
KR20060043873A (en) System and method for drive recovery following a drive failure
CN103207820A (en) Method and device for fault positioning of hard disk on basis of raid card log
CN103136075A (en) Disk system, data retaining device, and disk device
WO2015063889A1 (en) Management system, plan generating method, and plan generating program
US8782465B1 (en) Managing drive problems in data storage systems by tracking overall retry time
US8423776B2 (en) Storage systems and data storage method
US20070234107A1 (en) Dynamic storage data protection
US20130212429A1 (en) Storage device replacement method, and storage sub-system adopting storage device replacement method
US20090319822A1 (en) Apparatus and method to minimize performance degradation during communication path failure in a data processing system
CN108170375B (en) Overrun protection method and device in distributed storage system
CN106990918A (en) Trigger the method and device that RAID array is rebuild
CN102446123B (en) Method and device for processing SCSI sensing data
JP2001522089A (en) Automatic backup based on disk drive status
JP2000132413A (en) Error retry method, error retry system and its recording medium
CN105760261A (en) Business IO (input/output) processing method and device
EP2616938B1 (en) Fault handling systems and methods

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: UTSTARCOM (CHINA) INCORPORATED

Free format text: FORMER OWNER: UT STARCOM COMMUNICATION CO., LTD.

Effective date: 20121218

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 310053 HANGZHOU, ZHEJIANG PROVINCE TO: 100027 HAIDIAN, BEIJING

TR01 Transfer of patent right

Effective date of registration: 20121218

Address after: 100027, Beijing, Haidian District, Huayuan Road No. 4 Tong Heng building, room B07

Patentee after: UTSTARCOM (CHINA) CO.,LTD.

Address before: 310053 No. six, No. 368, Binjiang District Road, Zhejiang, Hangzhou

Patentee before: UTSTARCOM TELECOM Co.,Ltd.

CP01 Change in the name or title of a patent holder

Address after: Room B07, Tongheng Building, 4 Garden Road, Haidian District, Beijing

Patentee after: UT Starcom (China) Co.,Ltd.

Address before: Room B07, Tongheng Building, 4 Garden Road, Haidian District, Beijing

Patentee before: UTSTARCOM (CHINA) CO.,LTD.

TR01 Transfer of patent right

Effective date of registration: 20190416

Address after: 518000 Lenovo Building, No. 016, Gaoxin Nantong, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province, on the east side of the third floor

Patentee after: UD NETWORK CO.,LTD.

Address before: Room B07, Tongheng Building, 4 Garden Road, Haidian District, Beijing

Patentee before: UT Starcom (China) Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100324