Method of real-time disk-scheduling, disk-scheduler, file system, data storage system and computer program product
The invention relates to a method of real-time disk-scheduling wherein pending requests to the disk, to be effectively processed, are arranged by taking for a request into account an estimation of disk service time under a given condition of a performance behavior of the disk. Further the invention relates to a disk-scheduler wherein pending requests to the disk are arranged to be effectively processed in real-time by taking for a request into account an estimation of service time of the disk under a given condition of a performance behavior of the disk. The invention also leads to a file system, a data storage system and a computer program product.
For real-time hard disk data processing, a real-time file system is usually to be used. Conventional data processing systems are adapted to aim for maximum data integrity, for instance by completing a command only once it has been properly executed. Traditional data-orientated data processing systems have no real-time requirements, and aiming for maximum data integrity is properly applied primarily for a traditional type of data, like information technology data, which are reliability-critical data. For such information technology data and/or reliability-critical data, a device should not return errors to the host regarding a transfer of information technology data until all possible data recovery procedures have been exhausted.
However, such a concept has major disadvantages for streaming of data like audio-video data, which demand high processing performance and effectiveness. The layout of data, a host, a periphery device and a file system usually have to be adapted to a concept of real-time processing. Stream data, such as audio-video data, are to be processed within certain time limits. A stream is the time-based transfer of data to or from a device. Stream data is defined as time-critical data, unlike the above defined information technology data. Therefore, data streaming has to fulfill a variety of different requirements. A stream is composed of one or more allocation units. An allocation unit is the smallest logically contiguous group of blocks on a storage medium. Each unit may be accessed by one or more requests to the disk. All logical block addresses associated with a single allocation unit are logically contiguous. The number of logical block addresses in the requests and allocation units is important in order to maintain the stream data rate. For instance, the request should be chosen such that the stream transfer rate requirements of a storage medium or a disk are not affected.
Within such a concept, stream data and, in particular, audio-video data have a time-based orientation. Such data should be delivered within certain time constraints. Otherwise, such data in principle become valueless. For example, late delivery of data for a new video frame during a movie playback will cause visible artifacts such as skipping, audio noise or video frame corruption. Whereas information technology data may be delayed without visible results, stream data have to be processed with regard to strict time-orientation. In typical real-time systems or applications the quality of data is therefore of secondary interest. Instead, the timeliness of delivery of the data is of paramount importance.
For the purpose of real-time requirements a real-time file system usually comprises a hard-disk request scheduler. The hard-disk request scheduler arranges requests from the file system in a proper way such that the requests are effectively processed. In particular, it is to be guaranteed that real-time streaming performance requirements are met. For example, within a single request it should be guaranteed that no seeks are needed. Only between two subsequent requests or accesses may a seek be allowed without affecting the stream transfer rate requirements. For such purpose, a seek and a rotational latency of a disk should be taken into account. Therefore, a hard-disk scheduler takes into account for a request an estimation of service time of the disk under a given condition of a performance behavior of the disk. In general a hard-disk request scheduler predicts a nominal service time for each request. This in particular means that the actual service time of a hard-disk request should not exceed the service time estimated by the scheduler.
A contemporary real-time file system is in principle described in WO 01/25892 A1. Therein a bandwidth allocator is utilized to allocate a bandwidth of the storage system or disk. This information is used to order requests to the disk in a guaranteed rate queue according to a deadline. Further requests are ordered in a non-rate-guaranteed queue according to a priority.
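The two-queue ordering described above can be illustrated with a minimal Python sketch. The class name, the method names and the dispatch policy (serving the guaranteed rate queue before the non-rate-guaranteed queue) are assumptions made for illustration only; they are not taken from the cited document.

```python
import heapq
import itertools


class TwoQueueScheduler:
    """Hypothetical sketch: one queue ordered by deadline, one by priority."""

    def __init__(self):
        self._rate_guaranteed = []          # min-heap keyed by deadline
        self._non_rate_guaranteed = []      # min-heap keyed by negated priority
        self._counter = itertools.count()   # tie-breaker, keeps insertion order

    def submit_guaranteed(self, request, deadline):
        heapq.heappush(self._rate_guaranteed,
                       (deadline, next(self._counter), request))

    def submit_best_effort(self, request, priority):
        # Higher priority values are served first, hence the negation.
        heapq.heappush(self._non_rate_guaranteed,
                       (-priority, next(self._counter), request))

    def next_request(self):
        # Assumed policy: rate-guaranteed requests always go first.
        if self._rate_guaranteed:
            return heapq.heappop(self._rate_guaranteed)[2]
        if self._non_rate_guaranteed:
            return heapq.heappop(self._non_rate_guaranteed)[2]
        return None


scheduler = TwoQueueScheduler()
scheduler.submit_best_effort("thumbnail read", priority=1)
scheduler.submit_guaranteed("video frame 2", deadline=40.0)
scheduler.submit_guaranteed("video frame 1", deadline=20.0)
scheduler.submit_best_effort("index update", priority=5)
order = [scheduler.next_request() for _ in range(4)]
```

In this sketch the earliest deadline is dispatched first, and best-effort requests are only served once no rate-guaranteed request is pending.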
However, known contemporary approaches to estimate a service time of a disk are restricted to non-real-time applications. Such approaches are basically known from US 6,260,108 B1 regarding read requests and from US 5,854,941 regarding access requests, both for non-real-time applications.
Concepts for service time estimations which are specifically adapted for real-time applications have not yet been reported. A major problem arises as contemporary concepts of real-time disk-scheduling do not take into account disk errors for an estimation of service time of the disk. A disk error may arise from physical errors on a hard-disk. Physical errors may be handled according to a prior art defect management method, as e.g. known from GB 2 285 166 A, relying on an allocation and re-allocation scheme, sometimes referred to as a skip-and-slip-scheme, for erroneous sectors on a hard-disk. Within defect management methods of such kind, errors may be concealed or corrected by writing data which are originally scheduled to be written to a defective sector of the hard-disk drive to some other area of the hard-disk drive. In this sense the erroneous sectors are referred to as re-assigned sectors. Applying such a conventional scheme has certain disadvantages. In particular, as a data transfer head usually has to perform track-switching or a seek to a remote spare sector, or at least is not able to read from an originally scheduled sequence of physical block addresses, this will usually result in a performance penalty of the data storage disk. Such a performance penalty can also arise from erroneous data which may not have been successfully written to the disk, for instance due to time restrictions. This may result in the data in the stream stored on the disk being basically undefined. The data may be either erroneous or belong to a totally different stream. All these cases referenced above and cases of similar kind, i.e. all cases wherein data have not been written to the disk properly as predetermined by a host, are referred to as disk errors in the following.
Contemporary methods of disk-scheduling for an estimation of service time of the disk do not take into account disk errors like physical errors, re-assigned sectors or erroneous data, basically because either the locations of such disk errors on the data storage disk are not known or the performance penalty due to such disk errors is not known.
As a result of disk errors, a contemporary scheduler's nominal estimation of the service time of a hard-disk request is in general too optimistic. The reason is that disk errors are neglected. This usually means that a data storage disk request will not meet its deadline and the requirement of real-time streaming guarantees will not be met. This may result in buffer over- or underflows. In particular, in real-time and audio-video recording applications this may result in a substantial data loss and a reduction of the quality of streaming data.
This is where the invention comes in, the object of which is to provide a method and an apparatus capable of improving an estimation of a service time of a data storage disk and/or data storage system to achieve a more realistic prediction of the service time of a disk request and thereby increasing a performance efficiency of the disk and/or the data storage system.
As regards the method, the object is achieved by a method as referenced in the introduction, wherein in accordance with the invention the estimation is performed on basis of a disk model describing the performance behavior of the disk and the estimation takes into account a reduction of the performance behavior of the disk caused by a disk error.
It has been realized by the invention that, by applying the proposed concept, disk request deadline violations resulting in buffer over- and underflows are avoided even when disk errors like physical errors, re-assigned sectors or erroneous data, which usually result in a performance penalty, are accessed. The use of such performance penalty information in the course of a real-time disk-scheduling concept results in a more realistic and more reliable estimation of the performance behavior of the disk and therefore in a more reliable and more realistic prediction of the service time of the disk. An essential advantage of the proposed concept is that the scheduler is able to guarantee a real-time performance of a disk more reliably with regard to streaming data, whereas in prior art concepts buffer over- and underflows often result from incorrect time estimations. The proposed concept allows the performance efficiency of the disk and/or a data storage system to be increased. In particular, a host system's performance is also advantageously affected.
Developed configurations of the inventive method are further outlined in the dependent claims.
Most preferably the estimation is performed by means of a disk-scheduler. The disk scheduler is advantageously part of a real-time file system, in particular in a file system layer of a real-time file system. For the estimation the disk-scheduler may use the disk model. The applied disk model should be able to describe the complete behavior of the hard-disk. Such model may comprise several time parameters like a host and a device queue time, which mainly take into account the time that a request waits in a host or a disk while previous requests are being served. Further mechanical parameters of the disk, in particular the disk head, may be implemented, like a seek-time or a jump-time. Time parameters taking into account the rotation of the disk may be implemented as a rotational latency time or a
rotational transfer time. Furthermore, bus delays like a bus busy time and a bus transfer time may also be implemented to provide a complete model for the performance behavior of the hard disk and its periphery. By further taking into account the reduction of the performance behavior of the disk caused by a disk error according to the proposed concept, a more realistic and more reliable estimation of the service time is obtained. When a request is issued, the proposed concept is able to predict a more reliable service time for such a request.
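As an illustration of such a disk model, the following Python sketch adds up the time parameters named above and then adds a penalty term for disk errors. The class name, the field names and all numeric values are hypothetical placeholders chosen for the example; a real model would be parameterized from the drive in question.

```python
from dataclasses import dataclass


@dataclass
class DiskModel:
    """Hypothetical nominal timing parameters, all in milliseconds."""
    queue_ms: float = 0.5                # host and device queue time
    seek_ms: float = 8.0                 # seek-time or jump-time of the head
    rotational_latency_ms: float = 4.0   # waiting for the sector to arrive
    transfer_ms_per_sector: float = 0.01 # rotational transfer time per sector
    bus_ms: float = 0.2                  # bus busy time plus bus transfer time

    def nominal_service_time(self, sectors: int) -> float:
        """Service time predicted without considering disk errors."""
        return (self.queue_ms + self.seek_ms + self.rotational_latency_ms
                + sectors * self.transfer_ms_per_sector + self.bus_ms)

    def service_time(self, sectors: int, penalty_ms: float = 0.0) -> float:
        """Nominal time plus the performance reduction caused by disk errors."""
        return self.nominal_service_time(sectors) + penalty_ms
```

With these placeholder values, a 100-sector request has a nominal estimate of about 13.7 ms, and a known penalty of 2 ms raises the estimate to about 15.7 ms.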
As in most cases neither the location of a disk error nor the performance penalty due to the disk error is known, it is particularly preferred in a further developed configuration of the method that in a first step the disk error is localized and in a further step the performance penalty due to the disk error is determined, in order to determine the reduction of the performance behavior of the disk due to the disk error. Such reduction of the performance behavior of the disk may subsequently be taken into account for the estimation of service time of the disk as outlined above.
As regards a localization of a disk error, this may be achieved in several advantageous ways. A disk error may be localized either by a specific measurement or in the course of an access to the disk, e.g. a read or a write access. A write access is particularly advantageous, as an occurring error may be detected already when writing to the disk. Such an error is then already known when reading the disk. Thereby further delays are largely prevented. Also it may be advantageous to take the logical block address of the disk error from a log-list provided by the disk drive. It is particularly preferred that the logical block address of the disk error is identified and advantageously registered at a file system layer.
As regards the performance penalty of the disk error, the penalty may be determined either by measurement or calculation, the latter preferably on basis of properly provided data. It is particularly preferred to localize the disk error and/or to determine a performance penalty due to the disk error in the course of a single measurement wherein one or more accesses to the disk are made. Further developed configurations thereof are outlined in the dependent claims 9 to 12. Preferred embodiments of these and other configurations of the proposed method are outlined in the detailed description.
As regards the apparatus, the object is achieved by a disk-scheduler as mentioned in the introduction, wherein according to the invention the disk-scheduler comprises:
- means for performing the estimation on basis of a disk model describing the performance behavior of the disk, and
- means for providing information on a reduction of the performance behavior of the disk caused by a disk error.
The means for performing the estimation may be any kind of software code section capable of calculating, on basis of a disk model, a service time for a pending request to the disk to thereby describe the performance behavior of the disk.
The means for providing information may be any kind of data storage, data delivery or software means capable of making the information known from disk error localization and performance penalty determination available to the means for performing estimation of service time of the disk under the given performance behavior of the disk. The invention also leads to a file system comprising such a disk-scheduler.
Also the invention leads to a data storage system, like e.g. a host, comprising a disk or a disk system and a file system as outlined above, wherein a device driver for communication between the disk and the file system is provided.
In a further preferred configuration the data storage system comprises an application layer on top of the file system layer wherein an application programming interface for communication between the application and the file system is provided.
The invention further leads to a computer program product storable on a medium readable by a computer system, comprising a software code section which induces the computer system to execute the inventive method as proposed when the product is executed on the computer system.
The invention will now be described in detail with reference to the accompanying drawing. The detailed description will illustrate and describe what is considered as a preferred embodiment of the invention. It should, of course, be understood that various modifications and changes in form or detail could readily be made without departing from the spirit of the invention. It is therefore intended that the invention may not be limited to the exact form and detail shown and described herein, nor to anything less than the whole of the invention disclosed herein and as claimed hereinafter. Further the features described in the description, the drawing and the claims disclosing the invention may be essential for the invention considered alone or in combination.
The Figure of the drawing illustrates in:
Figure 1: A scheme of a data storage system adapted for real-time application wherein a performance penalty due to a disk error is determined and provided to a hard-disk-drive scheduler and further is taken into account in the course of an estimation of service time of the disk under a given condition of a performance behavior of the disk.
Figure 1 shows a preferred embodiment of a layered architecture of a data storage system having an application layer 1 which uses a real-time file system 2a to access a hard disk 4 via a device driver 3. The real-time file system 2a comprises a file system layer 2 for communication with two application programming interfaces 5, 6. One application programming interface 5 is adapted mainly for non-real-time applications, i. e. for best-effort PC-like file access, that provides normal file access functions like OPEN, CLOSE, READ, WRITE, SEEK, etc. to handle information technology data. A second application programming interface 6 is adapted for treating files as real-time streams by means of functions to access files as real-time streams with, for example audio-visual data, like START_STREAM, STOP_STREAM, PAUSE_STREAM, etc. to handle stream data. Further the file system 2 creates requests and issues these requests to a hard-disk request scheduler 7. The scheduler 7 determines the type of request, i. e. best-effort or real-time. Further the hard-disk drive-scheduler 7 arranges and orders the pending requests to the disk such that they are effectively processed. Also the scheduler predicts a nominal service time of the disk wherein an estimation of service time is performed on basis of a disk model describing the performance behavior of the disk and the estimation takes into account a reduction of the performance behavior of the disk caused by a disk error. The disk errors are identified and registered either in the file system layer 2a or in the device driver layer 3. They are reported to the scheduler 7 by the disk 4 via the driver 3. Read and write accesses and respective communication of requests and information are indicated by arrows in Figure 1. In the preferred embodiment realistic service time predictions are made by the scheduler. 
Consequently, the file system and/or the data storage system is capable of meeting the real-time requirements within a deadline predetermined according to the service time predictions.
As a basis of the outlined estimation in the preferred embodiment a disk error is localized on the disk 4 and a performance penalty due to the disk error is determined by the scheduler 7.
In a first modification of the preferred embodiment regarding the localization of a disk error, the disk error is localized by a specific measurement wherein one or more accesses to the disk are made. Such a specific measurement may be made once, merely for the purpose of localizing a disk error and identifying the logical block address of the disk error.
In a further modification of the preferred embodiment regarding the localization of a disk error, such a measurement may also be performed during or immediately subsequent to an access to the disk, either a write or a read access. The logical block address of disk errors and/or other relevant data may be registered by any means such that the registered data are accessible and available for the system and may be provided to the scheduler in a proper and efficient way.
One possibility is, for instance, to register the disk error at a file system layer, preferably by means of the disk-scheduler. Another possibility is to register this information in a log file storable on a storage medium or storage device, e.g. on the disk. Such a log file could be read by the scheduler if information about a disk error is requested. A particularly preferred further possibility is that the logical block address of the disk error is taken from a log-list already provided by the disk drive. Such a log-list is, for instance, provided by a hard-disk supporting the streaming feature set as part of the ATA-standard of future hard-disk drives. In particular, a log-list is preferably supported by a feature set specifically adapted for logging. Such a feature set is, for instance, the general purpose logging feature set.
The ATA-standard (AT Attachment Interface) provides a common attachment interface for system manufacturers, system integrators, software suppliers and suppliers of intelligent storage devices with regard to data streaming and real-time requirements. Such a standard includes a packet command feature set implemented by devices commonly known as ATAPI devices. The streaming and/or advantageously the general purpose logging feature set is an optional feature set that allows the host to request delivery of data from a contiguous logical block address range within an allotted time, the priority being placed on the time taken to access the data rather than on the integrity of the data. Such a feature set comprises the mentioned log-list of all sectors and the logical block addresses that have been re-assigned. Therefore, for such future disk drives comprising such an ATA-standard feature set, a specific measurement to localize a disk error and to identify its logical block address is unnecessary, which is a particular advantage of such a modified embodiment.
Advantageously, the scheduler takes all sectors with a performance penalty into account. However, if only a log-list of re-assigned sectors is available, using it will already result in a significant increase of performance efficiency of the hard-disk and the data storage system.
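How a scheduler might use such a log-list can be sketched as follows in Python. The function names, the representation of the log-list as a plain collection of logical block addresses, and the fixed per-sector penalty are assumptions made for illustration; the sketch does not model how the list is actually read from a drive.

```python
def count_reassigned(start_lba, length, reassigned_lbas):
    """Count re-assigned sectors inside the request range [start, start + length)."""
    end_lba = start_lba + length
    return sum(1 for lba in reassigned_lbas if start_lba <= lba < end_lba)


def estimate_with_log_list(nominal_ms, start_lba, length,
                           reassigned_lbas, penalty_ms_per_sector):
    """Add an assumed fixed penalty for every re-assigned sector in the request."""
    hits = count_reassigned(start_lba, length, reassigned_lbas)
    return nominal_ms + hits * penalty_ms_per_sector
```

For example, with a nominal estimate of 10 ms for a request covering logical block addresses 100 to 149, a log-list containing 99, 120, 149 and 150, and an assumed penalty of 2 ms per re-assigned sector, two listed sectors fall inside the request and the estimate becomes 14 ms.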
Once a logical block address of a disk error, which may be any kind of error like erroneous data, re-assigned sectors or physical disk errors, is known, the performance penalty due to the disk error still has to be determined. In the preferred embodiment several modifications are available to determine such a performance penalty. In a first modification of the preferred embodiment regarding performance penalty determination, for instance, the access time for three or more subsequent sectors may be measured, including the disk error and neighboring error-free locations of the disk. Based on such a read access, a worst-case value in access time is estimated and used to estimate a reduction of the performance behavior of the disk. Further modifications, of course, may be implemented without departing from the spirit of the proposed concept. For instance, the estimation of service time may take into account the starting point of a read/write head of the disk and the target address of the head. Further, the load of the disk and the total number of disk errors can be implemented in a proper way. Also one may choose an estimated, measured or calculated value indicating a time to access a location of the disk under a given condition of a certain amount of disk errors.
In a particularly preferred embodiment, a disk error is localized on the disk and a performance penalty due to the disk error is determined by an advantageous method as described in the following. The steps thereof may be executed on an arbitrary predetermined and preliminary range of sectors/blocks on the disk.
In a first step, access is made to a number of sectors/blocks of the range. In particular, a read/write access is made to the number of sectors from /to the disk. In a further step, the service time for this access is measured.
Then in another step the measured service time is compared with an estimated service time as predicted by the estimation performed on basis of a disk model describing the performance behavior of the disk. Here the estimated service time does not yet take into account a reduction of the performance behavior of the disk caused by a disk error and will be referred to as the estimated nominal service time, which will in general be too optimistic.
In still a further step, it is checked whether the measured service time exceeds the estimated nominal service time. If this is the case, then the selected range of sectors might contain a bad sector producing a performance penalty. Here the performance penalty may already be defined as the difference value between the measured service time and the estimated optimistic service time. Also, if preferred, already here the measured service time or a worst-case value thereof may be regarded as the realistic service time and may be taken into account for the estimation as a reduction of the performance behavior of the disk caused by the disk error.
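The four steps above (access, measure, compare, check) can be sketched in Python as follows. The function name and the `measure_ms` callable, which is assumed to perform the access to the given sector range and return the measured service time in milliseconds, are hypothetical.

```python
def detect_penalty(measure_ms, nominal_ms, start, count):
    """Return the performance penalty in ms for a sector range, or 0.0 if none.

    measure_ms(start, count) is an assumed callable that accesses `count`
    sectors beginning at `start` and returns the measured service time.
    nominal_ms is the estimated nominal service time for the same access.
    """
    measured = measure_ms(start, count)   # steps 1 and 2: access and measure
    penalty = measured - nominal_ms       # step 3: compare with the nominal estimate
    if penalty > 0:                       # step 4: does the measurement exceed it?
        return penalty                    # the range might contain a bad sector
    return 0.0
```

A positive return value only indicates a suspect range; a real implementation would probably also allow for a measurement tolerance, which is omitted here for brevity.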
In a more elaborate configuration of the above method, the above referenced steps may be repeated to confirm the pending result that the selected range of sectors might contain a bad sector producing a performance penalty. If, upon repeating the referenced steps, the selected sector range still results in a performance penalty, then there is a high probability that a bad sector indeed exists in the selected sector range.
To localize the disk error, i.e. the bad sector, in principle, two possible ways may be applied. Each one of the possibilities may be chosen advantageously depending on the situation.
In a first possibility at least the above referenced first four steps are repeated and applied to each one of the sectors in the range to thereby identify the bad sector.
In a second possibility a fast binary search may be executed to find the bad sector in the range. A binary search in particular comprises the steps of:
- splitting the selected range of sectors into two halves,
- measuring the service time of one of the halves,
- continuing further processing with the half which is producing the performance penalty,
- repeating the first three steps of the binary search for the half producing the performance penalty,
- repeating the referenced steps of splitting, measuring and continuing with the penalty-producing half until the available half contains merely the bad sector.
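The binary search can be sketched in Python as shown below. The sketch assumes that the range contains exactly one bad sector and that it has already been confirmed to produce a penalty. The simulated disk (`BAD_SECTOR`, `simulated_measure_ms`, `simulated_nominal_ms`) is purely hypothetical and stands in for real measurements and a real disk model.

```python
def find_bad_sector(measure_ms, nominal_ms, start, count):
    """Locate the single bad sector in [start, start + count) by halving.

    measure_ms(start, n) returns the measured service time for n sectors;
    nominal_ms(n) returns the estimated nominal service time for n sectors.
    """
    lo, hi = start, start + count
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Measure only the lower half; if it shows a penalty, continue there.
        if measure_ms(lo, mid - lo) > nominal_ms(mid - lo):
            hi = mid
        else:
            lo = mid   # otherwise the bad sector must be in the upper half
    return lo


# Simulated disk for demonstration: sector 37 costs 2 ms extra.
BAD_SECTOR = 37


def simulated_nominal_ms(n):
    return n * 0.01


def simulated_measure_ms(start, n):
    extra = 2.0 if start <= BAD_SECTOR < start + n else 0.0
    return simulated_nominal_ms(n) + extra


found = find_bad_sector(simulated_measure_ms, simulated_nominal_ms, 0, 64)
```

Compared with testing each sector individually, the number of measurements grows with the logarithm of the range size, at the cost of relying on the single-bad-sector assumption.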
In the measurement of the performance penalty a first access time for accessing the disk error location is compared to a reference time significant for accessing an error free location on the disk to determine a difference-value in access time. In principle the reference time may be a second access time, however, it is more advantageous to take the estimated nominal service time as outlined above. As an example, if the first access time would be 5 milliseconds and the reference time would be 3 milliseconds, the difference value in access time, i.e. the performance penalty, amounts to 2 milliseconds. Also depending on the amount and localization of disk errors such difference value may be used to be taken into account for estimation of the reduction of the performance behavior of the disk.
Once a location of an error is known and once the value is known how much longer it takes to access a disk error location as compared to an error-free location on the disk, this knowledge is used to estimate the service time of the disk on basis of the model describing the whole behavior of the disk. Such a value may be chosen as an average value. Nevertheless, in general a worst-case value is advantageously preferred, as an average value may in certain cases be too optimistic and therefore may negatively affect or even destroy real-time guarantees. In any case, in the preferred embodiment a too optimistic prediction of the deadline for a real-time request is prevented by the above outlined measure, and therefore real-time streaming requirements are guaranteed in a realistic way and are met by the hard-disk drive. Consequently, buffer over- and underflows, which could have resulted in substantial data loss or reduction of audio-video quality, are prevented by the proposed preferred embodiment and modifications thereof.
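The difference between an average and a worst-case penalty value can be made concrete with a short Python sketch. The function names and the sample access times are hypothetical; only the 5 ms and 3 ms pair is taken from the example given in the description.

```python
from statistics import mean


def penalty_ms(access_ms, reference_ms):
    """Difference value between an access to an error location and the reference."""
    return access_ms - reference_ms


def worst_case_penalty(access_times_ms, reference_ms):
    """Largest observed penalty; the conservative choice for real-time guarantees."""
    return max(penalty_ms(t, reference_ms) for t in access_times_ms)


def average_penalty(access_times_ms, reference_ms):
    """Mean observed penalty; may be too optimistic for individual requests."""
    return mean(penalty_ms(t, reference_ms) for t in access_times_ms)
```

With assumed access times of 5.0 ms, 4.0 ms and 6.5 ms against a reference time of 3.0 ms, the penalties are 2.0 ms, 1.0 ms and 3.5 ms. The average is roughly 2.17 ms, whereas the worst case is 3.5 ms; a scheduler budgeting only the average would underestimate the slowest access by more than 1 ms.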
The Real-Time File System 2a or just the scheduler 7 may be implemented in a data storage system 200 as presented in Figure 2, which is an embodiment of the data storage system according to the invention. The data storage system 200 comprises a host processor 201, a ROM memory 202, a harddisk drive system 203, a RAM memory 204, a Direct Memory Access (DMA) unit 205, an I/O controller 206, connected to a multitude of connectors 207 and a disc drive 208.
The method according to the invention and further embodiments of this invention may be performed by the host processor 201. To enable the host processor 201 to perform this method, computer readable code stored in the ROM memory 202 is read by the host processor 201. In this way, the host processor 201 is enabled to optimize the performance of the harddisk drive system 203 in transfer of data between the harddisk drive system 203 and the host processor 201 or between the harddisk drive system 203 and the RAM memory 204 via the DMA unit 205.
Data may be further processed and output via the I/O controller 206 and the multitude of connectors 207 to a printer, monitor, loudspeaker or any other kind of output device. Data to process or to store in the harddisk drive system 203 may also be received via the I/O controller 206 and the multitude of connectors 207 from a keyboard, mouse or any other kind of input device.
In a further embodiment, the computer executable code is stored on a CD-ROM 250, from which data can be read by means of the disc drive 208. The data read from the CD-ROM is stored in the host processor 201.
In summary, disk errors, like erroneous sectors, sector re-assignments and erroneous data on a disk, usually result in a performance penalty of a data storage disk, like a hard-disk. Contemporary concepts of real-time disk scheduling do not take into account such performance penalty for scheduling pending requests to the disk, and therefore a contemporary scheduler's nominal estimation of the service time of a hard-disk is in general too optimistic. Consequently a contemporary hard-disk is very likely to miss its service time restrictions set by the scheduler. Such performance penalty is in general not known in time. The invention proposes a concept of real-time disk-scheduling wherein pending requests to the disk are arranged to be effectively processed by taking for a request into account an estimation of service time of the disk under a given condition of a performance behavior of the disk. It is proposed that the estimation is performed on basis of a disk model describing the performance behavior of the disk, and that the estimation takes into account a reduction of the performance behavior of the disk caused by a disk error. In particular it is proposed to either calculate or measure the performance penalty and take additional time needed for data extraction into account when scheduling extraction of the data from the hard-disk drive. Thereby realistic timing estimates can be made for extracting data from a hard-disk drive system or for storing data to a hard-disk drive system. In particular such information may be required when audio-video data is stored on the hard-disk drive and real-time requirements are to be met when extracting data from a mass-storage medium.