WO2003063423A1 - Pseudorandom data storage - Google Patents

Pseudorandom data storage

Info

Publication number
WO2003063423A1
WO2003063423A1 (PCT/US2003/002282, US0302282W)
Authority
WO
WIPO (PCT)
Prior art keywords
data blocks
data
storage devices
storage
selected subset
Prior art date
Application number
PCT/US2003/002282
Other languages
French (fr)
Other versions
WO2003063423A9 (en)
Inventor
Roger Zimmerman
Cyrus Shahabi
Kun Fu
Shu-Yuen Didi Yao
Original Assignee
University Of Southern California
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University Of Southern California
Publication of WO2003063423A1
Publication of WO2003063423A9

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/12Avoiding congestion; Recovering from congestion
    • H04L47/125Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/13Flow control; Congestion control in a LAN segment, e.g. ring or bus
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/22Traffic shaping
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/24Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2416Real-time traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/26Flow control; Congestion control using explicit feedback to the source, e.g. choke packets
    • H04L47/263Rate modification at the source after receiving feedback
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/29Flow control; Congestion control using a combination of thresholds
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/30Flow control; Congestion control in combination with information about buffer occupancy at either end or at transit nodes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/61Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • H04L65/612Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio for unicast
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/80Responding to QoS

Definitions

  • CM continuous media
  • CM objects such as video and audio files
  • CM servers typically must handle large numbers of simultaneous users.
  • CM servers A common solution for CM servers involves breaking CM objects into fixed-size blocks which are distributed across all the disks in the system.
  • Conventional data placement techniques for CM servers, such as round-robin striping, RAID (Redundant Array of Inexpensive Disks) striping and various hybrid approaches, can be categorized as constrained placement approaches.
  • constrained placement approach the location of a data block is fixed and determined by the placement algorithm.
  • non-constrained placement approach involves maintaining a directory system to keep track of the location of the blocks, thus allowing the blocks to be placed in any location desired and moved as needed. For example, in random placement, a block's location is randomly assigned and then tracked using a directory system.
  • Random placement can provide load balancing by the law of large numbers. Moreover, when performing data access, random placement can eliminate the need for synchronous access cycles, provide a single traffic pattern, and can support unpredictable access patterns, such as those generated by interactive applications or VCR-type operations on CM streams.
  • continuous media systems may be used to provide real-time data, such as video data, audio data, haptic data, avatar data, and application coordination data, to end users.
  • Continuous media systems face a number of challenges.
  • the systems may need to be able to transmit the data from a storage location to a client location so that the client can display, playback, or otherwise use the data in real time.
  • the systems may need to provide streaming video data for real-time display of a movie. If the real-time constraints are not satisfied, the display may suffer from disruptions and delays, termed "hiccups."
  • continuous media clients In order to reduce disruptions, continuous media clients generally include a buffer for storing at least a portion of the media data prior to display to the user.
  • continuous media systems need to deal with large data objects. For example, a two-hour MPEG-2 video with a 4 Megabit per second (Mb/s) bandwidth requirement is about 3.6 Gigabytes (GB) in size.
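  • For reference, the arithmetic behind that figure: 4 Mb/s × 7,200 seconds = 28,800 megabits, or about 3,600 megabytes, which is approximately 3.6 GB.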
  • Available continuous media servers generally fall within one of two categories: single-node, consumer-oriented systems (for example, low-cost systems serving a limited number of users), and multi-node, carrier class systems (for example, high-end broadcasting and dedicated video-on-demand systems).
  • multimedia applications such as news-on-demand, distance learning, and corporate training rely on the efficient transfer of pre-recorded or live multimedia streams between a server computer and a client computer.
  • These media streams may be captured and displayed at a predetermined rate.
  • video streams may require a rate of 24, 29.97, 30, or 60 frames per second.
  • Audio streams may require 44,100 or 48,000 samples per second.
  • An important measure of quality for such multimedia communications is the precisely timed playback of the streams at the client location.
  • VBR variable bit rate
  • VBR compression may result in bursty network traffic and uneven resource utilization when streaming media. Additionally, due to the different transmission rates that may occur over the length of a media stream, transmission control techniques may need to be implemented so that a client buffer neither underflows nor overflows. Transmission control schemes generally fall within one of two categories: they may be server-controlled or client-controlled. Server-controlled techniques generally pre-compute a transmission schedule for a media stream based on substantial knowledge of its rate requirements. The variability in the stream bandwidth is smoothed by computing a transmission schedule that consists of a number of constant-rate segments. The segment lengths are calculated such that neither a client buffer overflow nor an underflow will occur.
  • Server-controlled algorithms may use one or more optimization criteria. For example, the algorithm may minimize the number of rate changes in the transmission schedule, may minimize the utilization of the client buffer, may minimize the peak rate, or may minimize the number of on-off segments in an on-off transmission model.
  • the algorithm may require that complete or partial traffic statistics be known a priori.
  • Client-controlled algorithms may be used rather than server-controlled algorithms.
  • the client provides the server with feedback, instructing the server to increase or decrease its transmission rate in order to avoid buffer overflow or starvation.
  • the present disclosure includes systems and techniques relating to storing data across multiple storage devices using a pseudorandom number generator, and also to various implementations of continuous media systems and systems to transfer media streams between a server and a client.
  • data blocks can be distributed over multiple storage devices according to a reproducible pseudorandom sequence that provides load balancing across the storage devices, and current storage locations of the data blocks can be determined by reproducing the pseudorandom sequence.
  • data blocks can be distributed over multiple storage devices according to a reproducible pseudorandom sequence; a selected subset of the data blocks can be pseudorandomly redistributed, and information describing a storage scaling operation can be saved, in response to initiation of the storage scaling operation; current storage locations can be determined based on the pseudorandom sequence and the saved scaling operation information; and the data blocks can be accessed according to the determined current storage locations.
  • the systems and techniques described can result in high effective storage utilization, and the locations of data blocks can be calculated quickly and efficiently both before and after a scaling operation.
  • a series of remap functions may be used to derive the current location of a block using the pre-scaling location of the block as a basis. Redistributed blocks may still be retrieved, as normal, through relatively low complexity computation.
  • the new locations of blocks can be computed on the fly for each block access by using a series of inexpensive mod and div functions. Randomized block placement can be maintained for successive scaling operations which, in turn, preserves load balancing across the disks, and the amount of block movement can be minimized during redistribution, while maintaining an overall uniform distribution, providing load balancing during access after multiple scaling operations.
  • Scaling can be performed while the storage system stays online. As media sizes, bandwidth requirements, and media libraries increase, scaling can be performed without requiring downtime and without significantly affecting uptime services. This can be particularly advantageous in the CM server context, where data can be efficiently redistributed to newly added disks without interruption to the activity of the CM server, minimizing downtime and minimizing the impact on services of redistribution during uptime.
  • a storage system can be provided with a fully scalable architecture, thus reducing the importance of an initial assessment of the future amount of capacity and/or bandwidth needed. Even after many scaling operations, all blocks can be located with only one disk access. Moreover, groups of disks can be added or removed all at once in a scaling operation, and multiple scaling operations of different types (adds or removes) can be performed over time while still maintaining efficient access times to data. These placement and scaling techniques can also be applied to multiple types of systems, including a system such as described below in this summary and also in connection with FIGS. 3-18. In general, in one aspect, a system includes a plurality of data processing devices, with each data processing device coupled with at least one of a plurality of storage devices to store data.
  • Each of the data processing devices may include a module to retrieve a data segment from one of the coupled storage devices.
  • Each of the data processing devices may include a module to schedule transmission of the data segment to a client in sequence with other data segments.
  • Each of the data processing devices may include a module to transmit the data segment to the client and not to another of the data processing devices.
  • At least one of the data processing devices may include a module to provide control information to transmit a data stream to a client, where the data stream comprises a sequence of data segments.
  • the modules may be implemented in software and/or hardware.
  • the modules may be implemented as circuitry in one or more integrated circuits.
  • Each of the data processing devices may be implemented as one or more integrated circuits; for example, each may be implemented in a central processing unit (CPU).
  • the system may also include a module to place data segments on the storage devices. The data segments may be placed using a round-robin placement technique, a random technique, or a pseudorandom technique.
  • the system may further include one or more network communication devices coupled to the data processing devices.
  • the system may include a local network switch to couple the data processing devices to a network.
  • a method includes receiving a request for a data stream from a client.
  • One of a plurality of nodes may be designated to provide control information for transmitting the data stream to the client.
  • the data stream may be transmitted as a sequence of data segments. Each of the data segments may be transmitted to the client in one or more data packets. Transmitting the data stream to the client may include transmitting a first data segment from a first node to the client according to a scheduler on the first node, and subsequently transmitting a second data segment from a second node to the client according to a scheduler module of the second node. Each of the nodes may include one or more data processing devices.
  • the method may further include transmitting control information from the node designated to provide control information to the first node. At least some of the control information may be provided to the scheduler of the first node. The scheduler may schedule transmission of the first data segment using the control information. The method may also include transmitting a third data segment from the node designated to provide control information.
  • the control information may be provided according to the real-time streaming protocol (RTSP). Data may be transmitted according to the real-time transport protocol (RTP).
  • RTSP real-time streaming protocol
  • RTP real-time transport protocol
  • a system in one aspect, includes a controller module to transmit a request for a data stream to a server having a plurality of nodes.
  • the controller module may be configured to receive the data stream as a sequence of data segments from more than one of the plurality of nodes.
  • the data segments may be received in one or more data packets.
  • the controller may include a real-time streaming protocol (RTSP) module.
  • the controller may include a real-time transport protocol (RTP) module.
  • the system may also include a buffer to store at least some of the data segments.
  • the system may also include a decoder to decode the data.
  • the system may include a module to determine whether there is a gap in the local sequence numbers of received data packets, where the local sequence number indicates the source node of the data packet.
  • the system may include a memory to store local sequence numbers of packets received by the controller.
  • the system may include a module to determine a particular node corresponding to a gap in the local sequence numbers.
  • the system may include a module to send a retransmission request to the particular server node.
  • the module may be included in the controller.
  • the system may further include a user interface module.
  • the system may further include a playback module.
  • the system may further include one or more speakers and/or one or more displays for presenting the data stream to a user.
  • a method includes requesting a first data stream including a first segment of continuous media data to be presented to a user.
  • the method may further include requesting a second data stream, the second data stream including a second segment of different continuous media data, the second segment to be presented to the user in synchronization with the first segment.
  • the method may further include receiving the first segment from a node of a continuous media server, and receiving the second segment from a different node of the continuous media server.
  • the method may further include decoding the first segment and the second segment.
  • the method may further include presenting the decoded first and second segments to a user at substantially the same time.
  • a method may include transmitting a request for a data stream to a server including a plurality of nodes.
  • Each of the plurality of nodes may be configured to store segments of the data stream and to transmit the segments of the data stream in a sequence according to a scheduler module on the respective node.
  • the method may further include receiving a plurality of data packets from the plurality of nodes, each of the plurality of data packets including at least a portion of one of the segments, as well as a local sequence number indicating which of the plurality of nodes transmitted the respective data packet.
  • the method may further include determining whether a data packet was not received by detecting a gap in the local sequence number.
  • the method may further include, if a data packet was not received, determining which of the nodes transmitted the packet that was not received using the local sequence number.
  • the method may further include transmitting a retransmission request to the node that transmitted the data packet that was not received.
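  • As an illustration of the gap-detection and retransmission steps above, the following C sketch tracks the last local sequence number seen per node and asks the originating node to resend any missing packets. It is a minimal sketch, not the patent's implementation; NUM_NODES, packet_t, and request_retransmission() are assumed names introduced only for illustration.

        #include <stdio.h>

        #define NUM_NODES 4                 /* number of server nodes (assumed) */

        /* Hypothetical packet header fields: the node that sent the packet and
           that node's local sequence number. */
        typedef struct {
            int  node_id;                   /* 0 .. NUM_NODES-1 */
            long local_seq;                 /* sequence number local to that node */
        } packet_t;

        static long last_lsn[NUM_NODES];    /* last local sequence number seen per node */

        /* Placeholder for the retransmission request sent back to the specific
           node whose packet is missing. */
        static void request_retransmission(int node_id, long missing_seq)
        {
            printf("request retransmission of packet %ld from node %d\n",
                   missing_seq, node_id);
        }

        /* Called for every received packet: a jump in that node's local sequence
           number indicates lost packets, which are requested from that node only. */
        void on_packet_received(const packet_t *p)
        {
            long expected = last_lsn[p->node_id] + 1;
            for (long s = expected; s < p->local_seq; s++)
                request_retransmission(p->node_id, s);
            if (p->local_seq > last_lsn[p->node_id])
                last_lsn[p->node_id] = p->local_seq;
        }
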
  • various systems and techniques are described for using a multi-threshold buffer model to smooth data transmission to a client.
  • a method includes receiving data such as streaming media data from a server transmitting the data at a first transmission rate. At least some of the received data is stored in a buffer. The buffer level is determined at different times.
  • a first buffer level is determined at a time, and a second buffer level is determined at a later time.
  • the different buffer levels are compared to a plurality of buffer thresholds.
  • the first buffer level and the second buffer level are compared to the buffer thresholds to determine if one or more of the buffer thresholds is in the range between the first buffer level and the second buffer level (where the range includes the first buffer level and the second buffer level).
  • a second server transmission rate may be determined, based on the at least one threshold.
  • the second server transmission rate may be predetermined (e.g., may be chosen from a list), or may be calculated.
  • Information based on the second server transmission rate may be transmitted to the server.
  • the second server transmission rate may be transmitted, or a change in server transmission rate may be transmitted. If the second server transmission rate is not different than the first transmission rate, rate information may or may not be transmitted to the server.
  • the second server transmission rate may be based on a difference between a buffer level and a target buffer level. Different methods may be used to determine second server transmission rates, depending on which threshold is in the range from the first buffer level to the second buffer level. For example, a first calculation method may be used to determine the second server transmission rate if a particular threshold is in the range, while a second calculation method may be used if a different threshold is in the range. Alternately, the second server rate may be calculated for a particular threshold, and may be chosen for a different threshold.
  • the second server transmission rate may be based on one or more predicted future consumption rates. Future consumption rates may be predicted using one or more past consumption rates. Future consumption rates may be predicted using a prediction algorithm. For example, an average consumption rate algorithm, an exponential average consumption rate algorithm, or a fuzzy exponential average algorithm may be used. One or more weighting factors may be used in the prediction algorithm.
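  • The threshold comparison and consumption-rate prediction described in the preceding items can be sketched in C as follows. The threshold values, the weighting factor alpha, and the helper send_rate_update() are illustrative assumptions rather than values or interfaces taken from the patent; the sketch detects whether a threshold lies between two successive buffer-level samples and, if so, derives a new server rate from an exponentially averaged consumption rate.

        #include <stdio.h>

        #define NUM_THRESHOLDS 5

        /* Illustrative intermediate buffer thresholds, in bytes. */
        static const long threshold[NUM_THRESHOLDS] =
            { 64000, 256000, 512000, 768000, 960000 };

        static double avg_rc = 0.0;          /* exponentially averaged consumption rate */
        static const double alpha = 0.25;    /* assumed weighting factor */

        /* Placeholder: report the newly determined transmission rate (bytes/s). */
        static void send_rate_update(double new_rate)
        {
            printf("requesting server rate of %.0f bytes/s\n", new_rate);
        }

        /* Called periodically with the previous and current buffer levels, the
           consumption rate observed over the interval, the target buffer level,
           and the interval length in seconds. */
        void check_thresholds(long prev_level, long curr_level, double observed_rc,
                              long target_level, double interval_s)
        {
            /* Exponential average prediction of the future consumption rate. */
            avg_rc = alpha * observed_rc + (1.0 - alpha) * avg_rc;

            long lo = prev_level < curr_level ? prev_level : curr_level;
            long hi = prev_level < curr_level ? curr_level : prev_level;

            for (int i = 0; i < NUM_THRESHOLDS; i++) {
                if (threshold[i] >= lo && threshold[i] <= hi) {
                    /* A threshold lies in the range [lo, hi]: pick a rate that
                       matches predicted consumption while steering the buffer
                       toward the target level over the next interval. */
                    double new_rate = avg_rc
                        + (double)(target_level - curr_level) / interval_s;
                    if (new_rate < 0.0)
                        new_rate = 0.0;
                    send_rate_update(new_rate);
                    break;
                }
            }
        }
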
  • a method for transmitting data such as continuous media data includes transmitting data at a first transmission rate, receiving a communication from a client including rate change information, and transmitting additional continuous media data at a second transmission rate based on the rate change information.
  • the rate change information may be determined using a plurality of buffer threshold levels.
  • FIG. 1 illustrates pseudorandom data storage and scaling.
  • FIG. 2 is a block diagram illustrating an example operational environment.
  • FIG. 3 is a schematic of a continuous media system.
  • FIG. 4 is a schematic of a different implementation of a continuous media system.
  • FIG. 5 shows a multi-threshold buffer model.
  • FIG. 6A shows a continuous media system having a master-slave configuration.
  • FIG. 6B shows a continuous media system having a bipartite configuration.
  • FIG. 7 is a block diagram of a panoramic video and 10.2 channel audio playback system.
  • FIG. 8 illustrates a unicast retransmission technique.
  • FIG. 9 is a schematic of a system for implementing multi-threshold smoothing.
  • FIG. 10 is a schematic of a buffer model including multiple thresholds.
  • FIG. 11A illustrates arithmetic threshold spacing.
  • FIG. 11B illustrates geometric threshold spacing.
  • FIG. 11C shows the normalized throughput using three threshold spacing schemes.
  • FIG. 12 shows process steps for implementing multi- threshold smoothing.
  • FIG. 13 is a schematic of a fuzzy exponential average algorithm.
  • FIGS. 14A and 14B illustrate membership functions for the variables var and ⁇ cr.
  • FIG. 15 shows the dependence of normalized throughput on the number of thresholds, for three buffer sizes .
  • FIGS. 16A to 16C show the dependence of normalized throughput on the size of the prediction window, for three buffer sizes.
  • FIG. 17 shows the number of rate changes for different threshold numbers and for three buffer sizes.
  • FIGS. 18A through 18C show the number of rate changes for different prediction window sizes, and for three buffer sizes.
  • the systems and techniques described relate to storing data across multiple storage devices using a pseudorandom number generator, and also to various implementations of continuous media systems and the transfer of pre-recorded or live multimedia streams between a server and a client.
  • Pseudorandom data placement can result in an effectively random block placement and overall uniform distribution, providing load balancing, but without requiring maintenance of a large directory system for the data blocks.
  • a uniform distribution means that the storage devices contain similar numbers of blocks. Blocks can be placed according to a pseudorandom sequence generated by a pseudorandom number generator seeded with a selected number.
  • a pseudorandom number X is generated for a data block, and the block can be placed on storage device (X mod N), where N is the total number of storage devices.
  • the pseudorandom numbers in the pseudorandom sequence are effectively random numbers generated by a definite, nonrandom computational process. Because of this, the entire pseudorandom sequence can be regenerated as needed using the pseudorandom number generator and the original seed number. Thus, object retrieval depends on knowing the pseudorandom number generator and seed.
  • a large directory system need not be maintained in order to retrieve the stored data. Such directory systems can become a bottleneck if stored at a central location and can require complex methods of maintaining consistency if stored in multiple locations.
  • a scalable storage architecture allows adding of storage devices to increase storage capacity and/or bandwidth.
  • storage scaling also refers to storage device removals when either capacity needs to be conserved or old disks are retired.
  • the term "disk” refers to storage devices generally and is not limited to particular disk drive technology or machine-readable media, either magnetic or optical based.
  • Redistributing blocks placed in a randomized manner may require less overhead when compared to redistributing blocks placed using a constrained placement technique. For example, with round-robin striping, when adding or removing a disk, almost all the data blocks need to be moved to another disk. In contrast, with a randomized placement, only a fraction of all data blocks need to be relocated. That is, only enough data blocks are moved to fill an appropriate fraction of the new disks. For disk removal, only the data blocks on the removed disk are moved. After such a scaling operation, whether an addition or removal of one or more storage devices, the overall distribution maintains the load balancing characteristic. The redistribution with randomized placement described herein can ensure that data blocks are still essentially randomly placed after disk scaling in order to balance the load on the multiple disks. Thus the storage architecture described can be fully scalable.
  • FIG. 1 illustrates pseudorandom data storage and scaling. Multiple data blocks are distributed over multiple storage devices according to a reproducible pseudorandom sequence at 100.
  • the data blocks can be any related segments of data.
  • the data blocks can be portions of a file or portions of some other data object.
  • the systems and techniques described can be applied in other storage system contexts, such as at a storage management level (e.g., the data blocks can be extents).
  • the pseudorandom sequence may be generated by a standard pseudorandom number generator, such as the generator defined by the standard C language library functions rand and srand. Alternatively, the pseudorandom sequence may be generated by a pseudorandom number generator tailored to a specific application.
  • the pseudorandom sequence provides load balancing across the storage devices. For example, when the data blocks correspond to a single large file, the distribution created by the pseudorandom sequence can result in uniform load on the storage devices over the course of access to the file. In general, the pseudorandom sequence results in the blocks having roughly equal probabilities of residing on any of the storage devices.
  • X_0 is defined as the random number, with range 0...R, generated by a pseudorandom number generator for this block before any scaling operations (the subscript zero represents zero scaling operations).
  • the initial disk number, D_0, in which a block resides can be defined as:
  • Equation 1: D_0 = (X_0 mod N_0), where N_0 is the total number of storage devices after zero scaling operations.
  • the disk number may or may not correspond to a specific disk of that number in a storage system, because various mappings and/or disk offsets may be used to derive a final disk location from the disk number.
  • a function, p_r(), which is defined by the pseudorandom number generator, can be called i times after being seeded with a seed, s, to obtain the number X_0 for block i.
  • the seed, s, is preferably unique for each sequence of related data blocks to be stored (e.g., a unique seed for each file).
  • p_r() returns a b-bit random number in the range of 0...R, where R is 2^b - 1.
  • when seeded with the same seed, p_r() will produce the identical pseudorandom sequence produced previously for that seed. Table 1 lists parameters and definitions used herein.
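  • A minimal sketch of the initial placement just described, using the parameters defined above (seed s, block index i, and N_0 disks). Using the standard C library srand()/rand() pair as p_r() is an illustrative assumption, not a requirement of the technique.

        #include <stdlib.h>

        /* Stand-in for p_r(): one call to the pseudorandom generator. Using the
           standard C library generator is an assumption for illustration only. */
        static unsigned long p_r(void)
        {
            return (unsigned long)rand();
        }

        /* Return D_0, the initial disk for block i (blocks numbered 1, 2, ...) of
           the object whose pseudorandom sequence is seeded with s, given N_0 disks. */
        int initial_disk(unsigned int s, long i, int n0)
        {
            srand(s);                        /* seed the generator with the object's seed s */
            unsigned long x0 = 0;
            for (long k = 0; k < i; k++)     /* call p_r() i times to obtain X_0 for block i */
                x0 = p_r();
            return (int)(x0 % (unsigned long)n0);   /* D_0 = X_0 mod N_0 (Equation 1) */
        }
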
  • Current storage locations of the data blocks are determined by reproducing the pseudorandom sequence at 110. Access to the data blocks, such as in order to provide continuous media data to clients, is performed according to the determined current storage locations.
  • a storage scaling operation is initiated, a selected subset of the data blocks are redistributed, and information describing the storage scaling operation is saved.
  • the information can be saved in a storage structure for recording scaling operations and/or can be saved directly into a module used to determine current storage locations of the data blocks.
  • the storage structure can be small enough to be retained entirely in fast memory, thus not requiring an access to slower media, such as a hard disk, to access the scaling operation information. After each scaling operation, an access function incorporates the saved scaling information.
  • the storage structure can also be used to maintain the seed values and/or the disk location of the first block, if desired.
  • a first block location need not be saved, and the seed values may be calculated from some other saved value.
  • the seed values can be derived from a file name corresponding to the data blocks, such as by using the standard C language library function atoi() to convert the file name to a long unsigned integer.
  • determining the current storage locations can involve computing the current storage locations of the data blocks based on the reproduced pseudorandom sequence and the saved scaling operation information.
  • a scaling operation involves the addition or removal of a disk group, which is one or more storage devices.
  • a scaling operation can be initiated by a system administrator and can be performed while a storage system remains online and operational.
  • moving the data blocks can involve first copying the data blocks to their new locations, switching the storage system from using the previous set of disks to using the new set of disks, and then deleting the original block copies, such as during system idle times.
  • a scaling operation on a storage system with N disks either adds or removes one disk group.
  • the initial number of disks in the storage system is denoted as N_0 and, subsequently, the number of disks after j scaling operations is denoted as N_j.
  • a redistribution function, RF(), redistributes the blocks residing on N_{j-1} disks to N_j disks. Consequently, after scaling operation j, a new access function, AF(), is used to identify the location of a block, since its location might have been changed due to the scaling operation.
  • Scaling up increases the total number of disks, and this means that a minimum fraction (N_j - N_{j-1})/N_j of all the blocks should be moved onto the added disk(s) in order to maintain load balancing across the disks of the storage system.
  • all blocks on the removed disk(s) should be moved and randomly redistributed across remaining disks to maintain load balancing.
  • These block-movements are the theoretical minimum needed to maintain an even load. In the case of scaling up, blocks are only moved from old disk(s) to new disk(s) and in the case of scaling down, blocks are only moved from the removed disk(s) to the non-removed disk(s).
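  • As a concrete illustration (numbers chosen for this example, not taken from the patent): growing from N_{j-1} = 4 disks to N_j = 5 disks requires moving a minimum of (5 - 4)/5 = 20% of all blocks onto the new disk, while removing one of 5 disks requires moving only the roughly 20% of blocks that reside on the removed disk.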
  • the original seed used to reproduce the sequence of disk locations should no longer be used in the same manner to reproduce the blocks' new sequence, because this may result in loss of the effectively random placement after a scaling operation.
  • a new sequence that maintains the overall randomized placement can be derived using a simple computation, with the least possible movement of blocks and the same original seed, no matter how many scaling operations are performed.
  • the new sequence can be derived using a simple computation to maintain the load balancing and random distribution of data blocks with an upper limit on the number of scaling operations allowed before a full redistribution of all data blocks needs to be performed.
  • the problem can be formally stated as:
  • AO1 Access Objective One [AO1]: CPU (Central Processing Unit(s)) and disk I/O (Input/Output) overhead are minimized using a low-complexity function to compute a block location.
  • CPU Central Processing Unit(s)
  • disk I/O Input/Output
  • a new pseudorandom number sequence, X_j, should be used to identify and track new block locations.
  • X_j indicates where blocks should reside after the j-th scaling operation in that the block location is derived from X_j.
  • X_j should be obtainable from the original sequence, X_0.
  • RF() can be designed to compute the new X_j random numbers for every block while maintaining the objectives RO1, RO2 and AO1.
  • the random numbers used to determine the location of each block are remapped into a new set of random numbers
  • the manner in which this remapping is performed depends on whether the scaling operation is an addition of a disk group or a removal of a disk group.
  • new storage locations can be determined for the data blocks based on the addition of one or more storage devices at 120. Then, the selected subset of the data blocks that have determined new storage locations on the one or more added storage devices can be moved at 130. Although new storage locations are determined for all the data blocks, only those data blocks that have newly determined storage locations falling on the added storage device (s) are in the selected subset and are thus moved (i.e., selection of the blocks to be moved for an addition is based on which blocks would fall on the new disk(s) if a full redistribution were performed) .
  • new storage locations can be determined for the selected subset of the data blocks that reside on the one or more storage devices based on the removal of the one or more storage devices at 140. Then, the selected subset of the data blocks can be moved based on the determined new storage locations at 150. The selected subset is made up of all data blocks residing on the storage device (s) being removed (i.e., selection of the blocks to be moved for a removal is based on which blocks currently reside on the disk(s) to be removed) .
  • Both objectives RO1 and RO2 can be maintained even though new storage locations are determined only for those data blocks in the selected subset.
  • the information describing the storage scaling operation can be saved at 160. This information includes how many disks were added or removed, and which disk(s) in the case of a removal. The scaling operation information is then used in the future to determine current storage locations at 110.
  • Each block has a random number, X_j, associated with it, and after a scaling operation, each block has a new random number, X_{j+1}. Because deriving a block location, such as disk D_j, from a random number X_j is straightforward, the discussion below focuses on finding the random number X_j for each block after a scaling operation. New random numbers can be found using a function, REMAP_j, which takes X_{j-1} as input and generates X_j for the scaling operation transition from j-1 to j.
  • REMAP_j a function
  • REMAP_0 corresponds to the original pseudorandom generator function.
  • REMAP functions are used within both the AF() and RF() functions.
  • if disks are added, RF() can apply a sequence of REMAP functions (from REMAP_0 to REMAP_j) to compute X_j for every block on all the disks, which should result in a random selection of blocks to be redistributed to the added disks; and if disks are removed, then RF() can apply a sequence of REMAP functions (from REMAP_0 to REMAP_j) to compute X_j of every block residing only on those removed disks, which should result in a random redistribution of blocks from the removed disks.
  • AF() can apply a sequence of REMAP functions (from REMAP_0 to REMAP_j) to compute X_j.
  • RF() and AF() compute the location of a block from its random number, X_j, such as by using Equation 1. That is, the sequence X_0, X_1, ..., X_j can be used to determine the location of a block after each scaling operation.
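  • The access function AF() can be pictured as the following C sketch, which replays the saved scaling operations and applies one REMAP per operation to a block's original random number. The scale_op_t record and the remap_add()/remap_remove() signatures are hypothetical; concrete remap bodies are sketched after the bounded and randomized equations below, and a real removal remap would also need the identities of the removed disks, omitted here for brevity.

        /* One saved scaling operation: the number of disks afterwards and whether
           disks were removed (a hypothetical record layout). */
        typedef struct {
            int n_disks;        /* N_j: number of disks after operation j */
            int is_removal;     /* nonzero if operation j removed disks   */
        } scale_op_t;

        /* REMAP_j for additions and removals; concrete sketches follow the
           bounded and randomized equations below. */
        unsigned long remap_add(unsigned long x_prev, int n_prev, int n_new);
        unsigned long remap_remove(unsigned long x_prev, int n_prev, int n_new);

        /* AF(): compute the current disk of a block from its original random
           number X_0 and the list of saved scaling operations. */
        int access_function(unsigned long x0, int n0,
                            const scale_op_t *ops, int num_ops)
        {
            unsigned long x = x0;
            int n_prev = n0;
            for (int j = 0; j < num_ops; j++) {
                x = ops[j].is_removal
                      ? remap_remove(x, n_prev, ops[j].n_disks)
                      : remap_add(x, n_prev, ops[j].n_disks);
                n_prev = ops[j].n_disks;
            }
            return (int)(x % (unsigned long)n_prev);   /* D_j = X_j mod N_j */
        }
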
  • the redistribution techniques described can be referred to as SCADDAR (SCAling Disks for Data Arranged Randomly) .
  • the objectives RO1, RO2 and AO1 for SCADDAR can be restated as follows.
  • the REMAP functions should be designed such that:
  • AO1: The sequence X_0, X_1, ..., X_j, and hence D_j, can be generated with low complexity.
  • the design of the REMAP function can determine the overall characteristics of a storage system using these techniques.
  • in a bounded approach to the REMAP function, all of the objectives, RO1, RO2 and AO1, are satisfied for up to k scaling operations.
  • the system can be taken offline and a full redistribution performed to get back to an original pseudorandom distribution state, where all final storage locations are based on X 0 values.
  • in a randomized approach to the REMAP function, all of the objectives, RO1, RO2 and AO1, can be satisfied no matter how many scaling operations are performed.
  • the resulting storage system can be fully scalable.
  • the storage system can be taken offline, and a full redistribution performed, if and as desired, thus providing the system administrator with full flexibility and control.
  • REMAP_j for deriving X_j after a disk group removal during the j-th operation is discussed.
  • REMAP_j for deriving X_j after a disk group addition during the j-th operation is discussed.
  • X_j results after remapping X_{j-1}.
  • X_j should have a different source of randomness from X_{j-1}.
  • REMAP_j uses the quotient q_{j-1} = (X_{j-1} div N_{j-1}) as a new source of randomness even though this results in a smaller range.
  • the shrinking range results in a threshold for the maximum number of performable scaling operations. Equation 3 below defines REMAP_j if scaling operation j is a removal of disk(s), with q_{j-1} = X_{j-1} div N_{j-1} and r_{j-1} = X_{j-1} mod N_{j-1}:
  • Equation 3: X_j = q_{j-1} × N_j + new(r_{j-1}) for case_a, and X_j = q_{j-1} for case_b, where case_a is if r_{j-1} is not removed, and case_b is otherwise, and where the function new() maps from the previous disk numbers to the new disk numbers, taking into account gaps that might occur from disk removals.
  • X_j is constructed to contain two retrievable pieces of information: 1) a new source of randomness used for future operations, and 2) the disk location of the block after the j-th operation. The new source of randomness is provided by q_{j-1}.
  • case_a the block remains in its current location, and thus X_j is constructed using the block's current disk location as the remainder as well as the new source of randomness as the quotient in case of future scaling operations.
  • case_b the block is moved according to the new source of randomness.
  • if scaling operation j is an addition of disk(s), REMAP_j is defined as: X_j = (q_{j-1} div N_j) × N_j + r_{j-1} for case_a, and X_j = (q_{j-1} div N_j) × N_j + (q_{j-1} mod N_j) for case_b, where case_a is if (q_{j-1} mod N_j) < N_{j-1}, and case_b is otherwise.
  • X_j is constructed to contain the new source of randomness as the quotient and the disk location of the block as the remainder.
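  • A C sketch of the bounded REMAP_j reconstructed above, written against the signatures used in the AF() sketch earlier. The new_map() and was_removed() helpers (the new() renumbering of surviving disks and the membership test for removed disks) are assumed to be supplied elsewhere, and the code follows the reconstructed removal and addition equations rather than a verified listing from the patent.

        /* Maps an old disk number to its new number after a removal, closing the
           gaps left by removed disks (assumed to be supplied elsewhere). */
        extern int new_map(int old_disk);
        /* Returns nonzero if old disk number d is removed by this operation. */
        extern int was_removed(int d);

        /* Bounded REMAP_j for a removal of disk(s): reconstructed Equation 3. */
        unsigned long remap_remove(unsigned long x_prev, int n_prev, int n_new)
        {
            unsigned long q = x_prev / (unsigned long)n_prev;        /* q_{j-1}: new randomness */
            int           r = (int)(x_prev % (unsigned long)n_prev); /* r_{j-1}: current disk   */

            if (!was_removed(r))                        /* case_a: block stays on its disk */
                return q * (unsigned long)n_new + (unsigned long)new_map(r);
            return q;                                   /* case_b: block moves to disk (q mod N_j) */
        }

        /* Bounded REMAP_j for an addition of disk(s): reconstructed addition equation. */
        unsigned long remap_add(unsigned long x_prev, int n_prev, int n_new)
        {
            unsigned long q = x_prev / (unsigned long)n_prev;        /* q_{j-1} */
            int           r = (int)(x_prev % (unsigned long)n_prev); /* r_{j-1} */

            if ((long)(q % (unsigned long)n_new) < (long)n_prev)     /* case_a: stays put */
                return (q / (unsigned long)n_new) * (unsigned long)n_new + (unsigned long)r;
            /* case_b: the candidate (q mod N_j) is one of the new disks; move there. */
            return (q / (unsigned long)n_new) * (unsigned long)n_new + q % (unsigned long)n_new;
        }
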
  • the following description covers a randomized approach to the REMAP function, which satisfies the objectives RO1, RO2 and AO1, and can also allow scaling operations to be performed without bound; although in practice, the repeated mod and pseudorandom function calls may eventually require non-negligible computational resources.
  • a pseudorandom number generator is used in conjunction with Xj as the new source of randomness after each scaling operation.
  • the pseudorandom number generator can be same generator used in the original distribution of blocks, or it can be one or more alternative pseudorandom number generators.
  • the same pseudorandom number generator is used throughout, and the current Xj is used to reseed the generator.
  • p_r(X_{j-1}) provides the new source of randomness
  • p_r() should be such that any number in the range 0...R can be used as a seed. This can guarantee a b-bit number is used as the quotient for X_j, regardless of the number of scaling operations performed.
  • the seed used for p_r() and the random number returned are assumed to be independent for practical purposes here.
  • X_{j-1} is used as the seed of the pseudorandom number generator to obtain X_j.
  • REMAP_j for randomized SCADDAR is constructed in a similar fashion as in bounded SCADDAR, except that p_r(X_{j-1}) is used as the quotient. Equations 6 and 7 define REMAP_j for a removal of disk(s) and an addition of disk(s), respectively. Equation 6 (removal): X_j = p_r(X_{j-1}) × N_j + new(r_{j-1}) if r_{j-1} is not removed, and X_j = p_r(X_{j-1}) otherwise. Equation 7 (addition): X_j = p_r(X_{j-1}) × N_j + r_{j-1} if (p_r(X_{j-1}) mod N_j) < N_{j-1}, and X_j = p_r(X_{j-1}) × N_j + (p_r(X_{j-1}) mod N_j) otherwise.
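  • The randomized variant replaces the shrinking quotient with a fresh pseudorandom value obtained by reseeding p_r() with X_{j-1}. The sketch below mirrors the reconstructed Equations 6 and 7; p_r_seeded() is a hypothetical wrapper that seeds the generator and returns one value, and new_map()/was_removed() are assumed as in the bounded sketch.

        #include <stdlib.h>

        /* Reseed the generator with x and return one pseudorandom value; a
           stand-in for "p_r() seeded with X_{j-1}" (illustrative only). */
        static unsigned long p_r_seeded(unsigned long x)
        {
            srand((unsigned int)x);
            return (unsigned long)rand();
        }

        extern int new_map(int old_disk);    /* as in the bounded sketch */
        extern int was_removed(int d);

        /* Randomized REMAP_j for a removal of disk(s): reconstructed Equation 6. */
        unsigned long remap_remove_rand(unsigned long x_prev, int n_prev, int n_new)
        {
            unsigned long p = p_r_seeded(x_prev);                    /* new randomness */
            int           r = (int)(x_prev % (unsigned long)n_prev); /* r_{j-1}        */

            if (!was_removed(r))
                return p * (unsigned long)n_new + (unsigned long)new_map(r);
            return p;                            /* block moves to disk (p mod N_j) */
        }

        /* Randomized REMAP_j for an addition of disk(s): reconstructed Equation 7. */
        unsigned long remap_add_rand(unsigned long x_prev, int n_prev, int n_new)
        {
            unsigned long p = p_r_seeded(x_prev);
            int           r = (int)(x_prev % (unsigned long)n_prev);

            if ((long)(p % (unsigned long)n_new) < (long)n_prev)     /* old disk: stay */
                return p * (unsigned long)n_new + (unsigned long)r;
            return p * (unsigned long)n_new + p % (unsigned long)n_new;  /* move to new disk */
        }
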
  • FIG. 2 is a block diagram illustrating an example operational environment.
  • One or more clients 200 communicate with a storage system 220 over a network 210.
  • the network 210 provides communication links and may be any communication network linking machines capable of communicating using one or more networking protocols, including a local area network (LAN), metropolitan area network (MAN), wide area network (WAN), enterprise network, virtual private network (VPN), the Internet, etc.
  • the storage system 220 may be any machine and/or system that stores information and is capable of communicating over the network 210 with other machines coupled with the network 210.
  • the client (s) 200 may be any machines and/or processes capable of communicating over the network 210 with the storage system 220.
  • the storage system 220 may be a CM server (e.g., a video-on-demand system), and the client(s) 200 may be browser applications.
  • the storage system 220 may also be an interactive visualization system, such as a scientific or entertainment visualization system, a file system and/or a database system.
  • the client (s) 200 request resources from the storage system 220, and the storage system 220 provides requested resources to the client (s) 200, if available.
  • the storage system 220 includes one or more controllers 230 and one or more storage devices 240.
  • the controller (s) 230 may be configured to perform pseudorandom data placement and/or pseudorandom disk scaling, such as described above.
  • the controller (s) 230 may be one or more processing systems, such as one or more general purpose computers programmed to perform the features and functions described. For additional details regarding various example implementations of the storage system 220, as well as other alternative innovations that may be used in the combined system, see FIGS. 3-18 and the corresponding description below.
  • the storage devices 240 may be any storage system that includes discrete storage media that can be accessed separately.
  • the storage devices 240 may be a storage array or a RAID enclosure.
  • the storage devices 240 may be memory devices, either non-volatile memories or volatile memories, or mass storage media (e.g., disk drives, or potentially separate disk zones on a platter), which may be magnetic-based, optical-based, semiconductor-based media, or a combination of these.
  • a storage device 240 includes at least one machine-readable medium.
  • the term "machine-readable medium” refers to any computer program product, apparatus and/or device that could be used to provide information indicative of machine instructions and/or data to the system 220.
  • the systems and techniques described above are tailored to homogeneous and/or logical disks, and thus the storage devices 240 should have similar capacity and bandwidth characteristics.
  • a mapping between the heterogeneous disks and the logical disks can be generated, and the techniques described above can be used with the logical disks.
  • the data blocks can be stored on the heterogeneous disks based on the mapping from the logical disks to the heterogeneous disks.
  • a server 305 includes a number of nodes 310, with each node 310 coupled with one or more data storage devices such as storage disks 320.
  • Each node includes one or more data processing devices, such as one or more CPUs or other data processing circuitry.
  • each node may include a module 311 to retrieve data segments from one of the associated storage disks 320 (e.g., a file input/output module), a module 312 to schedule transmission of the data segments to one or more clients 350 (e.g., a scheduler module), a module 313 to transmit the data segments to clients 350 (e.g., a real-time transport protocol (RTP) module), and optionally a module 314 to provide control information for transmitting a data stream to clients 350 (e.g., a real-time streaming protocol (RTSP) module), where the data stream includes a plurality of data segments stored among nodes 310.
  • modules 311-315 may be implemented at least partially as software.
  • the data segments may include all of a particular block of data stored on a node 310, or a portion of a particular block of data stored on a node 310.
  • Each node 310 may be coupled to a network 340.
  • each node 310 may include a network interface module 315 (e.g., a network interface card (NIC)) for coupling with a network communication device such as a network switch 330 to connect to clients 350 via network 340.
  • NIC network interface card
  • a continuous media system 400 includes a server 405.
  • Server 405 includes four clustered nodes 410-A through 410-D, where each node includes a Dell PowerEdge 1550 Pentium III 866 MHz PC with 256 MB of memory running Red Hat Linux.
  • the continuous media data are stored on four storage devices 420-A through 420-D, which are each 18 GB Seagate Cheetah hard disk drives connected to the server nodes 410-A through 410-D via Ultra 160 SCSI channels.
  • Nodes 410-A through 410-D may communicate with each other and send media data via multiple 100 Mb/s Fast Ethernet Network Interface Card (NIC) connections.
  • NIC Fast Ethernet Network Interface Card
  • Server 405 may include a local network switch 430, which may be a Cabletron 6000 switch coupled with either one or two Fast Ethernet lines.
  • Switch 430 is coupled with a network 440; for example, switch 430 is coupled with both a WAN backbone (to serve distant clients) and a LAN environment (to serve local clients) .
  • An IP-based network may be chosen to keep the per-port equipment cost low and for easy compatibility with the public Internet.
  • Clients such as client 450 of FIG. 4 may be based on a commodity PC platform, and may run, e.g., Red Hat Linux or Windows NT.
  • Client 450 need not be a PC or computer, but may be any device to receive continuous media data for presentation to or use by a user.
  • client 450 may be a personal data assistant (PDA), and network 440 may be a wireless network.
  • PDA personal data assistant
  • Client 450 includes a controller module 451 to enable client 450 to request data and to receive data.
  • controller module 451 may include a Real Time Streaming Protocol (RTSP) controller and a Real-Time Transport Protocol (RTP) controller.
  • Client 450 may also include a user interface 452, a client buffer 453, a playback module 454, and a media decoder 455.
  • Decoder 455 may be coupled with one or more displays 460 and/or one or more speakers 470 for displaying video data and playing back audio data.
  • buffer 453 may be a circular buffer with a capacity denoted by B.
  • a buffer model 500 may include a number of thresholds 510 that may be used to regulate server transmission rates to ensure that buffer 453 neither underflows nor overflows.
  • Buffer 453 reassembles variable bit-rate media streams from data included in packets that are received from the server nodes. Note that the data included in a packet need not be exactly a block of data stored on a particular server node. For example, in some implementations, a continuous media file may be stored among storage devices in blocks of data of a particular size, where the blocks may be significantly larger than the amount of data included in a packet.
  • if the server transmits packets faster than the client consumes them, buffer 453 may exceed its capacity; that is, it may overflow. If the client consumes packets faster than the server transmits them, buffer 453 may empty (underflow or starve). Buffer underflow or overflow may lead to disruption of the presentation of the data to the user.
  • Server-controlled techniques may be used to smooth the consumption rate R_c by approximating it with a number of constant rate segments.
  • algorithms implemented at the server side may need complete knowledge of R_c as a function of time.
  • a client-controlled buffer management technique may be used.
  • a multi-threshold buffer model 500 may be used with buffer 453 of FIG. 4.
  • Buffer model 500 includes a plurality of buffer levels 510, including an overflow threshold 510-O, an underflow threshold 510-U, and may include a plurality of N intermediate thresholds 510-1 through 510-N.
  • the client uses one or more of thresholds 510 to determine an appropriate server sending rate, and then forwards server sending information to the server.
  • Client- controlled buffer management techniques include pause/resume flow control techniques, and multi-threshold flow control techniques.
  • the inter-packet delivery time δr is used by schedulers included in nodes 410-A to 410-D to transmit packets to client 450.
  • schedulers use the Network Time Protocol (NTP) to synchronize time across nodes 410-A through 410-D.
  • NTP Network Time Protocol
  • nodes 410-A through 410-D send packets in sequence at δr time intervals.
  • Client 450 fine-tunes the δr delivery rate by updating server 405 with new δr values based on the amount of data in buffer 453.
  • Fine tuning may be accomplished, for example, by using one or more additional intermediate watermarks such as watermarks 510-1 and 510-N of FIG. 5.
  • a corresponding δr speedup or slowdown command is sent, with the goal of preventing buffer starvation or overflow.
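  • The watermark-driven fine tuning can be pictured as a small client-side check, sketched in C below. The watermark levels and the 10% speedup/slowdown step applied to the inter-packet delivery time are illustrative assumptions, and send_delta_update() is a hypothetical stand-in for the client-to-server control message.

        #include <stdio.h>

        /* Illustrative watermarks, as fractions of the buffer capacity B. */
        #define LOW_WATERMARK   0.25
        #define HIGH_WATERMARK  0.75

        /* Placeholder: report a new inter-packet delivery time to the server. */
        static void send_delta_update(double new_delta_s)
        {
            printf("new inter-packet delivery time: %f s\n", new_delta_s);
        }

        /* Called when the buffer level changes; delta_s is the current
           inter-packet delivery time in seconds.  Returns the (possibly
           updated) delivery time. */
        double adjust_delivery(double buffer_level, double capacity_b, double delta_s)
        {
            double fill = buffer_level / capacity_b;

            if (fill < LOW_WATERMARK) {
                delta_s *= 0.9;              /* speed up: packets sent more often  */
                send_delta_update(delta_s);
            } else if (fill > HIGH_WATERMARK) {
                delta_s *= 1.1;              /* slow down: packets sent less often */
                send_delta_update(delta_s);
            }
            return delta_s;                  /* unchanged between the watermarks */
        }
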
  • Buffer 453 is used to smooth out any fluctuations in network traffic or server load imbalance, which could lead to display/playback disruptions.
  • client 450 may control the delivery rate of received data to achieve smoother delivery, prevent bursty traffic, and keep a fairly constant buffer level. For additional details on systems and techniques that may be used for traffic smoothing, see FIGS. 8-18 and the corresponding description below.
  • Client software may need to work with a variety of media types.
  • Client 450 may include a playback module
  • the playback thread interfaces with media decoder
  • Decoder 455 may be hardware and/or software based.
  • decoder 455 may include a CineCast hardware MPEG decoder, available from Vela Research. The CineCast decoder supports both MPEG-1 and MPEG-2 video, as well as two channel audio.
  • decoder 455 may include the Dxr2 PCI card from Creative Technology, which may be used to decompress both MPEG-1 and MPEG-2 video in hardware, as well as to decode MPEG audio and provide a 5.1 channel SP-DIF digital audio output terminal.
  • Decoder 455 may include a decoder called DivX;-) for decoding MPEG-4 media.
  • MPEG-4 generally provides a higher compression ratio than MPEG-2.
  • a typical 6 Mb/s MPEG-2 media file may only require an 800 Kb/s delivery rate when encoded with MPEG-4.
  • an MPEG-4 video stream was delivered at near NTSC quality to a residential client site via an ADSL connection.
  • HD media require a high transmission bandwidth.
  • a video resolution of 1920 x 1080 pixels encoded via MPEG-2 results in a data rate of 19.4 Mb/s.
  • mpeg2dec an open source software decoder
  • frame rates of about 20 frames per second were obtained using a dual-processor 933 MHz Pentium III, using unoptimized code.
  • with a Vela Research CineCast HD add-on board, full frame rate high definition video playback (e.g., 30 or 60 frames per second) was obtained at data rates up to about 45 Mb/s.
  • the examples given here are for illustrative purposes only; other decoders, frame rates, and data rates are possible.
  • a continuous media system such as system 300 of FIG. 3 or system 400 of FIG. 4 may run in two modes: master/slave mode (FIG. 6A), or bipartite mode (FIG. 6B).
  • master/slave mode FIGS. 6A and 6B
  • bipartite mode FIG. 6B
  • One technique to enable a server application to access storage resources located on multiple nodes is to introduce a distributed file system.
  • An application running on a specific node operates on all local and remote files via a network protocol to the corresponding node (for remote files) .
  • a client 650 sends a request for continuous media data to a server 602.
  • a particular node such as a node 610-C is designated as a master node for providing the requested data to client 650.
  • each node 610 may be capable of acting as a master node, while in other implementations, fewer than all of the nodes 610 may be capable of acting as a master node.
  • one of the capable nodes is designated as the master node for a particular client request; for example, using a round-robin domain name service (RR-DNS) or a load-balancing switch.
  • RR-DNS round- robin domain name service
  • the requested data may be distributed among the nodes 610-A through 610-D to maintain a balanced load.
  • a pseudorandom distribution may be used to distribute the data and to reduce the overhead required to store and retrieve the desired data.
  • blocks of the requested data are generally distributed among each of the nodes 610-A through 610-D.
  • Master node 610-C brokers the client request to slave nodes 610-A, 610-B, and 610-D.
  • a distributed file system application resident on the master node 610-C, which may include multiple input/output modules, requests and subsequently receives desired data from a distributed file system application resident on each of the slave nodes 610-A, 610-B, and 610-D.
  • a scheduler resident on master node 610-C schedules packet transmission to the client for all of the requested data. Thus, all of the data is channeled to client 650 through master node 610-C.
  • Exemplary software for this technique includes two components: a high-performance distributed file system application, and a media streaming server application.
  • the distributed file system may include multiple file input/output (I/O) modules located on each node.
  • the media streaming server application may include a scheduler, a real-time streaming protocol (RTSP) module, and a real-time protocol (RTP) module. In other implementations, other protocols may be used.
  • Each node 610-A through 610-D runs the distributed file system, while at least some nodes such as node 610-C also run the media streaming server application.
  • a particular master server node such as node 610-C is a point of contact for a client such as client 650 during a session.
  • a session may be a complete RTSP transaction for a continuous media stream.
  • when a client requests a data stream using RTSP, it is directed to a master server node, which in turn brokers the request to the slave nodes.
  • An advantage of a distributed file system is that applications need not be aware of the distributed nature of the storage system. Applications designed for a single node may, to some degree, take advantage of the cluster organization.
  • a media streaming server application for implementing a master/slave mode may be based on the Darwin Streaming Server (DSS) project by Apple Computer, Inc.
  • the media streaming server application assumes that all media data are located in a single, local directory. Enhanced with the distributed file system described here, multiple copies of DSS code (each running on its own master node) may share the same media data. This also simplifies client design, since all RTSP control commands may still be sent to only one server node.
  • although the master/slave configuration allows for ease of utilizing clustered storage, it may have a number of drawbacks. For example, the master node may become a bottleneck, the master node may be a single point of failure, and there may be heavy inter-node traffic.
  • the master/slave configuration becomes less practical as the number of nodes and/or the number of storage devices is scaled up, since the master node must generally request and receive data from each storage device (for load balancing purposes) .
  • the bipartite design below may be a better choice.
  • a bipartite configuration may be used rather than a master/slave configuration.
  • a bipartite configuration there are two groups of nodes, termed a server group and a client group.
  • a client 655 transmits a request for data to a server 604.
  • Server 604 includes multiple nodes such as nodes 615-A through 615-D.
  • each node 615 may include a distributed file system, RTSP module, RTP server module, and scheduler.
  • in response to a client request for media data, one node (e.g., node 615-C in FIG. 6B) is designated to be the source of control information for providing the requested data to client 655.
  • the RTSP module is centralized.
  • the RTP application, schedulers, and File I/O modules operate on each node 615-A through 615-D.
  • each node 615 may retrieve, schedule, and send local data blocks directly to the requesting client (again, note that packets of data transmitted from a node to a client may include less data than the block of data stored on the particular server node) . Therefore, there is no bottleneck of a master node, like there may be using the master/slave configuration. Additionally, inter-node traffic may also be significantly reduced using a bipartite configuration.
  • each client maintains contact with one RTSP module for the duration of a session, for control related information.
  • Each server node may include an RTSP module, and an RR-DNS or load-balancing switch may be used to decide which RTSP server to contact.
  • clients may communicate with individual nodes for retransmissions; thus, a simple RR-DNS cannot be used to make the server cluster appear as a single node.
  • the bipartite configuration may be quite robust; if an RTSP server fails, sessions need not be lost. Instead, they may be reassigned to another RTSP server so the delivery of data is generally uninterrupted.
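  • The direct, per-node delivery of the bipartite configuration can be illustrated with a short sketch. The following C fragment is illustrative only, not code from this disclosure: it assumes a simple modulo placement of blocks onto nodes (a pseudorandom placement could be substituted without changing the loop) and uses stub functions in place of the local file I/O and RTP modules. Each node walks the block sequence of a requested stream and transmits only its locally stored blocks directly to the client, so no master node relays the data.

      #include <stdio.h>
      #include <string.h>

      #define NUM_NODES  4
      #define BLOCK_SIZE 65536

      /* Illustrative placement rule: block i lives on node (i mod NUM_NODES). */
      static int node_for_block(long block) { return (int)(block % NUM_NODES); }

      /* Stubs standing in for the local file system read and the RTP send. */
      static int read_block(long block, char *buf)
      {
          (void)block;
          memset(buf, 0, BLOCK_SIZE);            /* pretend to read block data */
          return 0;
      }
      static void send_rtp_packet(const char *buf, long block)
      {
          (void)buf;
          printf("sending block %ld directly to the client\n", block);
      }

      /* Each node runs this loop for a requested stream and transmits only
         the blocks it stores locally, so no master node relays the data. */
      static void serve_stream(int my_node_id, long n_blocks)
      {
          static char buf[BLOCK_SIZE];
          for (long i = 0; i < n_blocks; i++) {
              if (node_for_block(i) != my_node_id)
                  continue;                       /* stored on another node */
              if (read_block(i, buf) == 0)
                  send_rtp_packet(buf, i);
          }
      }

      int main(void)
      {
          serve_stream(2, 16);                    /* node 2 of a 4-node server */
          return 0;
      }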
  • An adapted MPEG-4 file format as specified in MPEG-4 Version 2 may be used for the storage of media blocks. The adaptation of the current system expanded on the MPEG-4 format by allowing compressed media data other than MPEG-4 (for example, MPEG-2) to be encapsulated.
  • Flow control techniques implemented in the client-server communication protocol allow synchronization of multiple, independently stored media streams. Multi-stream synchronization may be important when, for example, video data and audio data are included in different streams and yet need to be synchronized during playback to the user.
  • a client configuration 700 is shown for an implementation including playback of panoramic, 5 channel video and 10.2 channel audio.
  • the five video channels originate from a 360-degree video camera system such as the FullView model from Panoram Technologies.
  • a first client 750-1 requests and receives the five video channels, where each video channel is encoded into a standard MPEG-2 program stream.
  • First client 750-1 includes a SCSI card.
  • a second client 750-2 requests and receives the 10.2 channels of high-quality, uncompressed audio.
  • the 0.2 of the 10.2 channels refers to two low-frequency channels for playback by, e.g., subwoofers.
  • Second client 750-2 includes a sound card. Note that in other implementations, a single client may request and receive data streams for both video and audio. [00153] Precise playback may be achieved using three levels of synchronization: (1) block-level via retrieval scheduling, (2) coarse-grained via the flow control protocol, and (3) fine-grained through hardware support. The flow control protocol allows approximately the same amount of data to be maintained in the client buffers.
  • the MPEG decoders may be lock-stepped to produce frame-accurate output using multiple CineCast decoders such as decoders 710-1 and 710-2, as well as a timing signal, which may be generated using a genlock timing signal generator device 720.
  • the timing signal is provided to decoders 710-1 and 710-2 (which, in this implementation, include an external trigger input which allows for accurate initiation of playback through software), as well as to a trigger unit 730 for the audio data.
  • the audio data is provided to an audio system 740, including an audio digital to analog (D/A) converter 741, a pre-amplifier 742, an audio power amplifier 743, and speakers 744.
  • speakers 744 include ten speakers and two subwoofers.
  • the video data is provided to a video system 760, including Panoram real-time video stitching equipment 761, and is displayed using a head-mounted display 762, a multi-screen display 763, or one or more other displays.
  • Using the random approach may reduce the startup latency, and may provide for a more balanced server load.
  • the random approach may require storage of a large amount of meta-data: generally, the location of each block X_i is stored and managed in a centralized repository (e.g., tuples of the form <node_x, disk_y>).
  • Scalability refers to the ease with which the capacity of a system may be changed. Usually, it refers to the ease with which the capacity may be increased to satisfy growth in user demand and/or increased application demands. Heterogeneity refers to the ability of a system to make use of storage devices and nodes with different bandwidth and storage capacities. Fault-resilience refers to the ability of a system to overcome a fault within the system.
  • the current system may provide for enhanced scalability over prior systems.
  • adding more storage to the system entails moving only a fraction of the stored data.
  • almost all of the data blocks may need to be relocated.
  • only the new seed may need to be stored.
  • the random technique may require storing meta-data for the position of each block.
  • Scalability may also be enhanced by using the bipartite mode described herein.
  • the number of nodes included in a server may be larger than the number of nodes that may practically be used in a master/slave mode.
  • operating a continuous media system using the master/slave mode requires inter-node communication.
  • the amount of inter-node communication increases.
  • the amount of inter-node traffic will exceed the ability of the system to provide the requested data to the client in a timely manner.
  • the continuous media system illustrated in FIG. 3 provides a modular design that may easily be expanded.
  • multi-disk arrays may be employed.
  • multiple nodes may be used, where commodity personal computers (PCs) may be used for one or more of the nodes.
  • This modular architecture is both scalable and cost-effective.
  • a parity-based data redundancy scheme may be used.
  • a continuous media system such as system 300 of FIG. 3
  • a distributed file system may provide a complete view of all data on each node, without the need to replicate individual data blocks.
  • data redundancy may improve the system's ability to provide continuous media data to clients twenty-four hours a day.
  • the data redundancy scheme may take advantage of a heterogeneous storage subsystem through a technique called disk merging.
  • Disk merging presents a virtual view of logical disks on top of the actual physical storage system which may include disks with different bandwidths and storage space. The system's application layers may then assume a uniform characteristic for all of the logical disks. Using this abstraction, conventional scheduling and data placement algorithms may be used.
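  • Disk merging can be sketched briefly. The following C fragment is an illustrative sketch only and does not reproduce the exact mapping used by the disclosed system: each physical disk is assigned a number of logical disks proportional to its bandwidth relative to the slowest disk (the bandwidth figures are assumed), and a table maps each logical disk to the physical disk backing it, so higher application layers can schedule against uniform logical disks.

      #include <stdio.h>

      #define MAX_LOGICAL 64

      /* Example heterogeneous disks: bandwidth in MB/s (assumed figures). */
      static const double disk_bw[] = { 10.0, 20.0, 40.0 };
      #define NUM_PHYSICAL (int)(sizeof(disk_bw) / sizeof(disk_bw[0]))

      /* Build a table mapping logical disk -> physical disk, giving each
         physical disk a share of logical disks proportional to its bandwidth
         relative to the slowest disk. Returns the number of logical disks. */
      static int build_disk_merging_table(int table[])
      {
          double slowest = disk_bw[0];
          for (int d = 1; d < NUM_PHYSICAL; d++)
              if (disk_bw[d] < slowest)
                  slowest = disk_bw[d];

          int logical = 0;
          for (int d = 0; d < NUM_PHYSICAL; d++) {
              int share = (int)(disk_bw[d] / slowest + 0.5);  /* rounded share */
              for (int k = 0; k < share && logical < MAX_LOGICAL; k++)
                  table[logical++] = d;
          }
          return logical;
      }

      int main(void)
      {
          int table[MAX_LOGICAL];
          int n = build_disk_merging_table(table);
          for (int l = 0; l < n; l++)
              printf("logical disk %d -> physical disk %d\n", l, table[l]);
          return 0;
      }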
  • RTP/UDP and Selective Retransmission [00171]
  • a continuous media system such as system 300 of FIG. 3 may support industry standard real-time protocol (RTP) for the delivery of time-sensitive data.
  • because RTP is typically carried over the User Datagram Protocol (UDP), a data packet could arrive out of order at the client or be dropped altogether along the network.
  • a selective retransmission protocol may be implemented.
  • the protocol may be configured to attempt at most one retransmission of each lost RTP packet only if the retransmitted packet would arrive in time for consumption.
  • an additional problem may arise. If a data packet does not arrive, the client may not know which server node attempted to send it. That is, the client may not know where to direct a retransmission request. Solutions to this problem include having the client compute which server node transmitted the lost packet, as well as having the client broadcast the retransmission request to all the server nodes. [00173] Broadcast Approach
  • the request may be broadcast. Broadcasting the packet retransmission request to all of the server nodes generally places less load on the client. Using this technique, the client does not need to determine which node transmitted the lost packet; instead, each of the nodes receives the request, checks whether it holds the packet, and either ignores the request or performs a retransmission. Thus, the client remains unaware of the server sub-layers. However, the broadcast approach may waste network bandwidth and increase server load. [00175] Unicast Approach
  • a unicast retransmission technique may be more efficient and more scalable than the broadcast technique.
  • a method of identifying the node is needed. Different methods may be used to identify the appropriate node.
  • the client may regenerate the pseudorandom number sequence and thereby determine the appropriate node.
  • the client may use a small amount of meta-data and bookkeeping to send retransmission requests to the specific server node possessing the requested packet.
  • a process 800 for transmitting portions of a data stream to a client in a sequence includes assigning a node-specific packet sequence number, referred to as a local sequence number (LSN), to a packet (810), in addition to the global sequence number (GSN).
  • the client stores the LSN values for received packets (820) , and subsequently determines whether there is a gap in the sequence of LSN (830) . If a gap exists, the client determines the identity of the particular server node that transmitted the lost packet using the missing LSN (840) . Subsequently, the client sends a retransmission request to the particular server node (850) .
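  • The bookkeeping of process 800 can be sketched in a few lines of C. This is an illustrative sketch under simplifying assumptions: each packet is assumed to carry the identifier of its sending node together with its LSN, per-node LSNs are assumed to increase by one per transmitted packet, and send_retransmission_request() is a hypothetical stand-in for the actual request message.

      #include <stdio.h>

      #define NUM_NODES 4

      /* Next local sequence number expected from each server node. */
      static long expected_lsn[NUM_NODES];

      /* Hypothetical stand-in for sending a retransmission request for the
         packet with local sequence number lsn to server node 'node'. */
      static void send_retransmission_request(int node, long lsn)
      {
          printf("request retransmission of LSN %ld from node %d\n", lsn, node);
      }

      /* Called for every received packet. 'node' identifies the sending
         server node and 'lsn' is the node-specific local sequence number.
         A gap in a node's LSNs identifies both the lost packet and the node
         that sent it, so the request can be unicast to that node (840, 850). */
      static void on_packet(int node, long lsn)
      {
          for (long missing = expected_lsn[node]; missing < lsn; missing++)
              send_retransmission_request(node, missing);   /* gap detected (830) */
          if (lsn >= expected_lsn[node])
              expected_lsn[node] = lsn + 1;                 /* store progress (820) */
      }

      int main(void)
      {
          /* Packets from node 1 with LSNs 0, 1, 3: LSN 2 was lost. */
          on_packet(1, 0);
          on_packet(1, 1);
          on_packet(1, 3);    /* triggers a unicast retransmission request */
          return 0;
      }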
  • server-controlled or client-controlled algorithms may be used for transmission control of streaming continuous media data.
  • Server-controlled techniques may have several disadvantages. For example, they may not work with live streams where only a limited rate history is available. Additionally, they may not adjust to changing network conditions, and they may get disrupted when users invoke interactive commands such as pause, rewind, and fast forward.
  • Client-controlled algorithms may be a better choice in a dynamic environment.
  • a client-controlled technique may more easily adapt to changing network conditions.
  • a simpler and more flexible architecture may be used, since the server does not need to be aware of the content format of the stream. Therefore, new media types such as "haptic" data can automatically be supported without modification of the server software.
  • a system 900 includes a server 910, a network 920, and a client 930.
  • a continuous media system that may be used, see FIGS. 3-8 and the corresponding description above.
  • client 930 includes a client buffer 940, with a capacity equal to B.
  • Client buffer 940 may be used to store data prior to decoding and display/playback.
  • when client 930 is receiving streaming video data from server 910, buffer 940 stores data to be subsequently decoded by a media decoder and displayed to the user. If buffer 940 overflows, some of the data may be lost, and there may be a "hiccup" in the display. Similarly, if buffer 940 empties (i.e., "starves"), there may be a hiccup in the display until additional data is received. Therefore, managing the buffer level is important to providing a high quality display or playback to an end user. [00186] Client 930 (and/or associated machines) also includes circuitry and/or software for implementing multi-threshold flow control.
  • client 930 can receive streaming media data from the server (i.e., client 930 has a network connection) , can store at least some of the data in buffer 940 prior to decoding using a decoder (e.g., implemented in hardware and/or software) , can determine the buffer level at different times, and can determine whether one or more thresholds has been passed since a previous determination of a buffer level.
  • Client 930 also includes circuitry and/or software to implement a prediction algorithm, and to determine a new server sending rate and/or rate change, and to transmit the server transmission information to server 910.
  • server 910 (and/or one or more associated machines) includes circuits and/or software for transmitting continuous media data to one or more clients, for receiving communications from the one or more clients, and for updating a server transmission rate based on transmission information contained in a communication from the one or more clients.
  • Buffer Model
  • a model 1000 of a buffer includes a number of watermarks 1010.
  • Watermarks 1010 include an underflow protection watermark 1010-U and an overflow protection watermark 1010-O.
  • Watermarks 1010 include one or more intermediate watermarks Wi such as watermarks 1010-1 to 1010-N of FIG. 10.
  • Watermark 1010-U is set at an underflow threshold protection level; that is, a percentage of the buffer capacity that indicates that buffer starvation may be imminent.
  • Watermark 1010-1 marks a low buffer warning threshold. When the buffer level falls below watermark 1010-1, the buffer is nearing starvation.
  • watermark 1010-O is set at an overflow threshold protection level; that is, a percentage of the buffer capacity that indicates that buffer overflow may be imminent.
  • the overflow threshold protection level may be the same as or different than the underflow threshold protection level.
  • Watermark 1010-N is the overflow buffer warning threshold. [00192] Using a model such as buffer model 1000 may allow smooth streaming of media data from the server to the client. The number of intermediate watermarks N may be varied to provide greater control over the buffer level.
  • the watermark spacing may be equidistant, based on an arithmetic series (see FIG. 11A) , based on a geometric series (see FIG. 11B) , or may be set using a different method.
  • FIG. 11C shows the normalized throughput obtained using each of the three above-mentioned spacing methods.
  • the underflow and overflow thresholds may first be determined.
  • the underflow threshold may be set as 5% of the buffer capacity
  • the overflow threshold may be set as 95% of the buffer capacity.
  • the number of intermediate watermarks N may be chosen (e.g., selected or predetermined), with N ≥ 1. More typically, the number of intermediate watermarks is greater than one, for smoother traffic (see, e.g., FIG. 15 and the related discussion, below).
  • the threshold for the underflow watermark is denoted as W_U
  • the threshold for the overflow watermark is denoted as W_O
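  • The watermark levels themselves may be computed according to the spacing schemes mentioned above. The following C sketch is illustrative only: it assumes the 5% underflow and 95% overflow thresholds given above and fills in N intermediate watermarks between them, with a ratio parameter of 1.0 producing equidistant spacing and other values producing geometrically growing gaps; the exact spacing formulas of FIGS. 11A and 11B are not reproduced here.

      #include <stdio.h>
      #include <math.h>

      /* Compute n intermediate watermarks wm[0..n-1] (as fractions of the
         buffer capacity) between an underflow threshold w_u and an overflow
         threshold w_o. 'ratio' = 1.0 gives equidistant spacing; other values
         make the gap between consecutive watermarks grow geometrically. */
      static void compute_watermarks(double w_u, double w_o, int n,
                                     double ratio, double wm[])
      {
          double total = 0.0;
          for (int i = 0; i <= n; i++)            /* n + 1 gaps between n marks */
              total += pow(ratio, i);

          double level = w_u;
          for (int i = 0; i < n; i++) {
              level += (w_o - w_u) * pow(ratio, i) / total;
              wm[i] = level;
          }
      }

      int main(void)
      {
          double wm[8];
          /* Assumed bounds: underflow at 5% and overflow at 95% of capacity. */
          compute_watermarks(0.05, 0.95, 8, 1.0, wm);   /* equidistant */
          for (int i = 0; i < 8; i++)
              printf("W%d at %.1f%% of buffer capacity\n", i + 1, wm[i] * 100.0);

          compute_watermarks(0.05, 0.95, 8, 1.3, wm);   /* geometric spacing */
          for (int i = 0; i < 8; i++)
              printf("geometric W%d at %.1f%%\n", i + 1, wm[i] * 100.0);
          return 0;
      }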
  • whenever a threshold W_i is crossed, a new server sending rate may be calculated, and a rate adjustment (or equivalently, a new server sending rate) may be sent to the server.
  • the information may be sent to the server using an RTSP feedback command.
  • the action taken may depend on which threshold has been crossed. For example, if the W_U or W_O thresholds are crossed, more aggressive action may be taken than if one of the intermediate thresholds W_i is crossed. Similarly, if the warning thresholds W_1 or W_N are crossed, the action taken may be more aggressive than if a different intermediate threshold W_i had been crossed, but may be less aggressive than if W_U or W_O had been crossed.
  • the server may be paused (i.e., the sending rate may be set to zero) , or its sending rate substantially decreased.
  • the server may remain paused until the buffer level crosses a particular threshold or a particular value (e.g., the N/2 threshold, or the mid-point of the buffer capacity) .
  • the server sending rate may be increased substantially; for example, it may be increased to about one and a half times the average server sending rate until the buffer level reaches a particular value or threshold.
  • new server sending information may be determined by choosing particular rate change amounts or by calculating new server sending information as described below.
  • the rate change amounts may be predetermined.
  • the interval between packets may be set to 20% less than a default interval for W_1, to 10% less than the default interval for W_2, to the default interval for W_3, to 10% greater than the default interval for W_4, and to 20% greater than the default interval for W_5.
  • Overflow and underflow thresholds may be chosen (1210) .
  • the number and spacing of intermediate watermarks may be selected (1220) .
  • the spacing may be selected according to one of the spacing schemes described above, or may be set in a different way (e.g., chosen).
  • the server transmits streaming media data to a client at a server transmission rate (1230) .
  • the client receives the streaming media data and stores at least some of the data in a buffer prior to decoding (1240) .
  • the buffer level is determined and compared to the previous buffer level to determine whether one or more thresholds has been crossed (1250) .
  • server transmission information (e.g., a new server transmission rate and/or a rate change) is calculated (1260), and if it is different from the previous server transmission rate, the server transmission information is communicated to the server (1270) . Note that the method steps of FIG. 12 need not be performed in the order given. [00204] Techniques to provide data transmission smoothing may use a number of different components and variables. Table 2 includes a list of the parameters used herein.
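  • The client-side steps (1240) through (1270) can be summarized in a short control loop. The following C sketch is illustrative only: the watermark values and compute_new_rate() are hypothetical placeholders, and the actual rate computation of Equations (9A) and (9B) below is not reproduced; the sketch only shows how a watermark crossing between two successive buffer-level samples triggers a feedback message to the server.

      #include <stdio.h>

      #define N_THRESHOLDS 5

      /* Watermark levels as fractions of the buffer capacity (assumed values;
         the first and last play the role of underflow/overflow protection). */
      static const double watermark[N_THRESHOLDS] = { 0.05, 0.25, 0.50, 0.75, 0.95 };

      /* Placeholder for Equations (9A)/(9B): nudge the rate so the buffer
         drifts back toward mid-level. The real computation is not reproduced. */
      static double compute_new_rate(double current_rate, double level)
      {
          double target = 0.50;
          return current_rate * (1.0 + (target - level));
      }

      /* One sampling step (1250)-(1270): if any watermark lies between the
         previous and current buffer level, compute a new rate and, if it
         differs, report it to the server. Returns the rate now in use. */
      static double on_buffer_sample(double prev_level, double cur_level,
                                     double server_rate)
      {
          double lo = prev_level < cur_level ? prev_level : cur_level;
          double hi = prev_level < cur_level ? cur_level : prev_level;

          for (int i = 0; i < N_THRESHOLDS; i++) {
              if (watermark[i] >= lo && watermark[i] <= hi) {
                  double new_rate = compute_new_rate(server_rate, cur_level);
                  if (new_rate != server_rate)
                      printf("feedback: set sending rate to %.2f Mb/s\n", new_rate);
                  return new_rate;
              }
          }
          return server_rate;        /* no watermark crossed, no feedback sent */
      }

      int main(void)
      {
          double rate = 4.0;                          /* Mb/s, assumed */
          rate = on_buffer_sample(0.40, 0.55, rate);  /* crosses the 0.50 mark */
          rate = on_buffer_sample(0.55, 0.60, rate);  /* no watermark crossed  */
          return 0;
      }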
  • the server sending rate, the decoder consumption rate, and the buffer level may be sampled at time intervals equal to Δt_obs. If the observed buffer level b_obs crosses any of the thresholds W_i, a new server sending rate is computed using Equation (9A), and the related rate change Δr is shown in Equation (9B); these equations take into account quantities such as the predicted consumption C, the observed buffer level b_obs, and the feedback delay d_feedback.
  • Equation (10) shows how C is related to the predicted future consumption rates r̂_i (the prediction of future consumption rates is discussed more fully below).
  • the computed rate change Δr may not be sufficient to avoid reaching W_O and W_U, respectively, due to the error margin of the prediction algorithms. Although the error margin may be reduced, doing so adds computational complexity that may not be desired in certain situations.
  • Equation (11A) shows how an adjusted Δr may be calculated when W_N is reached
  • Equation (11B) shows how an adjusted Δr may be calculated when W_1 is reached.
  • Equation (12) shows how a mean absolute percentage error (MAPE) value may be computed.
  • P is the number of prediction samples up to the current prediction time.
  • MAPE = (1/P) · Σ_{i=1..P} |(r_i − r̂_i) / r_i|     Equation (12)
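  • As a concrete illustration of Equation (12), the following small C function computes a MAPE value from P observed rates and the corresponding predicted rates; it assumes the observed rates are non-zero, and the sample values are made up for the example.

      #include <math.h>
      #include <stdio.h>

      /* Mean absolute percentage error over p samples of observed rates r[]
         and predicted rates r_hat[] (Equation (12)); r[i] must be non-zero. */
      static double mape(const double r[], const double r_hat[], int p)
      {
          double sum = 0.0;
          for (int i = 0; i < p; i++)
              sum += fabs((r[i] - r_hat[i]) / r[i]);
          return sum / (double)p;
      }

      int main(void)
      {
          double r[]     = { 4.0, 5.0, 6.0 };   /* observed consumption rates */
          double r_hat[] = { 4.2, 4.8, 6.3 };   /* predicted consumption rates */
          printf("MAPE = %.3f\n", mape(r, r_hat, 3));
          return 0;
      }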
  • live streams (e.g., streams that are being produced and transmitted as the events they depict—such as a live concert or distance learning session—occur)
  • Consumption rate prediction may observe the w_obs most recent rate samples to predict w_pred samples into the future.
  • the 10 previous rate samples may be used to predict the rate 2 samples into the future.
  • the observation window R includes the w_obs previous rate values <r_1, r_2, ..., r_{w_obs}>, while the prediction window R̂ includes the w_pred predicted rate values <r̂_1, r̂_2, ..., r̂_{w_pred}>.
  • the estimated future rate is denoted r̂.
  • Prediction algorithms may be based on a number of different schemes. For example, an average consumption rate algorithm may be used, an exponential average algorithm may be used, or a fuzzy exponential average algorithm may be used.
  • An average consumption rate algorithm may predict the average consumption rate of the prediction window R̂ using the average consumption rate of the observation window R, according to Equation (13): r̂ = (1/w_obs) · (r_1 + r_2 + ... + r_{w_obs})
  • An exponential average consumption rate algorithm may be used to give more weight to some samples in the observation window than to others.
  • The estimated future rate is then given by Equation (15) below.
  • r̂ = SCR[w_obs + 1]     Equation (15)
  • There are two variations in applying this algorithm to forecast the future consumption rates during the prediction window R̂.
  • the first variation, which will be referred to as the "expanding window exponential average algorithm," predicts r̂_i based on an increasing window <R, r̂_1, r̂_2, ..., r̂_{i−1}> using Equation (15).
  • the expanding window exponential average algorithm increases the window size by one sample each time a new r̂_i is generated.
  • the second variation, which will be referred to as the "sliding window exponential average algorithm," keeps the window size constant and slides the observation window R forward when a new r̂_i is generated.
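  • The two exponential-average variations can be sketched as follows. This C fragment is illustrative only: Equation (14) is not reproduced in this excerpt, so the recurrence below assumes a conventional exponential average in which a large α_cr keeps more weight on older samples (consistent with the description of α_cr below); the expanding-window variant appends each new prediction to its window, while the sliding-window variant drops the oldest sample to keep the window size constant.

      #include <stdio.h>

      /* Exponential average over a window of rate samples: with a large alpha
         more weight is kept on older samples, with a small alpha the most
         recent samples dominate (assumed form of Equation (14)). */
      static double exp_average(const double r[], int n, double alpha)
      {
          double scr = r[0];
          for (int i = 1; i < n; i++)
              scr = alpha * scr + (1.0 - alpha) * r[i];
          return scr;                    /* plays the role of SCR[w_obs + 1] */
      }

      /* Predict w_pred future rates from the w_obs observed rates in obs[].
         If expanding != 0, each prediction is appended to the window
         (expanding window); otherwise the oldest value is dropped so the
         window size stays constant (sliding window). */
      static void predict(const double obs[], int w_obs, double pred[],
                          int w_pred, double alpha, int expanding)
      {
          double win[64];
          int n = w_obs;
          for (int i = 0; i < w_obs; i++)
              win[i] = obs[i];

          for (int k = 0; k < w_pred; k++) {
              pred[k] = exp_average(win, n, alpha);
              if (expanding && n < 64) {
                  win[n++] = pred[k];                 /* grow the window */
              } else {
                  for (int i = 1; i < n; i++)         /* slide the window */
                      win[i - 1] = win[i];
                  win[n - 1] = pred[k];
              }
          }
      }

      int main(void)
      {
          double obs[10] = { 4.0, 4.1, 4.3, 4.0, 3.9, 4.2, 4.4, 4.1, 4.0, 4.2 };
          double pred[2];
          predict(obs, 10, pred, 2, 0.7, 1);          /* expanding window */
          printf("predicted rates: %.3f %.3f\n", pred[0], pred[1]);
          return 0;
      }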
  • a fuzzy exponential average consumption rate algorithm may be used to generate the r̂_i by combining a fuzzy logic controller with the window exponential average algorithm.
  • the parameter α_cr is dynamically calculated.
  • the parameter α_cr controls the weight given to different samples. When α_cr is large, more weight is given to past samples. When α_cr is small, more weight is given to the more recent samples. Therefore, if the variability in the consumption rate in the system is small (i.e., the bit rate of the stream is fairly constant), the prediction error should be small, and a large α_cr may be used. On the other hand, if the variability is large (e.g., the stream is bursty), a small α_cr is appropriate, so that more recent sample data is weighted more heavily.
  • a schematic 1300 of the fuzzy exponential average algorithm is shown.
  • Variability information 1310 is provided to a fuzzy logic controller 1320.
  • Fuzzy logic controller 1320 produces weighting factor α_cr 1330 based on the variability information.
  • the value for α_cr 1330 and observation window information 1340 are provided to the exponential average prediction algorithm 1350, which outputs the prediction window information 1360.
  • the variability of a stream may be characterized by a normalized variance var, calculated according to Equation (16) below.
  • Referring to FIGS. 14A and 14B, membership functions for the variable var (FIG. 14A) and the variable α_cr (FIG. 14B) are shown.
  • the following fuzzy control rules may be applied to the low, medium, and high regions of var data and α_cr data: if var is low, then α_cr is high; if var is high, then α_cr is low; if var is medium, then α_cr is medium. Of course, more complicated schemes may be used.
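  • A minimal realization of these three rules is sketched below in C. It is illustrative only: the membership functions of FIGS. 14A and 14B are not reproduced here, so the sketch assumes simple triangular membership functions over a normalized range of var and representative low, medium, and high values of α_cr, combined by a weighted-average defuzzification.

      #include <stdio.h>

      /* Triangular membership function centered at c with half-width w. */
      static double tri(double x, double c, double w)
      {
          double d = x < c ? c - x : x - c;
          return d >= w ? 0.0 : 1.0 - d / w;
      }

      /* Fuzzy controller for alpha_cr: if var is low, alpha_cr is high; if var
         is medium, alpha_cr is medium; if var is high, alpha_cr is low. The
         membership centers, widths, and output values are assumed figures. */
      static double fuzzy_alpha_cr(double var)
      {
          double m_low  = tri(var, 0.0, 0.5);    /* membership of var in "low"    */
          double m_med  = tri(var, 0.5, 0.5);    /* membership of var in "medium" */
          double m_high = tri(var, 1.0, 0.5);    /* membership of var in "high"   */

          const double a_high = 0.9, a_med = 0.5, a_low = 0.1;
          double num = m_low * a_high + m_med * a_med + m_high * a_low;
          double den = m_low + m_med + m_high;
          return den > 0.0 ? num / den : a_med;  /* weighted-average defuzzify */
      }

      int main(void)
      {
          printf("nearly constant stream: alpha_cr = %.2f\n", fuzzy_alpha_cr(0.1));
          printf("bursty stream:          alpha_cr = %.2f\n", fuzzy_alpha_cr(0.9));
          return 0;
      }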
  • the round-trip feedback message delay is an important factor in the transmission rate smoothing.
  • the delay may be configured to be a conservatively estimated constant delay, or may be based on one or more measurements.
  • the delay may be estimated dynamically, based on a prediction algorithm, to more closely reflect the transmission delay in the network.
  • the systems and techniques described herein can provide a number of benefits for streaming media data.
  • the normalized throughput standard deviation is shown for transmission of media data for the movie Twister, using a prediction window size of 90 seconds, three different buffer sizes (8, 16, and 32 MB), and for six different transmission methods.
  • complete rate information about the bit rate is known.
  • MTFC was implemented with different numbers of intermediate thresholds.
  • no smoothing was implemented.
  • As FIG. 15 demonstrates, using MTFC according to the current systems and techniques provides a significant benefit over unsmoothed transmission of streaming data, with larger numbers of thresholds generally corresponding to smoother traffic. FIG. 15 also demonstrates that providing a larger buffer provides for smoother traffic.
  • FIGS. 16A through 16C show the effect of varying the prediction window, for buffer sizes of 8 MB, 16 MB, and 32 MB, and for various numbers of intermediate thresholds.
  • larger prediction windows provide smoother traffic for larger buffer sizes, but not for smaller buffer sizes. This may be due to the fact that, for the same number of thresholds, the change in buffer level that triggers a change in server sending rate is much smaller (e.g., for a set number of thresholds, the "distance" between the thresholds in an 8 MB buffer is about 1/4 of the distance between the thresholds in a 32 MB buffer) . Therefore, with a larger buffer (with more distance between thresholds) there may be longer segments at a constant rate than with a smaller buffer having the same number of thresholds.
  • Feedback messages from the client to the server introduce overhead.
  • the overhead may be reduced by reducing the number of rate changes.
  • there may be a trade-off between the number of rate changes and the smoothness of the traffic.
  • an increase in the number of thresholds for a particular buffer size increases the number of rate changes (but also may provide for smoother traffic; see, e.g., FIG. 15). Additionally, for the same number of thresholds, larger buffers have fewer rate changes. Referring to FIGS. 18A-18C, a longer prediction window generally results in fewer rate changes. [00241]
  • Various implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits) , computer hardware, firmware, software, and/or combinations thereof.
  • implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, configured to receive and/or transmit data and instructions, at least one input device, and at least one output device.

Abstract

Systems and techniques to pseudorandomly place and redistribute data blocks in a storage system (100), and also to implement continuous media systems and systems to transfer media streams between a server and a client. In general, in one implementation, the techniques include: distributing data blocks over multiple storage devices (100) according to a reproducible pseudorandom sequence that provides load balancing across the storage devices (160), and determining current storage locations of the data blocks by reproducing the pseudorandom sequence (110).

Description

PSEUDORANDOM DATA STORAGE
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of the priority of U.S. Provisional Application Serial No. 60/351,998, filed January 25, 2002 and entitled "SCADDAR: AN EFFICIENT RANDOMIZED TECHNIQUE TO REORGANIZE CONTINUOUS MEDIA BLOCKS", and the benefit of priority of U.S. Provisional Patent Application No. 60/351,656, entitled "YIMA JADE: A SCALABLE LOW-COST STREAMING MEDIA SYSTEM", filed on January 24, 2002, and the benefit of priority of U.S. Provisional Patent Application No. XX/XXX,XXX, filed January 17, 2003 with attorney reference number 06666-127P02, entitled "RETRANSMISSION-BASED ERROR CONTROL IN A MANY-TO-MANY CLIENT-SERVER ENVIRONMENT", and the benefit of priority of U.S. Provisional Patent Application No. 60/352,071, entitled "A MULTI-THRESHOLD ONLINE SMOOTHING TECHNIQUE FOR VARIABLE RATE MULTIMEDIA STREAMS," filed on January 25, 2002.
STATEMENT AS TO FEDERALLY SPONSORED RESEARCH
[0002] The invention described herein was made in the performance of work funded in part by NSF grants EEC-9529152 (IMSC ERC) and IIS-0082826 and NIH-NLM grant no. R01-LM07061 and is subject to the provisions of Public Law 96-517 (35 U.S.C. 202) in which the contractor has elected to retain title.
BACKGROUND
[0003] The present disclosure describes systems and techniques relating to storing data on multiple storage devices, such as in the context of continuous media systems and also in the context of multimedia stream delivery. [0004] When storing large amounts of data on multiple storage devices, data placement and retrieval scheduling can be important for the overall efficiency and utility of the storage system. A continuous media (CM) server is an example of a storage system where data placement can be of particular importance. CM objects, such as video and audio files, can be quite large and have significant bandwidth requirements. Moreover, CM servers typically must handle large numbers of simultaneous users.
[0005] A common solution for CM servers involves breaking CM objects into fixed-sized blocks which are distributed across all the disks in the system. Conventional data placement techniques for CM servers, such as round-robin striping, RAID (Redundant Array of Inexpensive Disks) striping and various hybrid approaches, can be categorized as constrained placement approaches. In a constrained placement approach, the location of a data block is fixed and determined by the placement algorithm. [0006] By contrast, a non-constrained placement approach involves maintaining a directory system to keep track of the location of the blocks, thus allowing the blocks to be placed in any location desired and moved as needed. For example, in random placement, a block's location is randomly assigned and then tracked using a directory system. Random placement can provide load balancing by the law of large numbers. Moreover, when performing data access, random placement can eliminate the need for synchronous access cycles, provide a single traffic pattern, and can support unpredictable access patterns, such as those generated by interactive applications or VCR-type operations on CM streams.
[0007] Moreover, continuous media systems may be used to provide real-time data, such as video data, audio data, haptic data, avatar data, and application coordination data, to end users. Continuous media systems face a number of challenges. First, the systems may need to be able to transmit the data from a storage location to a client location so that the client can display, playback, or otherwise use the data in real time. For example, the systems may need to provide streaming video data for realtime display of a movie. If the real-time constraints are not satisfied, the display may suffer from disruptions and delays, termed "hiccups." In order to reduce disruptions, continuous media clients generally include a buffer for storing at least a portion of the media data prior to display to the user.
[0008] Additionally, continuous media systems need to deal with large data objects. For example, a two-hour MPEG-2 video with a 4 Megabit per second (Mb/s) bandwidth requirement is about 3.6 Gigabytes (GB) in size. [0009] Available continuous media servers generally fall within one of two categories: single-node, consumer oriented systems (for example, low-cost systems serving a limited number of users), and multi-node, carrier class systems (for example, high-end broadcasting and dedicated video-on-demand systems) .
[0010] Many multimedia applications, such as news-on- demand, distance learning, and corporate training, rely on the efficient transfer of pre-recorded or live multimedia streams between a server computer and a client computer. These media streams may be captured and displayed at a predetermined rate. For example, video streams may require a rate of 24, 29.97, 30, or 60 frames per second. Audio streams may require 44,100 or 48,000 samples per second. An important measure of quality for such multimedia communications is the precisely timed playback of the streams at the client location.
[0011] Achieving this precise playback is complicated by the popular use of variable bit rate (VBR) media stream compression. VBR encoding algorithms allocate more bits per time to complex parts of a stream and fewer bits to simple parts, in order to keep the visual and aural quality reasonably uniform. For example, an action sequence in a movie may require more bits per second than the credits that are displayed at the end.
[0012] VBR compression may result in bursty network traffic and uneven resource utilization when streaming media. Additionally, due to the different transmission rates that may occur over the length of a media stream, transmission control techniques may need to be implemented so that a client buffer neither underflows nor overflows. Transmission control schemes generally fall within one of two categories: they may be server-controlled or client-controlled. [0013] Server-controlled techniques generally pre-compute a transmission schedule for a media stream based on a substantial knowledge of its rate requirements. The variability in the stream bandwidth is smoothed by computing a transmission schedule that consists of a number of constant-rate segments. The segment lengths are calculated such that neither a client buffer overflow nor an underflow will occur.
[0014] Server-controlled algorithms may use one or more optimization criteria. For example, the algorithm may minimize the number of rate changes in the transmission schedule, may minimize the utilization of the client buffer, may minimize the peak rate, or may minimize the number of on-off segments in an on-off transmission model. The algorithm may require that complete or partial traffic statistics be known a-priori.
[0015] Client-controlled algorithms may be used rather than server-controlled algorithms. In a client-controlled algorithm, the client provides the server with feedback, instructing the server to increase or decrease its transmission rate in order to avoid buffer overflow or starvation.
SUMMARY
[0016] The present disclosure includes systems and techniques relating to storing data across multiple storage devices using a pseudorandom number generator, and also to various implementations of continuous media systems and systems to transfer media streams between a server and a client. According to an aspect, data blocks can be distributed over multiple storage devices according to a reproducible pseudorandom sequence that provides load balancing across the storage devices, and current storage locations of the data blocks can be determined by reproducing the pseudorandom sequence. According to another aspect, data blocks can be distributed over multiple storage devices according to a reproducible pseudorandom sequence, a selected subset of the data blocks can be pseudorandomly redistributed and information describing a storage scaling operation can be saved in response to initiation of the storage scaling operation, current storage locations can be determined based on the pseudorandom sequence and the saved scaling operation information, and the data blocks can be accessed according to the determined current storage locations .
[0017] The systems and techniques described can result in high effective storage utilization, and the locations of data blocks can be calculated quickly and efficiently both before and after a scaling operation. A series of remap functions may be used to derive the current location of a block using the pre-scaling location of the block as a basis. Redistributed blocks may still be retrieved, as normal, through relatively low complexity computation. The new locations of blocks can be computed on the fly for each block access by using a series of inexpensive mod and div functions. Randomized block placement can be maintained for successive scaling operations, which, in turn, preserves load balancing across the disks, and the amount of block movement can be minimized during redistribution, while maintaining an overall uniform distribution, providing load balancing during access after multiple scaling operations. [0018] Scaling can be performed while the storage system stays online. As media sizes, bandwidth requirements, and media libraries increase, scaling can be performed without requiring downtime and without significantly affecting uptime services. This can be particularly advantageous in the CM server context, where data can be efficiently redistributed to newly added disks without interruption to the activity of the CM server, minimizing downtime and minimizing the impact on services of redistribution during uptime.
[0019] A storage system can be provided with a fully scalable architecture, thus reducing the importance of an initial assessment of the future amount of capacity and/or bandwidth needed. Even after many scaling operations, all blocks can be located with only one disk access. Moreover, groups of disks can be added or removed all at once in a scaling operation, and multiple scaling operations of different types (adds or removes) can be performed over time while still maintaining efficient access times to data. [0020] Moreover, these placement and scaling techniques can be applied to multiple types of systems, including a system such as described below in this summary and also in connection with FIGS. 3-18. In general, in one aspect, a system includes a plurality of data processing devices, with each data processing device coupled with at least one of a plurality of storage devices to store data. [0021] Each of the data processing devices may include a module to retrieve a data segment from one of the coupled storage devices. Each of the data processing devices may include a module to schedule transmission of the data segment to a client in sequence with other data segments. Each of the data processing devices may include a module to transmit the data segment to the client and not to another of the data processing devices.
[0022] At least one of the data processing devices may include a module to provide control information to transmit a data stream to a client, where the data stream comprises a sequence of data segments.
[0023] The modules may be implemented in software and/or hardware. The modules may be implemented as circuitry in one or more integrated circuits. Each of the data processing devices may be implemented as one or more integrated circuits; for example, each may be implemented in a central processing unit (CPU) . [0024] The system may also include a module to place data segments on the storage devices. The data segments may be placed using a round-robin placement technique, a random technique, or a pseudorandom technique.
[0025] The system may further include one or more network communication devices coupled to the data processing devices. For example, the system may include a local network switch to couple the data processing devices to a network.
[0026] In general, in one aspect, a method includes receiving a request for a data stream from a client. One of a plurality of nodes may be designated to provide control information for transmitting the data stream to the client.
The data stream may be transmitted as a sequence of data segments. Each of the data segments may be transmitted to the client in one or more data packets. [0027] Transmitting the data stream to the client may include transmitting a first data segment from a first node to the client according to a scheduler on the first node, and subsequently transmitting a second data segment from a second node to the client according to a scheduler module of the second node. Each of the nodes may include one or more data processing devices.
[0028] The method may further include transmitting control information from the node designated to provide control information to the first node. At least some of the control information may be provided to the scheduler of the first node. The scheduler may schedule transmission of the first data segment using the control information. [0029] The method may also include transmitting a third data segment from the node designated to provide control information. [0030] The control information may be provided according to the real-time streaming protocol (RTSP) . Data may be transmitted according to the real-time transport protocol (RTP) .
[0031] In general, in one aspect, a system includes a controller module to transmit a request for a data stream to a server having a plurality of nodes. The controller module may be configured to receive the data stream as a sequence of data segments from more than one of the plurality of nodes. The data segments may be received in one or more data packets.
[0032] The controller may include a real-time streaming protocol (RTSP) module. The controller may include a real-time transport protocol (RTP) module.
[0033] The system may also include a buffer to store at least some of the data segments. The system may also include a decoder to decode the data.
[0034] The system may include a module to determine whether there is a gap in the local sequence numbers of received data packets, where the local sequence number indicates the source node of the data packet. The system may include a memory to store local sequence numbers of packets received by the controller. The system may include a module to determine a particular node corresponding to a gap in the local sequence numbers. The system may include a module to send a retransmission request to the particular server node. The module may be included in the controller. [0035] The system may further include a user interface module. The system may further include a playback module. The system may further include one or more speakers and/or one or more displays for presenting the data stream to a user. [0036] In general, in one aspect, a method includes requesting a first data stream including a first segment of continuous media data to be presented to a user. The method may further include requesting a second data stream, the second data stream including a second segment of different continuous media data, the second segment to be presented to the user in synchronization with the first segment. [0037] The method may further include receiving the first segment from a node of a continuous media server, and receiving the second segment from a different node of the continuous media server. The method may further include decoding the first segment and the second segment. The method may further include presenting the decoded first and second segments to a user at substantially the same time. [0038] In general, in one aspect, a method may include transmitting a request for a data stream to a server including a plurality of nodes. Each of the plurality of nodes may be to store segments of the data stream and to transmit the segments of the data stream in a sequence according to a scheduler module on the respective node. [0039] The method may further include receiving a plurality of data packets from the plurality of nodes, each of the plurality of data packets including at least a portion of one of the segments, as well as a local sequence number indicating which of the plurality of nodes transmitted the respective data packet.
[0040] The method may further include determining whether a data packet was not received by detecting a gap in the local sequence number. The method may further include, if a data packet was not received, determining which of the nodes transmitted the packet that was not received using the local sequence number. The method may further include transmitting a retransmission request to the node that transmitted the data packet that was not received. [0041] Additionally, various systems and techniques are described for using a multi-threshold buffer model to smooth data transmission to a client. In general, in one aspect, a method includes receiving data such as streaming media data from a server transmitting the data at a first transmission rate. At least some of the received data is stored in a buffer. The buffer level is determined at different times. For example, a first buffer level is determined at a time, and a second buffer level is determined at a later time. The different buffer levels are compared to a plurality of buffer thresholds. For example, the first buffer level and the second buffer level are compared to the buffer thresholds to determine if one or more of the buffer thresholds is in the range between the first buffer level and the second buffer level (where the range includes the first buffer level and the second buffer level) . [0042] If at least one threshold is in the range, a second server transmission rate may be determined, based on the at least one threshold. The second server transmission rate may be predetermined (e.g., may be chosen from a list), or may be calculated.
[0043] Information based on the second server transmission rate may be transmitted to the server. For example, the second server transmission rate may be transmitted, or a change in server transmission rate may be transmitted. If the second server transmission rate is not different than the first transmission rate, rate information may or may not be transmitted to the server.
[0044] The second server transmission rate may be based on a difference between a buffer level and a target buffer level. Different methods may be used to determine second server transmission rates, depending on which threshold is in the range from the first buffer level to the second buffer level. For example, a first calculation method may be used to determine the second server transmission rate if a particular threshold is in the range, while a second calculation method may be used if a different threshold is in the range. Alternately, the second server rate may be calculated for a particular threshold, and may be chosen for a different threshold.
[0045] The second server transmission rate may be based on one or more predicted future consumption rates. Future consumption rates may be predicted using one or more past consumption rates. Future consumption rates may be predicted using a prediction algorithm. For example, an average consumption rate algorithm, an exponential average consumption rate algorithm, or a fuzzy exponential average algorithm may be used. One or more weighting factors may be used in the prediction algorithm.
[0046] In general, in one aspect, a method for transmitting data such as continuous media data includes transmitting data at a first transmission rate, receiving a communication from a client including rate change information, and transmitting additional continuous media data at a second transmission rate based on the rate change information. The rate change information may be determined using a plurality of buffer threshold levels.
[0047] The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features and advantages will be apparent from the description and drawings, and from the claims. DRAWING DESCRIPTIONS
[0048] FIG. 1 illustrates pseudorandom data storage and scaling.
[0049] FIG. 2 is a block diagram illustrating an example operational environment.
[0050] FIG. 3 is a schematic of a continuous media system. [0051] FIG. 4 is a schematic of a different implementation of a continuous media system.
[0052] FIG. 5 shows a multi-threshold buffer model. [0053] FIG. 6A shows a continuous media system having a master-slave configuration.
[0054] FIG. 6B shows a continuous media system having a bipartite configuration.
[0055] FIG. 7 is a block diagram of a panoramic video and 10.2 channel audio playback system. [0056] FIG. 8 illustrates a unicast retransmission technique.
[0057] FIG. 9 is a schematic of a system for implementing multi-threshold smoothing.
[0058] FIG. 10 is a schematic of a buffer model including multiple thresholds.
[0059] FIG. 11A illustrates arithmetic threshold spacing. [0060] FIG. 11B illustrates geometric threshold spacing. [0061] FIG. 11C shows the normalized throughput using three threshold spacing schemes.
[0062] FIG. 12 shows process steps for implementing multi- threshold smoothing.
[0063] FIG. 13 is a schematic of a fuzzy exponential average algorithm.
[0064] FIGS. 14A and 14B illustrate membership functions for the variables var and αcr. [0065] FIG. 15 shows the dependence of normalized throughput on the number of thresholds, for three buffer sizes .
[0066] FIGS. 16A to 16C show the dependence of normalized throughput on the size of the prediction window, for three buffer sizes.
[0067] FIG. 17 shows the number of rate changes for different threshold numbers and for three buffer sizes.
[0068] FIGS. 18A through 18C show the number of rate changes for different prediction window sizes, and for three buffer sizes.
DETAILED DESCRIPTION
[0069] The systems and techniques described relate to storing data across multiple storage devices using a pseudorandom number generator, and also to various implementations of continuous media systems and the transfer of pre-recorded or live multimedia streams between a server and a client. Pseudorandom data placement can result in an effectively random block placement and overall uniform distribution, providing load balancing, but without requiring maintenance of a large directory system for the data blocks. A uniform distribution means that the storage devices contain similar numbers of blocks. Blocks can be placed according to a pseudorandom sequence generated by a pseudorandom number generator seeded with a selected number.
With pseudorandom placement, a pseudorandom number X is generated for a data block, and the block can be placed on storage device (X mod N) , where N is the total number of storage devices.
[0070] The pseudorandom numbers in the pseudorandom sequence are effectively random numbers generated by a definite, nonrandom computational process. Because of this, the entire pseudorandom sequence can be regenerated as needed using the pseudorandom number generator and the original seed number. Thus, object retrieval depends on knowing the pseudorandom number generator and seed. A large directory system need not be maintained in order to retrieve the stored data. Such directory systems can become a bottleneck if stored at a central location and can require complex methods of maintaining consistency if stored in multiple locations.
[0071] Additionally, when data blocks are stored pseudorandomly, efficiencies can be obtained when performing scaling operations by minimizing the number of blocks that are redistributed during a scaling operation. This creates an easily scalable storage architecture. A scalable storage architecture allows adding of storage devices to increase storage capacity and/or bandwidth. In its general form, storage scaling also refers to storage device removals when either capacity needs to be conserved or old disks are retired. As used herein, the term "disk" refers to storage devices generally and is not limited to particular disk drive technology or machine-readable media, either magnetic or optical based.
[0072] Redistributing blocks placed in a randomized manner may require less overhead when compared to redistributing blocks placed using a constrained placement technique. For example, with round-robin striping, when adding or removing a disk, almost all the data blocks need to be moved to another disk. In contrast, with a randomized placement, only a fraction of all data blocks need to be relocated. That is, only enough data blocks are moved to fill an appropriate fraction of the new disks. For disk removal, only the data blocks on the removed disk are moved. [0073] After such a scaling operation, whether an addition or removal of one or more storage devices, the overall distribution maintains the load balancing characteristic. The redistribution with randomized placement described herein can ensure that data blocks are still essentially randomly placed after disk scaling in order to balance the load on the multiple disks. Thus the storage architecture described can be fully scalable.
[0074] FIG. 1 illustrates pseudorandom data storage and scaling. Multiple data blocks are distributed over multiple storage devices according to a reproducible pseudorandom sequence at 100. The data blocks can be any related segments of data. The data blocks can be portions of a file or portions of some other data object. Thus, although sometimes discussed in the context of continuous media servers, the systems and techniques described can be applied in other storage system contexts, such as at a storage management level (e.g., the data blocks can be extents). [0075] The pseudorandom sequence may be generated by a standard pseudorandom number generator, such as the generator defined by the standard C language library functions rand and srand. Alternatively, the pseudorandom sequence may be generated by a pseudorandom number generator tailored to a specific application. The pseudorandom sequence provides load balancing across the storage devices. For example, when the data blocks correspond to a single large file, the distribution created by the pseudorandom sequence can result in uniform load on the storage devices over the course of access to the file. In general, the pseudorandom sequence results in the blocks having roughly equal probabilities of residing on any of the storage devices.
[0076] Given a data block number in a sequence of related data blocks, X0 is defined as the random number, with range 0...R, generated by a pseudorandom number generator for this block before any scaling operations (the subscript zero represents zero scaling operations). The initial disk number, D0, on which a block resides can be defined as:
(1) D0 = (X0 mod N0) where N0 is the total number of storage devices after zero scaling operations. The disk number may or may not correspond to a specific disk of that number in a storage system, because various mappings and/or disk offsets may be used to derive a final disk location from the disk number. [0077] To compute the disk number for block i, a function, p_r(), which is defined by the pseudorandom number generator, can be called i times after being seeded with a seed, s, to obtain the number X0 for block i. The seed, s, is preferably unique for each sequence of related data blocks to be stored (e.g., a unique seed for each file). The function, p_r(), returns a b-bit random number in the range of 0...R, where R is 2^b - 1. When reseeded with s, p_r() will produce the identical pseudorandom sequence produced previously for that seed. Table 1 lists parameters and definitions used herein.
TABLE 1
B: total number of data blocks
Nj: total number of storage devices (disks) after j scaling operations
Xj: pseudorandom number associated with a block after j scaling operations (range 0...R)
Dj: disk number on which a block resides after j scaling operations
p_r(): pseudorandom number generator function, which returns a b-bit random number in the range 0...R
s: seed used with p_r(); preferably unique for each sequence of related data blocks
b: number of bits in a number returned by p_r()
R: largest value returned by p_r(), i.e., 2^b - 1
qj: Xj div Nj
rj: Xj mod Nj
zj: minimum fraction of the data blocks to be moved during scaling operation j
RF(): redistribution function applied during scaling operation j
AF(): access function used to locate a block after scaling operation j
REMAPj: function that maps Xj-1 to Xj for scaling operation j
E[nk]: expected number of blocks on disk k
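As an illustration of Equation 1 and the seeded generator described in paragraphs [0075] through [0077], the following minimal C sketch computes the initial disk number D0 for a given block. The function and parameter names are illustrative only, the standard C library functions srand and rand stand in for p_r(), and blocks are assumed here to be numbered starting at 1.

    #include <stdlib.h>

    /* Stand-in for p_r(): returns a pseudorandom number in 0...RAND_MAX. */
    static unsigned long p_r(void)
    {
        return (unsigned long) rand();
    }

    /* Computes D0 = X0 mod N0 (Equation 1) for block i of the sequence
     * seeded with s.  The generator is reseeded and called i times, so the
     * same (s, i) pair always yields the same disk number. */
    unsigned long initial_disk(unsigned long s, unsigned long i, unsigned long n0)
    {
        unsigned long k;
        unsigned long x0 = 0;

        srand((unsigned int) s);
        for (k = 0; k < i; k++)      /* the i-th call yields X0 for block i */
            x0 = p_r();

        return x0 % n0;              /* Equation 1 */
    }

In practice the generator state would typically be advanced incrementally rather than reseeded for every block lookup, but the reproducibility property relied on here is the same.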
[0078] Current storage locations of the data blocks are determined by reproducing the pseudorandom sequence at 110. Access to the data blocks, such as in order to provide continuous media data to clients, is performed according to the determined current storage locations. When a storage scaling operation is initiated, a selected subset of the data blocks is redistributed, and information describing the storage scaling operation is saved. The information can be saved in a storage structure for recording scaling operations and/or can be saved directly into a module used to determine current storage locations of the data blocks. The storage structure can be small enough to be retained entirely in fast memory, thus not requiring an access to slower media, such as a hard disk, to access the scaling operation information. After each scaling operation, an access function incorporates the saved scaling information. [0079] The storage structure can also be used to maintain the seed values and/or the disk location of the first block, if desired. Alternatively, a first block location need not be saved, and the seed values may be calculated from some other saved value. For example, the seed values can be derived from a file name corresponding to the data blocks, such as by using the standard C language library function atoi() to convert the file name to a long unsigned integer. [0080] After one or more scaling operations are performed, determining the current storage locations can involve computing the current storage locations of the data blocks based on the reproduced pseudorandom sequence and the saved scaling operation information. A scaling operation involves the addition or removal of a disk group, which is one or more storage devices. A scaling operation can be initiated by a system administrator and can be performed while a storage system remains online and operational. For example, moving the data blocks can involve first copying the data blocks to their new locations, switching the storage system from using the previous set of disks to using the new set of disks, and then deleting the original block copies, such as during system idle times.
[0081] A scaling operation on a storage system with N disks either adds or removes one disk group. The initial number of disks in the storage system is denoted as N0 and, subsequently, the number of disks after j scaling operations is denoted as Nj. During scaling operation j, a redistribution function, RF(), redistributes the blocks residing on Nj-1 disks onto Nj disks. Consequently, after scaling operation j, a new access function, AF(), is used to identify the location of a block, since its location might have been changed due to the scaling operation. [0082] Scaling up increases the total number of disks, and this means that a minimum of a (Nj - Nj-1)/Nj fraction of all the blocks should be moved onto the added disk(s) in order to maintain load balancing across the disks of the storage system. Likewise, when scaling down, all blocks on the removed disk(s) should be moved and randomly redistributed across the remaining disks to maintain load balancing. These block movements are the theoretical minimum needed to maintain an even load. In the case of scaling up, blocks are only moved from old disk(s) to new disk(s), and in the case of scaling down, blocks are only moved from the removed disk(s) to the non-removed disk(s).
[0083] The original seed used to reproduce the sequence of disk locations should no longer be used in the same manner to reproduce the blocks' new sequence, because this may result in loss of the effectively random placement after a scaling operation. Ideally, a new sequence that maintains the overall randomized placement can be derived using a simple computation, with the least possible movement of blocks and the same original seed, no matter how many scaling operations are performed. Alternatively, the new sequence can be derived using a simple computation to maintain the load balancing and random distribution of data blocks with an upper limit on the number of scaling operations allowed before a full redistribution of all data blocks needs to be performed. The problem can be formally stated as:
Problem 1: Given j scaling operations on N0 disks, find RF() such that:
■ Redistribution Objective One [RO1]: Block movement is minimized during redistribution. Only zj x B blocks should be moved, where:
(2) zj = (Nj - Nj-1) / Nj if scaling operation j is an addition of disks, and zj = (Nj-1 - Nj) / Nj-1 if scaling operation j is a removal of disks,
and B is the total number of data blocks.
■ Redistribution Objective Two [RO2]: Randomization of all object blocks is maintained. Randomization leads to load balancing of all blocks across all disks, where E[n0] ≈ E[n1] ≈ E[n2] ≈ ... ≈ E[n(Nj-1)] ≈ B/Nj, and E[nk] is the expected number of blocks on disk k.
Problem 2: Find the corresponding AF() such that:
■ Access Objective One [AO1]: CPU (Central Processing Unit(s)) and disk I/O (Input/Output) overhead are minimized using a low complexity function to compute a block location.
[0084] In order to maintain load balancing and a randomized distribution of data blocks after scaling operation j, a new pseudorandom number sequence, Xj, should be used to identify and track new block locations. The new sequence, Xj, indicates where blocks should reside after the jth scaling operation in that the block location is derived from Dj, which could either indicate a new location for a block or the previous location of that block. The new sequence, Xj, should be obtainable from the original sequence, X0.
[0085] If a new sequence of Xj's can be found for each scaling operation, then the block location after the jth scaling operation can be found by computing Dj. AF() and RF() can be designed to compute the new Xj random numbers for every block while maintaining the objectives RO1, RO2 and AO1. The random numbers used to determine the location of each block are remapped into a new set of random numbers (one for each block) such that these new numbers can be used to determine the block locations after a scaling operation. The manner in which this remapping is performed depends on whether the scaling operation is an addition of a disk group or a removal of a disk group.
[0086] If the scaling operation is an addition, new storage locations can be determined for the data blocks based on the addition of one or more storage devices at 120. Then, the selected subset of the data blocks that have determined new storage locations on the one or more added storage devices can be moved at 130. Although new storage locations are determined for all the data blocks, only those data blocks that have newly determined storage locations falling on the added storage device(s) are in the selected subset and are thus moved (i.e., selection of the blocks to be moved for an addition is based on which blocks would fall on the new disk(s) if a full redistribution were performed). [0087] Data blocks with newly determined storage locations falling on the current storage devices are left in place, even if the newly determined location for a data block would be a different storage device of the current storage devices. Determining new storage locations for all the data blocks maintains the objective RO2, whereas only moving the data blocks that fall on the new storage device(s) according to the newly determined storage locations maintains the objective RO1.
[0088] By contrast, if the scaling operation is a removal, new storage locations can be determined for the selected subset of the data blocks that reside on the one or more storage devices based on the removal of the one or more storage devices at 140. Then, the selected subset of the data blocks can be moved based on the determined new storage locations at 150. The selected subset is made up of all data blocks residing on the storage device(s) being removed (i.e., selection of the blocks to be moved for a removal is based on which blocks currently reside on the disk(s) to be removed).
[0089] Both objectives RO1 and RO2 can be maintained even though new storage locations are determined only for those data blocks in the selected subset. The new storage locations are determined based on the removal of the storage device(s) because the removal can cause renumbering of the storage devices. For example, if disk 5 of a ten-disk system is removed during scaling operation j, then Dj=7 represents a different physical disk than Dj-1=7. [0090] The information describing the storage scaling operation can be saved at 160. This information includes how many disks were added or removed, and which disk(s) in the case of a removal. The scaling operation information is then used in the future to determine current storage locations at 110. For j scaling operations, there may be up to j+1 X values calculated (X0 to Xj) to determine the location of a data block. But all of these calculations are relatively simple, using mod and div calculations and conditional branching. Thus the objective AO1 is maintained as well.
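The saved scaling information of paragraphs [0078] and [0090] can be kept in a small in-memory structure. The following C sketch is one possible layout; the field names and bounds are illustrative assumptions and are not prescribed by the description above.

    #define MAX_SCALING_OPS    64   /* assumed bound on recorded operations  */
    #define MAX_REMOVED_DISKS  16   /* assumed bound on disks removed per op */

    /* One record per scaling operation: whether disks were added or removed,
     * how many, which disks (for a removal), and the resulting disk count Nj. */
    struct scaling_op {
        int is_addition;                   /* nonzero for an addition            */
        int num_disks_changed;             /* how many disks were added/removed  */
        int disks_after;                   /* Nj after this operation            */
        int removed[MAX_REMOVED_DISKS];    /* disk numbers removed, if any       */
        int num_removed;
    };

    /* The complete scaling history; small enough to hold in fast memory. */
    struct scaling_log {
        int op_count;                      /* j, the number of operations so far */
        struct scaling_op ops[MAX_SCALING_OPS];
    };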
[0091] Each block has a random number, Xj, associated with it, and after a scaling operation, each block has a new random number, Xj+1. Because deriving a block location, such as disk Dj, from a random number Xj is straightforward, the discussion below focuses on finding the random number Xj for each block after a scaling operation. New random numbers can be found using a function, REMAPj, which takes Xj-1 as input and generates Xj for the scaling operation transition from j-1 to j. REMAP0 corresponds to the original pseudorandom generator function.
[0092] REMAP functions are used within both the AF() and RF() functions. In particular, during scaling operation j: if disks are added, then RF() can apply a sequence of REMAP functions (from REMAP0 to REMAPj) to compute Xj for every block on all the disks, which should result in a random selection of blocks to be redistributed to the added disks; and if disks are removed, then RF() can apply a sequence of REMAP functions (from REMAP0 to REMAPj) to compute Xj of every block residing only on those removed disks, which should result in a random redistribution of blocks from the removed disks. Similarly, after scaling operation j, to find the location of block i, AF() can apply a sequence of REMAP functions (from REMAP0 to REMAPj) to compute Xj. Subsequently, RF() and AF() compute the location of a block from its random number, Xj, such as by using Equation 1. That is, the sequence X0, X1, ..., Xj can be used to determine the location of a block after each scaling operation. [0093] The redistribution techniques described can be referred to as SCADDAR (SCAling Disks for Data Arranged Randomly). The objectives RO1, RO2 and AO1 for SCADDAR can be restated as follows. The REMAP functions should be designed such that:
RO1: (Xj-1 mod Nj-1) and (Xj mod Nj) should result in different disk numbers for zj x B blocks (see Equation 2 in RO1) and not more.
RO2: For those Xj's for which Dj-1 ≠ Dj, there should be an equal probability that Dj is any of the newly added disks (in the case of addition operations), or any of the non-removed disks (in the case of removal operations).
AO1: The sequence X0, X1, ..., Xj, and hence Dj, can be generated with low complexity.
[0094] The design of the REMAP function can determine the overall characteristics of a storage system using these techniques. In a bounded approach to the REMAP function, all of the objectives, RO1, RO2 and AO1, are satisfied for up to k scaling operations. As a storage system approaches k scaling operations, the system can be taken offline and a full redistribution performed to get back to an original pseudorandom distribution state, where all final storage locations are based on X0 values. In a randomized approach to the REMAP function, all of the objectives, RO1, RO2 and AO1, can be satisfied no matter how many scaling operations are performed. The resulting storage system can be fully scalable. Moreover, the storage system can be taken offline, and a full redistribution performed, if and as desired, thus providing the system administrator with full flexibility and control.
[0095] The following description covers a bounded approach to the REMAP function. First, REMAPj for deriving Xj after a disk group removal during the jth operation is discussed. Next, REMAPj for deriving Xj after a disk group addition during the jth operation is discussed. In each case, Xj results after remapping Xj-1. The following definition:

Let qj = (Xj div Nj) and rj = (Xj mod Nj), so that Xj = (qj x Nj) + rj,

serves as the underlying basis for computing REMAPj in the discussion below.
[0096] In order to maintain a random distribution, Xj should have a different source of randomness from Xj-1. In bounded SCADDAR, REMAPj uses (Xj-1 div Nj-1) as a new source of randomness even though this results in a smaller range. The shrinking range results in a threshold for the maximum number of performable scaling operations. Equation 3 below defines REMAPj if scaling operation j is a removal of disk(s):

(3) REMAPj: Xj = (qj-1 x Nj) + new(rj-1) in case_a, and Xj = qj-1 in case_b,

where case_a is if rj-1 is not removed, and case_b is otherwise, and where the function new() maps from the previous disk numbers to the new disk numbers, taking into account gaps that might occur from disk removals. [0097] Xj is constructed to contain two retrievable pieces of information: 1) a new source of randomness used for future operations, and 2) the disk location of the block after the jth operation. The new source of randomness is provided by qj-1. In case_a, the block remains in its current location, and thus Xj is constructed using the block's current disk location as the remainder as well as the new source of randomness as the quotient in case of future scaling operations. In case_b, the block is moved according to the new source of randomness.
[0098] For an addition of a disk group during operation j, a certain percentage of blocks are to be moved and are chosen at random depending on how many disks are being added. Again, a new range of random numbers should be used upon each add operation to maintain the overall random distribution. The new source of randomness is provided by (qj-1 div Nj), which still has the shrinking range. Equation 4 below defines REMAPj if scaling operation j is an addition of disk(s):

(4) REMAPj: Xj = ((qj-1 div Nj) x Nj) + rj-1 in case_a, and Xj = ((qj-1 div Nj) x Nj) + (qj-1 mod Nj) in case_b,

where case_a is if (qj-1 mod Nj) < Nj-1, and case_b is otherwise. As before, Xj is constructed to contain the new source of randomness as the quotient and the disk location of the block as the remainder. To uphold RO1, blocks are moved to new disks only if they are randomly selected for the new disk(s); that is, if (qj-1 mod Nj) ≥ Nj-1 for a particular block (i.e., case_b), then that block is moved to an added disk during operation j, the target disk being packaged as the remainder of Xj after division by Nj. After simplifying terms for Equation 4, the result is Equation 5:

(5) REMAPj: Xj = qj-1 - (qj-1 mod Nj) + rj-1 in case_a, and Xj = qj-1 in case_b,

where case_a is if (qj-1 mod Nj) < Nj-1, and case_b is otherwise.
[0099] All the objectives of RF() and AF() are met using the bounded SCADDAR approach. RO1 is satisfied because only those blocks which need to be moved are moved. Blocks either move onto an added disk or off of a removed disk. RO2 is satisfied because REMAPj uses a new source of randomness to compute Xj. AO1 is satisfied because block accesses may only require one disk access per block, and block location is determined through computation using inexpensive mod and div functions instead of disk-resident directories.
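A minimal C sketch of the bounded REMAPj computation, following Equations 3 and 5 above, is given below. The callbacks is_removed() and renumber() are hypothetical stand-ins for the removal test and the new() renumbering function; the block's disk after the operation is Xj mod Nj.

    /* Bounded SCADDAR remapping for one block: maps Xj-1 to Xj for scaling
     * operation j (Equation 3 for a removal, Equation 5 for an addition). */
    unsigned long bounded_remap(unsigned long x_prev,  /* Xj-1 */
                                unsigned long n_prev,  /* Nj-1 */
                                unsigned long n_new,   /* Nj   */
                                int is_removal,
                                int (*is_removed)(unsigned long old_disk),
                                unsigned long (*renumber)(unsigned long old_disk))
    {
        unsigned long q = x_prev / n_prev;   /* qj-1 = Xj-1 div Nj-1 */
        unsigned long r = x_prev % n_prev;   /* rj-1 = Xj-1 mod Nj-1 */

        if (is_removal) {
            if (!is_removed(r))
                return q * n_new + renumber(r);   /* Eq. 3, case_a: block stays     */
            return q;                             /* Eq. 3, case_b: moves to q mod Nj */
        }
        if (q % n_new < n_prev)
            return q - (q % n_new) + r;           /* Eq. 5, case_a: block stays     */
        return q;                                 /* Eq. 5, case_b: moves to an added disk */
    }

The quotient of the returned Xj is the shrinking source of randomness reused by any subsequent operation, which is what bounds the number of scaling operations in this approach.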
[00100] The following description covers a randomized approach to the REMAP function, which satisfies the objectives RO1, RO2 and AO1, and can also allow scaling operations to be performed without bound; although in practice, the repeated mod and pseudorandom function calls may eventually require non-negligible computational resources. In a randomized approach, a pseudorandom number generator is used in conjunction with Xj as the new source of randomness after each scaling operation. The pseudorandom number generator can be the same generator used in the original distribution of blocks, or it can be one or more alternative pseudorandom number generators. [00101] In the randomized SCADDAR approach described below, the same pseudorandom number generator is used throughout, and the current Xj is used to reseed the generator. Thus, p_r(Xj) provides the new source of randomness, and p_r() should be such that any number in the range 0...R can be used as a seed. This can guarantee a b-bit number is used as the quotient for Xj, regardless of the number of scaling operations performed. The seed used for p_r() and the random number returned are assumed to be independent for practical purposes here.
[00102] In randomized SCADDAR, Xj-1 is used as the seed of the pseudorandom number generator to obtain Xj. REMAPj for randomized SCADDAR is constructed in a similar fashion as in bounded SCADDAR, except that p_r(Xj-1) is used as the quotient. Equations 6 and 7 define REMAPj for a removal of disk(s) and an addition of disk(s), respectively:

(6) REMAPj: Xj = (p_r(Xj-1) x Nj) + new(rj-1) in case_a, and Xj = (p_r(Xj-1) x Nj) + (p_r(Xj-1) mod Nj) in case_b,

where case_a is if rj-1 is not removed, and case_b is otherwise;

(7) REMAPj: Xj = (p_r(Xj-1) x Nj) + rj-1 in case_a, and Xj = (p_r(Xj-1) x Nj) + (p_r(Xj-1) mod Nj) in case_b,

where case_a is if (p_r(Xj-1) mod Nj) < Nj-1, and case_b is otherwise.
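A corresponding C sketch of the randomized REMAPj of Equations 6 and 7 is shown below. The standard C srand()/rand() pair stands in for p_r() reseeded with Xj-1, and is_removed()/renumber() are the same hypothetical callbacks as in the bounded sketch; an access function AF() simply applies this remapping once per recorded scaling operation and then takes Xj mod Nj. A production implementation would also guard against overflow of the quotient multiplied by Nj.

    #include <stdlib.h>

    /* Stand-in for p_r(seed): reseed the generator with Xj-1 and draw the
     * b-bit random number used as the new source of randomness. */
    static unsigned long p_r_seeded(unsigned long seed)
    {
        srand((unsigned int) seed);
        return (unsigned long) rand();
    }

    /* Randomized SCADDAR remapping for one block: maps Xj-1 to Xj for
     * scaling operation j (Equation 6 for a removal, Equation 7 for an
     * addition).  The block's disk after the operation is Xj mod Nj. */
    unsigned long randomized_remap(unsigned long x_prev,  /* Xj-1 */
                                   unsigned long n_prev,  /* Nj-1 */
                                   unsigned long n_new,   /* Nj   */
                                   int is_removal,
                                   int (*is_removed)(unsigned long old_disk),
                                   unsigned long (*renumber)(unsigned long old_disk))
    {
        unsigned long r = x_prev % n_prev;       /* rj-1                        */
        unsigned long q = p_r_seeded(x_prev);    /* p_r(Xj-1), fresh randomness */

        if (is_removal) {
            if (!is_removed(r))
                return q * n_new + renumber(r);  /* Eq. 6, case_a: block stays  */
            return q * n_new + (q % n_new);      /* Eq. 6, case_b: block moves  */
        }
        if (q % n_new < n_prev)
            return q * n_new + r;                /* Eq. 7, case_a: block stays  */
        return q * n_new + (q % n_new);          /* Eq. 7, case_b: moves to an added disk */
    }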
[00103] Let Xj denote the random number for a block after j scaling operations. A pseudorandom number generator may be considered ideal if all the Xj's are independent and they are all uniformly distributed between 0 and R. Given an ideal pseudorandom number generator, randomized SCADDAR is statistically indistinguishable from complete reorganization in terms of distribution. Although actual pseudorandom number generators are unlikely to be ideal, simulation of randomized SCADDAR shows that the technique satisfies RO1, RO2 and AO1 for a large number of scaling operations. [00104] FIG. 2 is a block diagram illustrating an example operational environment. One or more clients 200 communicate with a storage system 220 over a network 210. The network 210 provides communication links and may be any communication network linking machines capable of communicating using one or more networking protocols, including a local area network (LAN), metropolitan area network (MAN), wide area network (WAN), enterprise network, virtual private network (VPN), the Internet, etc. [00105] The storage system 220 may be any machine and/or system that stores information and is capable of communicating over the network 210 with other machines coupled with the network 210. The client(s) 200 may be any machines and/or processes capable of communicating over the network 210 with the storage system 220. For example, the storage system 220 may be a CM server (e.g., a video-on-demand system), and the client(s) 200 may be browser applications. The storage system 220 may also be an interactive visualization system, such as a scientific or entertainment visualization system, a file system and/or a database system. In general, the client(s) 200 request resources from the storage system 220, and the storage system 220 provides requested resources to the client(s) 200, if available.
[00106] The storage system 220 includes one or more controllers 230 and one or more storage devices 240. The controller (s) 230 may be configured to perform pseudorandom data placement and/or pseudorandom disk scaling, such as described above. The controller (s) 230 may be one or more processing systems, such as one or more general purpose computers programmed to perform the features and functions described. For additional details regarding various example implementations of the storage system 220, as well as other alternative innovations that may be used in the combined system, see FIGS. 3-18 and the corresponding description below.
[00107] The storage devices 240 may be any storage system that includes discrete storage media that can be accessed separately. The storage devices 240 may be a storage array or a RAID enclosure. The storage devices 240 may be memory devices, either non-volatile memories or volatile memories, or mass storage media (e.g., disk drives, or potentially separate disk zones on a platter), which may be magnetic-based, optical-based, semiconductor-based media, or a combination of these. Thus, a storage device 240 includes at least one machine-readable medium. As used herein, the term "machine-readable medium" refers to any computer program product, apparatus and/or device that could be used to provide information indicative of machine instructions and/or data to the system 220.
[00108] The systems and techniques described above are tailored to homogeneous and/or logical disks, and thus the storage devices 240 should have similar capacity and bandwidth characteristics. When heterogeneous disks are to be used, a mapping between the heterogeneous disks and logical disks can be generated, and the techniques described above can be used with the logical disks. The data blocks can be stored on the heterogeneous disks based on the mapping from the logical disks to the heterogeneous disks.
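The form of the logical-to-physical mapping is left open above. One simple possibility, sketched below in C under the assumption that each physical disk is assigned a whole number of logical disks roughly proportional to its bandwidth, is a lookup table built once at configuration time; all names and bounds here are illustrative, not prescribed.

    #define MAX_LOGICAL_DISKS 1024   /* assumed upper bound on logical disks */

    /* Builds a table mapping each logical disk to a physical disk, giving
     * each physical disk a share of the logical disks proportional to its
     * weight (e.g., its bandwidth).  Returns the number of logical disks. */
    int build_logical_map(const int *weights, int num_physical,
                          int *logical_to_physical /* size MAX_LOGICAL_DISKS */)
    {
        int p, w, count = 0;

        for (p = 0; p < num_physical; p++)
            for (w = 0; w < weights[p] && count < MAX_LOGICAL_DISKS; w++)
                logical_to_physical[count++] = p;

        return count;   /* this count serves as N0 for the placement technique */
    }

The pseudorandom placement and scaling then operate entirely on the logical disk numbers, and a final lookup in logical_to_physical[] yields the physical disk holding a block.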
[00109] Referring to FIG. 3, a continuous media system 300 is shown. A server 305 includes a number of nodes 310, with each node 310 coupled with one or more data storage devices such as storage disks 320. Each node includes one or more data processing devices, such as one or more CPUs or other data processing circuitry. [00110] For example, each node may include a module 311 to retrieve data segments from one of the associated storage disks 320 (e.g., a file input/output module), a module 312 to schedule transmission of the data segments to one or more clients 350 (e.g., a scheduler module), a module 313 to transmit the data segments to clients 350 (e.g., a real-time transport protocol (RTP) module), and optionally a module 314 to provide control information for transmitting a data stream to clients 350 (e.g., a real-time streaming protocol (RTSP) module), where the data stream includes a plurality of data segments stored among nodes 310. In some implementations, modules 311-315 may be implemented at least partially as software. The data segments may include all of a particular block of data stored on a node 310, or a portion of a particular block of data stored on a node 310. [00111] Each node 310 may be coupled to a network 340. For example, each node 310 may include a network interface module 315 (e.g., a network interface card (NIC)) for coupling with a network communication device such as a network switch 330 to connect to clients 350 via network 340.
[00112] Referring to FIG. 4, according to a particular implementation, a continuous media system 400 includes a server 405. Server 405 includes four clustered nodes 410-A through 410-D, where each node includes a Dell PowerEdge 1550 Pentium III 866 MHz PC with 256 MB of memory running Red Hat Linux. The continuous media data are stored on four storage devices 420-A through 420-D, which are each 18 GB Seagate Cheetah hard disk drives connected to the server nodes 410-A through 410-D via Ultra 160 SCSI channels. [00113] Nodes 410-A through 410-D may communicate with each other and send media data via multiple 100 Mb/s Fast Ethernet Network Interface Card (NIC) connections. Server 405 may include a local network switch 430, which may be a Cabletron 6000 switch coupled with either one or two Fast Ethernet lines. Switch 430 is coupled with a network 440; for example, switch 430 is coupled with both a WAN backbone (to serve distant clients) and a LAN environment (to serve local clients) . An IP-based network may be chosen to keep the per-port equipment cost low and for easy compatibility with the public Internet.
[00114] Clients such as client 450 of FIG. 4 may be based on a commodity PC platform, and may run, e.g., Red Hat Linux or Windows NT. Client 450 need not be a PC or computer, but may be any device to receive continuous media data for presentation to or use by a user. For example, client 450 may be a personal data assistant (PDA), and network 440 may be a wireless network.
[00115] Client 450 includes a controller module 451 to enable client 450 to request data and to receive data. For example, controller module 451 may include a Real Time Streaming Protocol (RTSP) controller and a Real-Time Transport Protocol (RTP) controller. Client 450 may also include a user interface 452, a client buffer 453, a playback module 454, and a media decoder 455. Decoder 455 may be coupled with one or more displays 460 and/or one or more speakers 470 for displaying video data and playing back audio data.
[00116] Referring to FIGS. 4 and 5,' buffer 453 may be a circular buffer with a capacity denoted by B. A buffer model 500 may include a number of thresholds 510 that may be used to regulate server transmission rates to ensure that buffer 453 neither underflows nor overflows. [00117] Buffer 453 reassembles variable bit-rate media streams from data included in packets that are received from the server nodes. Note that the data included in a packet need not be exactly a block of data stored on a particular server node. For example, in some implementations, a continuous media file may be stored among storage devices in blocks of data of a particular size, where the blocks may be significantly larger than the amount of data included in a packet. In such a case, multiple packets may be transmitted in order to transmit the entire block of data. [00118] If the server transmits packets at a greater rate than the client consumes them, buffer 453 may exceed its capacity; that is, it may overflow. If the client consumes packets faster than the server transmits them, buffer 453 may empty (underflow or starve) . Buffer underflow or overflow may lead to disruption of the presentation of the data to the user.
[00119] Server-controlled techniques may be used to smooth the consumption rate Rc by approximating a number of constant rate segments. However, such algorithms implemented at the server side may need complete knowledge of Rc as a function of time.
[00120] To better enable work in a dynamic environment, a client-controlled buffer management technique may be used. Referring to FIG. 5, a multi-threshold buffer model 500 may be used with buffer 453 of FIG. 4. Buffer model 500 includes a plurality of buffer levels 510, including an overflow threshold 510-O, an underflow threshold 510-U, and may include a plurality of N intermediate thresholds 510-1 through 510-N. In order to avoid buffer underflow or overflow, the client uses one or more of thresholds 510 to determine an appropriate server sending rate, and then forwards server sending information to the server. Client- controlled buffer management techniques include pause/resume flow control techniques, and multi-threshold flow control techniques.
[00121] Pause/resume Flow Control
[00122] According to the pause/resume scheme, if the data in the buffer reaches threshold 510-O, the data flow from server 405 is paused. The playback will continue to consume data from buffer 453. When the data in buffer 453 reaches watermark 510-U, the delivery of the stream is resumed from server 405. If the delivery rate RN of the data is set correctly, buffer 453 will not underflow while the stream is resumed. A safety margin in both watermarks 510-O and 510-U may be set in order to accommodate network delays.
[00123] Multi-threshold Flow Control
[00124] The inter-packet delivery time Δr is used by schedulers included in nodes 410-A to 410-D to transmit packets to client 450. In an implementation, schedulers use the Network Time Protocol (NTP) to synchronize time across nodes 410-A through 410-D. Using a common time reference and the timestamp of each packet, nodes 410-A through 410-D send packets in sequence at Δr time intervals. Client 450 fine-tunes the Δr delivery rate by updating server 405 with new Δr values based on the amount of data in buffer 453.
[00125] Fine tuning may be accomplished, for example, by using one or more additional intermediate watermarks such as watermarks 510-1 and 510-N of FIG. 5. Whenever the level of data in buffer 453 reaches a watermark, a corresponding Δr speedup or slowdown command is sent, with the goal of preventing buffer starvation or overflow. Buffer 453 is used to smooth out any fluctuations in network traffic or server load imbalance, which could lead to display/playback disruptions. Thus, client 450 may control the delivery rate of received data to achieve smoother delivery, prevent bursty traffic, and keep a fairly constant buffer level. For additional details on systems and techniques that may be used for traffic smoothing, see FIGS. 8-18 and the corresponding description below.
[00126] Client software may need to work with a variety of media types. Client 450 may include a playback module
454. The playback thread interfaces with media decoder
455. Decoder 455 may be hardware and/or software based. [00127] For example, decoder 455 may include a CineCast hardware MPEG decoder, available from Vela Research. The CineCast decoder supports both MPEG-1 and MPEG-2 video, as well as two channel audio. Alternatively, for content including 5.1 channels of Dolby Digital audio (e.g., as used in DVD movies) , decoder 455 may include the Dxr2 PCI card from Creative Technology, which may be used to decompress both MPEG-1 and MPEG-2 video in hardware, as well as to decode MPEG audio and provide a 5.1 channel SP-DIF digital audio output terminal.
[00128] Decoder 455 may include a decoder called DivX;-) for decoding MPEG-4 media. MPEG-4 generally provides a higher compression ratio than MPEG-2. For example, a typical 6 Mb/s MPEG-2 media file may only require a 800 Kb/s delivery rate when encoded with MPEG-4. Using an implementation of a continuous media system where a client included the DivX;-) decoder, an MPEG-4 video stream was delivered at near NTSC quality to a residential client site via an ADSL connection.
[00129] High definition television (HDTV) clients present additional challenges. First, HD media require a high transmission bandwidth. For example, a video resolution of 1920 x 1080 pixels encoded via MPEG-2 results in a data rate of 19.4 Mb/s. Using an open source software decoder called mpeg2dec, frame rates of about 20 frames per second were obtained using a dual-processor 933 MHz Pentium III, using unoptimized code. Using a Vela Research Cinecast HD add-on board, full frame rate high definition video playback (e.g., 30 resp. 60 frames per second) were obtained at a data rate up to about 45 Mb/s. The examples given here are for illustrative purposes only; other decoders, frame rates, and data rates are possible. [00130] Multi-node Server Modes
[00131] Referring to FIGS. 6A and 6B, a continuous media system such as system 300 of FIG. 3 or system 400 of FIG. 4 may run in two modes: master/slave mode (FIG. 6A), or bipartite mode (FIG. 6B). [00132] Master/Slave
[00133] One technique to enable a server application to access storage resources located on multiple nodes is to introduce a distributed file system. An application running on a specific node operates on all local and remote files via a network protocol to the corresponding node (for remote files) .
[00134] Referring to FIG. 6A, a client 650 sends a request for continuous media data to a server 602. A particular node such as a node 610-C is designated as a master node for providing the requested data to client 650. In some implementations, each node 610 may be capable of acting as a master node, while in other implementations, fewer than all of the nodes 610 may be capable of acting as a master node.
If multiple nodes are capable of acting as a master node, one of the capable nodes is designated as the master node for a particular client request; for example, using a round- robin domain name service (RR-DNS) or a load-balancing switch.
[00135] For a particular request, the requested data may be distributed among the nodes 610-A through 610-D to maintain a balanced load. As described more fully below and also above in connection with FIGS. 1 and 2, a pseudorandom distribution may be used to distribute the data and to reduce the overhead required to store and retrieve the desired data. As a result, blocks of the requested data are generally distributed among each of the nodes 610-A through 610-D.
[00136] Master node 610-C brokers the client request to slave nodes 610-A, 610-B, and 610-D. A distributed file system application resident on the master node 610-C, which may include multiple input/output modules, requests and subsequently receives desired data from a distributed file system application resident on each of the slave nodes 610-A, 610-B, and 610-D. Additionally, a scheduler resident on master node 610-C schedules packet transmission to the client for all of the requested data. Thus, all of the data is channeled to client 650 through master node 610-C. [00137] Exemplary software for this technique includes two components: a high-performance distributed file system application, and a media streaming server application. The distributed file system may include multiple file input/output (I/O) modules located on each node. The media streaming server application may include a scheduler, a real-time streaming protocol (RTSP) module, and a real-time protocol (RTP) module. In other implementations, other protocols may be used. Each node 610-A through 610-D runs the distributed file system, while at least some nodes such as node 610-C also run the media streaming server application.
[00138] A particular master server node such as node 610-C is a point of contact for a client such as client 650 during a session. A session may be a complete RTSP transaction for a continuous media stream. When a client requests a data stream using RTSP, it is directed to a master server node which in turn brokers the request to the slave nodes. [00139] An advantage of a distributed file system is that applications need not be aware of the distributed nature of the storage system. Applications designed for a single node may, to some degree, take advantage of the cluster organization. For example, a media streaming server application for implementing a master/slave mode may be based on the Darwin Streaming Server (DSS) project by Apple Computer, Inc. The media streaming server application assumes that all media data are located in a single, local directory. Enhanced with the distributed file system described here, multiple copies of DSS code (each running on its own master node) may share the same media data. This also simplifies client design, since all RTSP control commands may still be sent to only one server node. [00140] Although the master/slave configuration allows for ease of utilizing clustered storage, it may have a number of drawbacks. For example, the master node may become a bottleneck, the master node may be a single point of failure, and there may be heavy inter-node traffic. The master/slave configuration becomes less practical as the number of nodes and/or the number of storage devices is scaled up, since the master node must generally request and receive data from each storage device (for load balancing purposes) . For applications where the drawbacks may limit performance, the bipartite design below may be a better choice.
[00141] Bipartite
[00142] A bipartite configuration may be used rather than a master/slave configuration. In a bipartite configuration there are two groups of nodes, termed a server group and a client group.
[00143] Referring to FIG. 6B, a client 655 transmits a request for data to a server 604. Server 604 includes multiple nodes such as nodes 615-A through 615-D. Rather than having centralized scheduler, RTSP, and RTP server modules (as in the implementation of a master/slave configuration described above) , each node 615 may include a distributed file system, RTSP module, RTP server module, and scheduler.
[00144] In response to a client request for media data, one node (e.g., node 615-C in FIG. 6B) is designated to be the source of control information for providing the requested data to client 655. From the client's point of view, in an implementation using the RTSP and RTP protocols, only the RTSP module is centralized. The RTP application, schedulers, and File I/O modules operate on each node 615-A through 615-D. As a result, each node 615 may retrieve, schedule, and send local data blocks directly to the requesting client (again, note that packets of data transmitted from a node to a client may include less data than the block of data stored on the particular server node) . Therefore, there is no bottleneck of a master node, like there may be using the master/slave configuration. Additionally, inter-node traffic may also be significantly reduced using a bipartite configuration.
[00145] To implement a bipartite configuration, clients need to be able to receive the requested data from multiple nodes, as described below. Additionally, a distributed scheduler was developed to replace the DSS code used in the master/slave configuration. Further, a flow control mechanism was developed to reduce or eliminate the problem of client buffer overflow or starvation. [00146] In the bipartite configuration, each client maintains contact with one RTSP module for the duration of a session, for control related information. Each server node may include an RTSP module, and an RR-DNS or load-balancing switch may be used to decide which RTSP server to contact.- In this configuration, clients may communicate with individual nodes for retransmissions; thus, a simple RR-DNS may not be used to make the server cluster appear as one node. However, the bipartite configuration may be quite robust; if an RTSP server fails, sessions need not be lost. Instead, they may be reassigned to another RTSP server so the delivery of data is generally uninterrupted. [00147] An adapted MPEG-4 file format as specified in MPEG-4 Version 2 may be used for the storage of media blocks. The adaptation of the current system expanded on the MPEG-4 format by allowing compressed media data other than MPEG-4 (for example, MPEG-2) to be encapsulated. [00148] Flow Control
[00149] As described above, different flow control techniques may be used to vary the server transmission rate so that the client buffer neither overflows nor underflows. These techniques include the pause/resume and multi-threshold flow control techniques described above. [00150] Multi-stream Synchronization
[00151] Flow control techniques implemented in a client-server communications protocol allow synchronization of multiple, independently stored media streams. Multi-stream synchronization may be important when, for example, video data and audio data are included in different streams and yet need to be synchronized during playback to the user. [00152] Referring to FIG. 7, a client configuration 700 is shown for an implementation including playback of panoramic, 5 channel video and 10.2 channel audio. The five video channels originate from a 360-degree video camera system such as the FullView model from Panoram Technologies. A first client 750-1 requests and receives the five video channels, where each video channel is encoded into a standard MPEG-2 program stream. First client 750-1 includes a SCSI card. A second client 750-2 requests and receives the 10.2 channels of high-quality, uncompressed audio. Here, the 0.2 of the 10.2 channels refers to two low-frequency channels for playback by, e.g., subwoofers. Second client 750-2 includes a sound card. Note that in other implementations, a single client may request and receive data streams for both video and audio. [00153] Precise playback may be achieved using three levels of synchronization: (1) block-level via retrieval scheduling, (2) coarse-grained via the flow control protocol, and (3) fine-grained through hardware support. The flow control protocol allows approximately the same amount of data to be maintained in the client buffers. The MPEG decoders may be lock-stepped to produce frame-accurate output using multiple CineCast decoders such as decoders 710-1 and 710-2, as well as a timing signal, which may be generated using a genlock timing signal generator device 720. The timing signal is provided to decoders 710-1 and 710-2 (which, in this implementation, include an external trigger input which allows for accurate initiation of playback through software), as well as a trigger unit 730 for the audio data.
[00154] The audio data is provided to an audio system 740, including an audio digital to analog (D/A) converter 741, a pre-amplifier 742, an audio power amplifier 743, and speakers 744. Note that for 10.2 channel audio, speakers 744 include ten speakers and two subwoofers. The video data is provided to a video system 760, including a Panoram realtime video stitching equipment 761 and displayed using a head-mounted display 762, a multi-screen display 763, or one or more other displays.
[00155] As a result, during playback, all of the video streams are rendered in tight synchronization such that the five video frames that correspond to one time instance are accurately combined into a panoramic 3600 x 480 mosaic every 1/30 of a second. The audio playback (here, surround-sound audio) is presented phase-accurately and in synchronization with the video.
[00156] Although the previous example discusses five video channels and 10.2 audio channels, using a client with two 4-channel CineCast decoders and a client with a multi-channel soundcard, up to eight synchronous streams of MPEG-2 video and 16 audio channels have been rendered. Many other implementations are possible. [00157] Data Placement and Scheduling
[00158] Different techniques may be used to assign data blocks in the storage medium. For example, continuous media data may be stored in a magnetic disk drive according to a round-robin sequence or in a random manner. [00159] However, each of these techniques has one or more drawbacks. For example, round-robin placement makes scaling the system up difficult, since most of the data must be redistributed each time a new storage device is added. Additionally, the initial startup latency for an object might be large under heavy loads.
[00160] Using the random approach may reduce the startup latency, and may provide for a more balanced server load. However, the random approach may require storage of a large amount of meta-data: generally, the location of each block Xi is stored and managed in a centralized repository (e.g., tuples of the form <nodez,disky>) .
[00161] The current inventors recognized that by using a pseudorandom block placement, many advantages of the random approach may be obtained, while the disadvantages may be mitigated. With pseudorandom number generators, a seed value initiates a sequence of random numbers. Such a sequence is pseudorandom because it can be reproduced if the same seed value is used. Therefore, using a pseudorandom approach only a seed for each file object is stored, rather than the location of every block. Block locations can always be recomputed, using the stored seed value. Further, since the numbering of the disks is global across the server nodes, blocks will be assigned to random disks across different nodes.
[00162] For additional details on pseudorandom block placement, please see the above-referenced U.S. Patent Application entitled "PSEUDORANDOM DATA STORAGE." [00163] Scalability, Heterogeneity, and Fault-Resilience [00164] The continuous media system described herein is scalable, heterogeneous, and fault resilient. Scalability refers to the ease with which the capacity of a system may be changed. Usually, it refers to the ease with which the capacity may be increased to satisfy growth in user demand and/or increased application demands. Heterogeneity refers to the ability to make use of storage devices and nodes with differing bandwidth and storage capacity characteristics. Fault-resilience refers to the ability of a system to overcome a fault within the system. [00165] The current system may provide for enhanced scalability over prior systems. First, using the pseudorandom block placement method, adding more storage to the system entails moving only a fraction of the stored data. In contrast, when adding or removing a disk in a system using round-robin striping, almost all of the data blocks may need to be relocated. Further, only the new seed may need to be stored. In contrast, the random technique may require storing meta-data for the position of each block.
[00166] Scalability may also be enhanced by using the bipartite mode described herein. Using the bipartite mode, the number of nodes included in a server may be larger than the number of nodes that may practically be used in a master/slave mode. As stated above and illustrated in FIG. 6A, operating a continuous media system using the master/slave mode requires inter-node communication. As the number of nodes is increased, the amount of inter-node communication increases. At some point, the amount of inter-node traffic will exceed the ability of the system to provide the requested data to the client in a timely manner. [00167] In addition, the continuous media system illustrated in FIG. 3 provides a modular design that may easily be expanded. Rather than a single storage device, such as a magnetic disk, multi-disk arrays may be employed. Additionally, multiple nodes may be used, where commodity personal computers (PCs) may be used for one or more of the nodes. As the capability of commodity PCs increases with time, the older PCs may be easily replaced with newer PCs. This modular architecture is both scalable and cost-effective.
[00168] To improve fault-resilience of the current system, a parity-based data redundancy scheme may be used. Using a continuous media system such as system 300 of FIG. 3, a distributed file system may provide a complete view of all data on each node, without the need to replicate individual data blocks. However, in an application where reliability is important, data redundancy may improve the system's ability to provide continuous media data to clients twenty-four hours a day.
[00169] The data redundancy scheme may take advantage of a heterogeneous storage subsystem through a technique called disk merging. Disk merging presents a virtual view of logical disks on top of the actual physical storage system, which may include disks with different bandwidths and storage space. The system's application layers may then assume a uniform characteristic for all of the logical disks. Using this abstraction, conventional scheduling and data placement algorithms may be used. [00170] RTP/UDP and Selective Retransmission [00171] A continuous media system such as system 300 of FIG. 3 may support industry standard real-time protocol (RTP) for the delivery of time-sensitive data. Because RTP transmissions are based on the best effort User Datagram Protocol (UDP), a data packet could arrive out of order at the client or be altogether dropped along the network. To reduce the number of lost RTP data packets, a selective retransmission protocol may be implemented. For example, the protocol may be configured to attempt at most one retransmission of each lost RTP packet only if the retransmitted packet would arrive in time for consumption. [00172] In a continuous media system operating in the bipartite mode described above, an additional problem may arise. If a data packet does not arrive, the client may not know which server node attempted to send it. That is, the client may not know where to direct a retransmission request. Solutions to this problem include having the client compute which server node transmitted the lost packet, as well as having the client broadcast the retransmission request to all the server nodes. [00173] Broadcast Approach
[00174] Rather than sending the retransmission request to a particular node, the request may be broadcast. Broadcasting the packet retransmission request to all of the server nodes generally places less load on the client. Using this technique, the client does not need to determine which node transmitted the lost packet; instead, all of the nodes receive the request, check whether they hold the packet, and either ignore the request or perform a retransmission. Thus, the client remains unaware of the server sub-layers. However, the broadcast approach may waste network bandwidth and increase server load. [00175] Unicast Approach
[00176] A unicast retransmission technique may be more efficient and more scalable than the broadcast technique. In order to send a retransmission request to the appropriate node only, a method of identifying the node is needed. Different methods may be used to identify the appropriate node.
[00177] First, when the continuous media system uses pseudorandom block placement as described above, the client may regenerate the pseudorandom number sequence and thereby determine the appropriate node. Thus, the client may use a small amount of meta-data and bookkeeping to send retransmission requests to the specific server node possessing the requested packet.
[00178] However, this approach may be difficult to implement from a practical standpoint. For example, upgrading server software may require an update of client software on perhaps thousands of clients as well. Additionally, when the system is scaled up or down (i.e., a node is added to or removed from the system), new parameters (e.g., seed numbers for the pseudorandomly distributed data) may need to be propagated to the clients immediately so that the appropriate server node can be correctly identified. Additionally, if the client computation is ahead or behind the server computation (e.g., the total number of packets received does not match the number of packets sent), then future computations will generally be incorrect. This may happen, for example, if the client has a limited memory and packets arrive sufficiently out of sequence.
[00179] An alternative approach follows. Referring to FIG. 8, a process 800 for transmitting portions of a data stream to a client in a sequence includes assigning a node-specific packet sequence number, referred to as a local sequence number (LSN), to a packet (810), in addition to the global sequence number (GSN). The client stores the LSN values for received packets (820), and subsequently determines whether there is a gap in the sequence of LSNs (830). If a gap exists, the client determines the identity of the particular server node that transmitted the lost packet using the missing LSN (840). Subsequently, the client sends a retransmission request to the particular server node (850). [00180] As mentioned previously, server-controlled or client-controlled algorithms may be used for transmission control of streaming continuous media data. Server-controlled techniques may have several disadvantages. For example, they may not work with live streams where only a limited rate history is available. Additionally, they may not adjust to changing network conditions, and they may get disrupted when users invoke interactive commands such as pause, rewind, and fast forward.
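Returning to the unicast retransmission process of FIG. 8, the per-node bookkeeping can be sketched in C as follows. This is a simplified illustration: it assumes in-order delivery within each node's stream, omits the timing test that limits each packet to at most one retransmission attempt, and uses request_retransmission() as a hypothetical helper standing in for the client's feedback message to a node.

    #include <stdio.h>

    #define NUM_NODES 4   /* illustrative cluster size */

    /* Next local sequence number (LSN) expected from each server node. */
    static unsigned long expected_lsn[NUM_NODES];

    /* Hypothetical helper: a real client would send a retransmission request
     * for the missing LSN directly to the identified node (840, 850). */
    static void request_retransmission(int node_id, unsigned long missing_lsn)
    {
        printf("request retransmission of LSN %lu from node %d\n",
               missing_lsn, node_id);
    }

    /* Called for every received packet; node_id and lsn are carried in the
     * packet alongside the global sequence number (810, 820, 830). */
    void on_packet(int node_id, unsigned long lsn)
    {
        while (expected_lsn[node_id] < lsn)        /* gap in this node's LSNs */
            request_retransmission(node_id, expected_lsn[node_id]++);

        if (lsn == expected_lsn[node_id])
            expected_lsn[node_id] = lsn + 1;       /* in-order packet */
        /* an lsn below the expected value is a duplicate or a retransmission */
    }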
[00181] Client-controlled algorithms may be a better choice in a dynamic environment. A client-controlled technique may more easily adapt to changing network conditions. In addition, a simpler and more flexible architecture may be used, since the server does not need to be aware of the content format of the stream. Therefore, new media types such as "haptic" data can automatically be supported without modification of the server software.
[00182] However, available client-controlled techniques have a number of drawbacks, including feedback overhead and response delays. Available techniques may not adapt sufficiently quickly to avoid buffer starvation or overflow. [00183] The present application is directed to systems and techniques for providing continuous media data to end users effectively. Using multi-threshold flow control (MTFC) , the current systems and techniques may avoid buffer overflow or starvation, even when used in a bursty environment such as a VBR environment. Unlike some available techniques, a-priori knowledge of the actual bit rate for the media stream is not necessary.
[00184] The current systems and techniques may be implemented in LAN, WAN, or other network environments, with a range of buffer sizes and prediction windows. Referring to FIG. 9, a system 900 includes a server 910, a network 920, and a client 930. For a continuous media system that may be used, see FIGS. 3-8 and the corresponding description above. For information on a data placement technique that may be used with the current systems and techniques, see FIGS. 1-2 and the corresponding description above. [00185] Client 930 includes a client buffer 940, with a capacity equal to B. Client buffer 940 may be used to store data prior to decoding and display/playback. For example, when client 930 is receiving streaming video data from server 910, buffer 940 stores data to be subsequently decoded by a media decoder and displayed to the user. If buffer 940 overflows, some of the data may be lost, and there may be a "hiccup" in the display. Similarly, if buffer 940 empties (i.e., "starves"), there may be a hiccup in the display until additional data is received. Therefore, managing the buffer level is important to providing a high quality display or playback to an end user. [00186] Client 930 (and/or associated machines) also includes circuitry and/or software for implementing multi-threshold flow control. For example, client 930 can receive streaming media data from the server (i.e., client 930 has a network connection), can store at least some of the data in buffer 940 prior to decoding using a decoder (e.g., implemented in hardware and/or software), can determine the buffer level at different times, and can determine whether one or more thresholds has been passed since a previous determination of a buffer level. Client 930 also includes circuitry and/or software to implement a prediction algorithm, and to determine a new server sending rate and/or rate change, and to transmit the server transmission information to server 910.
[00187] Similarly, server 910 (and/or one or more associated machines) includes circuits and/or software for transmitting continuous media data to one or more clients, for receiving communications from the one or more clients, and for updating a server transmission rate based on transmission information contained in a communication from the one or more clients. [00188] Buffer Model
[00189] Referring to FIG. 10, a model 1000 of a buffer includes a number of watermarks 1010. Watermarks 1010 include an underflow protection watermark 1010-U and an overflow protection watermark 1010-O. Watermarks 1010 also include one or more intermediate watermarks Wi such as watermarks 1010-1 to 1010-N of FIG. 10.
[00190] Watermark 1010-U is set at an underflow threshold protection level; that is, a percentage of the buffer capacity that indicates that buffer starvation may be imminent. Watermark 1010-1 marks a low buffer warning threshold. When the buffer level falls below watermark 1010-1, the buffer is nearing starvation. [00191] Similarly, watermark 1010-O is set at an overflow threshold protection level; that is, a percentage of the buffer capacity that indicates that buffer overflow may be imminent. The overflow threshold protection level may be the same as or different from the underflow threshold protection level. Watermark 1010-N is the overflow buffer warning threshold. [00192] Using a model such as buffer model 1000 may allow smooth streaming of media data from the server to the client. The number of intermediate watermarks N may be varied to provide greater control over the buffer level (larger N) or to provide less control with fewer rate adjustments (smaller N). The watermark spacing may be equidistant, based on an arithmetic series (see FIG. 11A), based on a geometric series (see FIG. 11B), or may be set using a different method.
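A minimal Python sketch of the three spacing schemes follows; it is illustrative only and not part of the original disclosure. The equidistant case matches Equation (8) below; the arithmetic and geometric cases assume that the gaps between consecutive watermarks grow as arithmetic and geometric series, respectively, which is one plausible reading of FIGS. 11A and 11B.

    def equidistant_watermarks(w_u, w_o, n):
        # Equidistant spacing; cf. Equation (8) below.
        return [w_u + i * (w_o - w_u) / (n + 1) for i in range(1, n + 1)]

    def arithmetic_watermarks(w_u, w_o, n):
        # Assumed reading of FIG. 11A: the N+1 gaps grow as 1, 2, ..., N+1 units.
        unit = (w_o - w_u) / ((n + 1) * (n + 2) / 2.0)
        marks, level = [], w_u
        for i in range(1, n + 1):
            level += i * unit
            marks.append(level)
        return marks

    def geometric_watermarks(w_u, w_o, n, ratio=2.0):
        # Assumed reading of FIG. 11B: the N+1 gaps grow as a geometric series.
        unit = (w_o - w_u) / sum(ratio ** i for i in range(n + 1))
        marks, level = [], w_u
        for i in range(n):
            level += unit * ratio ** i
            marks.append(level)
        return marks

For example, with w_u and w_o set to 5% and 95% of an 8 MB buffer as in paragraph [00194] below, equidistant_watermarks(w_u, w_o, 5) returns five evenly spaced intermediate thresholds.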
[00193] FIG. 11C shows the normalized throughput obtained using each of the three above-mentioned spacing methods.
Generally, the performance of each of the spacing methods is somewhat similar, with some poor results for particular combinations of buffer size and spacing method (e.g., for an
8 MB buffer with 17 thresholds, geometric spacing underperforms both equidistant and arithmetic spacing) .
[00194] For equidistant spacing, the underflow and overflow thresholds may first be determined. For example, the underflow threshold may be set as 5% of the buffer capacity, and the overflow threshold may be set as 95% of the buffer capacity.
[00195] The number of intermediate watermarks N may be chosen (e.g., selected or predetermined), with N ≥ 1. More typically, the number of intermediate watermarks is greater than one, for smoother traffic (see, e.g., FIG. 15 and the related discussion, below). Denoting the threshold for the underflow watermark as W_U and the threshold for the overflow watermark as W_O, the thresholds W_i for each of the intermediate watermarks i = 1 through i = N are then given by:
[00196] W_i = W_U + i × (W_O - W_U) / (N + 1)    Equation (8)

[00197] Transmission Smoothing
[00198] Whenever a threshold W_i is crossed, a new server sending rate may be calculated, and a rate adjustment (or equivalently, a new server sending rate) may be sent to the server. For example, when the RTSP protocol is used, the information may be sent to the server using an RTSP feedback command. The action taken may depend on which threshold has been crossed. For example, if the W_O or W_U thresholds are crossed, more aggressive action may be taken than if one of the intermediate thresholds W_i is crossed. Similarly, if the warning thresholds W_1 or W_N are crossed, the action taken may be more aggressive than if a different intermediate threshold W_i had been crossed, but may be less aggressive than if W_U or W_O had been crossed. [00199] For example, if the W_O threshold is exceeded, the server may be paused (i.e., the sending rate may be set to zero), or its sending rate substantially decreased. The server may remain paused until the buffer level crosses a particular threshold or a particular value (e.g., the N/2 threshold, or the mid-point of the buffer capacity). [00200] Similarly, if the buffer W_U threshold is crossed, the server sending rate may be increased substantially; for example, it may be increased to about one and a half times the average server sending rate until the buffer level reaches a particular value or threshold. When the intermediate thresholds are crossed, new server sending information may be determined by choosing particular rate change amounts or by calculating new server sending information as described below.
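As a rough illustration of this decision logic (not part of the original disclosure), the following Python sketch classifies a buffer-level observation and maps it to an action; the function names, the sign-change crossing test, and the returned values are assumptions, and the intermediate case would in practice use Equations (9A) and (9B) below.

    def classify_buffer_event(b_prev, b_obs, w_u, w_o, intermediates):
        # Classify what happened to the buffer level since the last sample.
        # intermediates is the ordered list of intermediate thresholds W_1..W_N.
        if b_obs >= w_o:
            return 'overflow'
        if b_obs <= w_u:
            return 'underflow'
        for w in intermediates:
            if (b_prev - w) * (b_obs - w) < 0:   # sign change: W_i was crossed
                return 'intermediate'
        return None

    def react(event, avg_rate, current_rate):
        # Map a crossing event to a new server sending rate.
        if event == 'overflow':
            return 0.0                 # pause the server until the level drains
        if event == 'underflow':
            return 1.5 * avg_rate      # aggressive refill, about 1.5x the average rate
        if event == 'intermediate':
            return current_rate        # placeholder; recompute via Equations (9A)/(9B)
        return current_rate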
[00201] In a simple system, the rate change amounts may be predetermined. For example, in an implementation with five intermediate thresholds W_1-W_5, the interval between packets may be set to 20% less than a default interval for W_1, to 10% less than a default interval for W_2, to the default interval for W_3, to 10% greater than the default interval for W_4, and to 20% greater than the default interval for W_5.
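For this simple scheme, a sketch might look like the following (illustrative only; the 0.01-second default interval is an assumed placeholder):

    # Packet-interval multipliers for the five intermediate thresholds W_1..W_5:
    # a shorter interval between packets corresponds to a faster sending rate.
    INTERVAL_MULTIPLIER = {1: 0.8, 2: 0.9, 3: 1.0, 4: 1.1, 5: 1.2}

    def packet_interval(crossed_threshold, default_interval=0.01):
        # Return the inter-packet interval (in seconds) to request from the
        # server after threshold W_i is crossed.
        return default_interval * INTERVAL_MULTIPLIER[crossed_threshold]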
[00202] Referring to FIG. 12, method steps 1200 for implementing multi-threshold transmission smoothing are shown. Overflow and underflow thresholds may be chosen (1210). The number and spacing of intermediate watermarks may be selected (1220). The spacing may be selected according to one of the spacing schemes described above, or may be set in a different way (e.g., chosen). [00203] In operation, the server transmits streaming media data to a client at a server transmission rate (1230). The client receives the streaming media data and stores at least some of the data in a buffer prior to decoding (1240). At intervals, the buffer level is determined and compared to the previous buffer level to determine whether one or more thresholds has been crossed (1250). If a threshold has been crossed, server transmission information (e.g., a new server transmission rate and/or a rate change) is calculated (1260), and if it is different from the previous server transmission rate, the server transmission information is communicated to the server (1270). Note that the method steps of FIG. 12 need not be performed in the order given. [00204] Techniques to provide data transmission smoothing may use a number of different components and variables. Table 2 includes a list of the parameters used herein.
Table 2
(Table 2, which lists the parameters used in the transmission smoothing technique, appears as images in the original publication and is not reproduced here.)
[00205] Rate Change Computation
[00206] In order to determine an amount by which the server sending rate may be adjusted, the server sending rate, the decoder consumption rate, and the buffer level may be sampled at time intervals equal to Δt_obs. If the observed buffer level b_obs crosses any of the thresholds W_i, a new server sending rate s is computed using Equation (9A) below, and the related rate change Δr is shown in Equation (9B).

[00207] s = [ (B/2 - b_obs) + C - (s_obs × d_feedback) ] / (w_pred × Δt_obs - d_feedback)    Equation (9A)

[00208] Δr = 1 - s / s_obs    Equation (9B)

[00209] Equation (10) below shows how C is related to the predicted future consumption rates r̂_i (the prediction of future consumption rates is discussed more fully below):
[00210] C = Σ_{i=1..w_pred} ( r̂_i × Δt_obs )    Equation (10)
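Assuming the reconstruction of Equations (9A)-(10) given above, a Python sketch of the computation might read as follows; the mid-buffer target B/2 and the variable names are assumptions drawn from the surrounding text rather than a definitive statement of the original equations.

    def predicted_consumption(predicted_rates, dt_obs):
        # C of Equation (10): data the decoder is predicted to consume during
        # the prediction window, given per-sample rate predictions r_hat_i.
        return sum(r_hat * dt_obs for r_hat in predicted_rates)

    def new_sending_rate(buffer_capacity, b_obs, s_obs, predicted_rates,
                         dt_obs, d_feedback):
        # Equation (9A): choose a rate that steers the buffer toward its
        # mid-point by the end of the prediction window, accounting for data
        # still arriving at the old rate s_obs during the feedback delay.
        c = predicted_consumption(predicted_rates, dt_obs)
        w_pred = len(predicted_rates)
        numerator = (buffer_capacity / 2.0 - b_obs) + c - s_obs * d_feedback
        return numerator / (w_pred * dt_obs - d_feedback)

    def rate_change(s_new, s_obs):
        # Equation (9B): relative change with respect to the observed rate.
        return 1.0 - s_new / s_obs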
[00211] When crossing the thresholds W_1 and W_N, the computed rate change Δr may not be sufficient to avoid reaching W_U and W_O, respectively, due to the error margin of the prediction algorithms. Although the error margin may be reduced, doing so adds computational complexity that may not be desired in certain situations.
[00212] An alternative is to add or subtract a mean absolute percentage error (MAPE) from r̂_i, as shown in Equations (11A) and (11B). Equation (11A) shows how an adjusted r̂_i may be calculated when W_N is reached, while Equation (11B) shows how an adjusted r̂_i may be calculated when W_1 is reached.
[00213] r̂_i(adjusted) = r̂_i × (1 - MAPE)    Equation (11A)

[00214] r̂_i(adjusted) = r̂_i × (1 + MAPE)    Equation (11B)
[00215] Equation (12) shows how a MAPE value may be computed. In Equation (12), P is the number of prediction samples up to the current prediction time.
[00216] MAPE = (1/P) × Σ_{i=1..P} ( |r_i - r̂_i| / r_i )    Equation (12)
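A short sketch of this error adjustment follows (illustrative only, based on the reconstructed Equations (11A)-(12) above):

    def mape(actual_rates, predicted_rates):
        # Equation (12): mean absolute percentage error over the P samples
        # predicted so far.
        p = len(actual_rates)
        return sum(abs(r - r_hat) / r
                   for r, r_hat in zip(actual_rates, predicted_rates)) / p

    def adjust_prediction(r_hat, error, near_overflow):
        # Pad the prediction by the error margin when a warning threshold is
        # reached: subtract MAPE near W_N (Equation (11A)) and add it near W_1
        # (Equation (11B)).
        return r_hat * (1.0 - error) if near_overflow else r_hat * (1.0 + error)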
[00217] Consumption Rate Prediction
[00218] Rather than requiring knowledge of the bit rate of the media stream prior to transmission, the current systems and techniques predict a consumption rate, so that live streams (e.g., streams that are being produced and transmitted as the events they depict—such as a live concert or distance learning session—occur) may be provided to end users.
[00219] Consumption rate prediction may observe the w_obs most recent rate samples to predict w_pred samples into the future. For example, if w_obs = 10 and w_pred = 2, the 10 previous rate samples may be used to predict the rate 2 samples into the future. The observation window R includes the w_obs previous rate values <r_1, r_2, ..., r_{w_obs}>, while the prediction window R̂ includes the w_pred predicted rate values <r̂_1, r̂_2, ..., r̂_{w_pred}>. The estimated future rate is denoted r̂.
[00220] Prediction algorithms may be based on a number of different schemes. For example, an average consumption rate algorithm may be used, an exponential average algorithm may be used, or a fuzzy exponential average algorithm may be used.
[00221] An average consumption rate algorithm may predict the average consumption rate of the prediction window R̂ using an average consumption rate of the observation window R, according to Equation (13):
[00222] r̂ = (1/w_obs) × Σ_{i=1..w_obs} r_i    Equation (13)
[00223] An exponential average consumption rate algorithm may be used to give more weight to some samples in the observation window than to others. The smoothed consumption rate parameter for i = 1 is set to SCR[1] = r_1, while the remainder of the SCR[i] are given by Equation (14) below, where α_cr is a weighting parameter.
[00224] SCR[i + 1] = α_cr × SCR[i] + (1 - α_cr) × r_i, for i = 1, ..., w_obs    Equation (14)
[00225] The estimated future rate is then given by Equation (15) below. [00226] r̂ = SCR[w_obs + 1]    Equation (15) [00227] There are two variations in applying this algorithm to forecast the future consumption rates during the prediction window R̂. The first variation, which will be referred to as the "expanding window exponential average algorithm," predicts r̂_i based on an increasing window <R, r̂_1, r̂_2, ..., r̂_{i-1}> using Equation (15). The expanding window exponential average algorithm increases the window size by one sample each time a new r̂_i is generated. The second variation, which will be referred to as the "sliding window exponential average algorithm," keeps the window size constant and slides the observation window R forward when a new r̂_i is generated.
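A Python sketch of the exponential average prediction and the two windowing variants follows (illustrative only; a fixed α_cr is assumed here, whereas the fuzzy variant described below computes it dynamically):

    def exponential_average(window, alpha_cr):
        # Equations (14)-(15): SCR[1] = r_1, SCR[i+1] = alpha_cr * SCR[i]
        # + (1 - alpha_cr) * r_i; the prediction is SCR[w_obs + 1].
        scr = window[0]
        for r in window:
            scr = alpha_cr * scr + (1.0 - alpha_cr) * r
        return scr

    def predict_window(observed, w_pred, alpha_cr, sliding=True):
        # Forecast w_pred future rates.  The sliding-window variant keeps the
        # window size constant; the expanding-window variant keeps appending
        # each new prediction without dropping old samples.
        window = list(observed)
        predictions = []
        for _ in range(w_pred):
            r_hat = exponential_average(window, alpha_cr)
            predictions.append(r_hat)
            window.append(r_hat)
            if sliding:
                window.pop(0)          # keep the window size equal to w_obs
        return predictions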
[00228] A fuzzy exponential average consumption rate algorithm may be used to generate the r̂_i by combining a fuzzy logic controller with the window exponential average algorithm. Using the fuzzy exponential average algorithm, the parameter α_cr is dynamically calculated. [00229] The parameter α_cr controls the weight given to different samples. When α_cr is large, more weight is given to past samples. When α_cr is small, more weight is given to the more recent samples. Therefore, if the variability in the consumption rate in the system is small (i.e., the bit rate of the stream is fairly constant), the prediction error should be small, and a large α_cr may be used. On the other hand, if the variability is large (e.g., the stream is bursty), a small α_cr is appropriate, so that more recent sample data is weighted more heavily.
[00230] Referring to FIG. 13, a schematic 1300 of the fuzzy exponential average algorithm is shown. Variability information 1310 is provided to a fuzzy logic controller 1320. Fuzzy logic controller 1320 produces weighting factor α_cr 1330 based on the variability information. The value for α_cr 1330 and observation window information 1340 are provided to the exponential average prediction algorithm 1350, which outputs the prediction window information 1360. [00231] The variability of a stream may be characterized by a normalized variance var, calculated according to Equation (16) below.
[00232] var = [ (1/w_obs) × Σ_{i=1..w_obs} (r_i - r̄)² ] / r̄²    Equation (16)

where r̄ denotes the mean consumption rate over the observation window R.
[00233] Referring to FIGS. 14A and 14B, membership functions for the variable var (FIG. 14A) and the variable α_cr (FIG. 14B) are shown. The following fuzzy control rules may be applied to the low, medium, and high regions of the var data and the α_cr data. If var is low, then α_cr is high. If var is high, then α_cr is low. If var is medium, then α_cr is medium. Of course, more complicated schemes may be used.
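A minimal Python sketch of such a controller follows (illustrative only; since FIGS. 14A-14B are not reproduced here, the triangular membership functions, region boundaries, and defuzzification weights are assumptions):

    def triangular(x, left, peak, right):
        # Triangular membership function; returns a membership degree in [0, 1].
        if x <= left or x >= right:
            return 0.0
        if x <= peak:
            return (x - left) / (peak - left)
        return (right - x) / (right - peak)

    def fuzzy_alpha_cr(var):
        # Rules from the text: var low -> alpha_cr high, var medium ->
        # alpha_cr medium, var high -> alpha_cr low.
        low = triangular(var, -0.5, 0.0, 0.5)
        medium = triangular(var, 0.0, 0.5, 1.0)
        high = triangular(var, 0.5, 1.0, 1.5)
        total = low + medium + high
        if total == 0.0:
            return 0.5
        # Defuzzify with a weighted average of representative alpha_cr values.
        return (low * 0.9 + medium * 0.5 + high * 0.1) / total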
[00234] Feedback Message Delay

[00235] The round-trip feedback message delay (d_feedback) is an important factor in the transmission rate smoothing. The delay may be configured to be a conservatively estimated constant delay, or may be based on one or more measurements.
The delay may be estimated dynamically, based on a prediction algorithm, to more closely reflect the transmission delay in the network.
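As one illustration of a dynamic estimate (not part of the original disclosure), d_feedback could be tracked by exponentially smoothing measured feedback round-trip times; the smoothing weight and the initial value below are assumptions.

    class FeedbackDelayEstimator:
        # Running estimate of the round-trip feedback delay d_feedback.
        def __init__(self, initial_delay=0.2, weight=0.8):
            self.estimate = initial_delay    # conservative initial guess, in seconds
            self.weight = weight

        def update(self, measured_rtt):
            # Blend a new round-trip measurement into the current estimate.
            self.estimate = (self.weight * self.estimate
                             + (1.0 - self.weight) * measured_rtt)
            return self.estimate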
[00236] The systems and techniques described herein can provide a number of benefits for streaming media data. Referring to FIG. 15, the normalized throughput standard deviation is shown for transmission of media data for the movie Twister, using a prediction window size of 90 seconds, three different buffer sizes (8, 16, and 32 MB), and for six different transmission methods. In the first (bar to the far right), complete information about the bit rate is known. In the second through fifth, MTFC was implemented with different numbers of intermediate thresholds. In the last, no smoothing was implemented.
[00237] As FIG. 15 demonstrates, using MTFC according to the current systems and techniques provides a significant benefit over unsmoothed transmission of streaming data, with larger numbers of thresholds generally corresponding to smoother traffic. FIG. 15 also demonstrates that providing a larger buffer provides for smoother traffic.
[00238] FIGS. 16A through 16C show the effect of varying the prediction window, for buffer sizes of 8 MB, 16 MB, and 32 MB, and for various numbers of intermediate thresholds. In FIGS. 16A-16C, larger prediction windows provide smoother traffic for larger buffer sizes, but not for smaller buffer sizes. This may be due to the fact that, for the same number of thresholds, the change in buffer level that triggers a change in server sending rate is much smaller (e.g., for a set number of thresholds, the "distance" between the thresholds in an 8 MB buffer is about 1/4 of the distance between the thresholds in a 32 MB buffer) . Therefore, with a larger buffer (with more distance between thresholds) there may be longer segments at a constant rate than with a smaller buffer having the same number of thresholds.
[00239] Feedback messages from the client to the server introduce overhead. In order to reduce consumption of network resources for control purposes, the overhead may be reduced by reducing the number of rate changes. However, there may be a trade-off between the number of rate changes and the smoothness of the traffic.
[00240] Referring to FIG. 17, an increase in the number of thresholds for a particular buffer size increases the number of rate changes (but also may provide for smoother traffic; see, e.g., FIG. 15). Additionally, for the same number of thresholds, larger buffers have fewer rate changes. Referring to FIGS. 18A-18C, a longer prediction window generally results in fewer rate changes. [00241] Various implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, configured to receive and/or transmit data and instructions, at least one input device, and at least one output device.
[00242] The various implementations described above have been presented by way of example only, and not limitation. For example, the logic flows depicted in figures do not require the particular order shown, or sequential order, to achieve desirable results. The particular hardware and/or software discussed here is only exemplary. The number of nodes, the node architecture, the amount of memory, the type and capacity of storage, and the operating system may be different. Different schedulers, decoders, media types, and/or flow control schemes may be used. Different client types may be used. Moreover, different buffer sizes, threshold numbers, prediction window sizes, etc. may be used. Other embodiments may be within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:
1. A method comprising: distributing data blocks over multiple storage devices according to a reproducible pseudorandom sequence that provides load balancing across the storage devices; and determining current storage locations of the data blocks by reproducing the pseudorandom sequence.
2. The method of claim 1, further comprising, in response to initiation of a storage scaling operation, redistributing a selected subset of the data blocks and saving information describing the storage scaling operation, wherein said determining current storage locations comprises computing the current storage locations of the data blocks based on the reproduced pseudorandom sequence and the saved scaling operation information.
3. The method of claim 2, further comprising accessing the data blocks according to the determined current storage locations.
4. The method of claim 2, wherein the storage scaling operation comprises addition of one or more storage devices, and redistributing the selected subset of the data blocks comprises: determining new storage locations for the data blocks based on the addition of the one or more storage devices; and moving the selected subset of the data blocks that have determined new storage locations on the one or more added storage devices.
5. The method of claim 2, wherein the storage scaling operation comprises removal of one or more storage devices, and said redistributing the selected subset of the data blocks comprises: determining new storage locations for the selected subset of the data blocks that reside on the one or more storage devices based on the removal of the one or more storage devices; and moving the selected subset of the data blocks based on the determined new storage locations.
6. The method of claim 2, further comprising transmitting data comprising at least one of the data blocks being accessed according to the determined current storage locations, wherein said redistributing the selected subset of the data blocks comprises, during said transmitting: copying the selected subset of the data blocks to newly determined storage locations on a new set of storage devices comprising at least one of the multiple storage devices; switching to the new set of storage devices; and deleting unused data blocks during idle times of said transmitting.
7. The method of claim 2, wherein said redistributing the selected subset of the data blocks comprises determining new storage locations based on the pseudorandom sequence used as input to a pseudorandom number generator.
8. The method of claim 7, wherein the pseudorandom number generator is used to generate the pseudorandom sequence.
9. The method of claim 8, further comprising generating the pseudorandom sequence by seeding the pseudorandom number generator with a number derived from an object name corresponding to the data blocks.
10. The method of claim 7, further comprising providing continuous media data to clients.
11. A machine-readable medium embodying information indicative of instructions for causing one or more machines to perform operations comprising: distributing data blocks over multiple storage devices according to a reproducible pseudorandom sequence that provides load balancing across the storage devices; and determining current storage locations of the data blocks by reproducing the pseudorandom sequence.
12. The machine-readable medium of claim 11, wherein the operations further comprise, in response to initiation of a storage scaling operation, redistributing a selected subset of the data blocks and saving information describing the storage scaling operation, and wherein said determining current storage locations comprises computing the current storage locations of the data blocks based on the reproduced pseudorandom sequence and the saved scaling operation information.
13. The machine-readable medium of claim 12, wherein the operations further comprise accessing the data blocks according to the determined current storage locations.
14. The machine-readable medium of claim 12, wherein the storage scaling operation comprises addition of one or more storage devices, and redistributing the selected subset of the data blocks comprises: determining new storage locations for the data blocks based on the addition of the one or more storage devices; and moving the selected subset of the data blocks that have determined new storage locations on the one or more added storage devices.
15. The machine-readable medium of claim 12, wherein the storage scaling operation comprises removal of one or more storage devices, and said redistributing the selected subset of the data blocks comprises: determining new storage locations for the selected subset of the data blocks that reside on the one or more storage devices based on the removal of the one or more storage devices; and moving the selected subset of the data blocks based on the determined new storage locations.
16. The machine-readable medium of claim 12, wherein the operations further comprise transmitting data comprising at least one of the data blocks being accessed according to the determined current storage locations, and wherein said redistributing the selected subset of the data blocks comprises, during said transmitting: copying the selected subset of the data blocks to newly determined storage locations on a new set of storage devices comprising at least one of the multiple storage devices; switching to the new set of storage devices; and deleting unused data blocks during idle times of said transmitting.
17. The machine-readable medium of claim 12, wherein said redistributing the selected subset of the data blocks comprises determining new storage locations based on the pseudorandom sequence used as input to a pseudorandom number generator.
18. The machine-readable medium of claim 17, wherein the pseudorandom number generator is used to generate the pseudorandom sequence.
19. The machine-readable medium of claim 18, wherein the operations further comprise generating the pseudorandom sequence by seeding the pseudorandom number generator with a number derived from an object name corresponding to the data blocks.
20. The machine-readable medium of claim 17, wherein the operations further comprise providing continuous media data to clients.
21. A method comprising: distributing data blocks over multiple storage devices according to a reproducible pseudorandom sequence; in response to initiation of a storage scaling operation, pseudorandomly redistributing a selected subset of the data blocks and saving information describing the storage scaling operation; determining current storage locations based on the pseudorandom sequence and the saved scaling operation information; and accessing the data blocks according to the determined current storage locations.
22. The method of claim 21, wherein said pseudorandomly redistributing comprises: seeding a pseudorandom number generator, used to generate the pseudorandom sequence, with one or more numbers from the pseudorandom sequence; and determining one or more new storage locations based on output of the pseudorandom number generator.
23. The method of claim 21, wherein pseudorandomly redistributing comprises pseudorandomly redistributing the selected subset of the data blocks while transmitting data comprising at least one of the data blocks being accessed according to the determined current storage locations.
24. The method of claim 21, wherein the storage scaling operation comprises addition of one or more storage devices, and pseudorandomly redistributing the selected subset of the data blocks comprises: determining new storage locations for the data blocks based on the addition of the one or more storage devices and based on output of a pseudorandom number generator seeded with one or more numbers from the pseudorandom sequence; and moving the selected subset of the data blocks that have determined new storage locations on the one or more added storage devices.
25. The method of claim 21, wherein the storage scaling operation comprises removal of one or more storage devices, and pseudorandomly redistributing the selected subset of the data blocks comprises: determining new storage locations for the selected subset of the data blocks that reside on the one or more storage devices based on the removal of the one or more storage devices and based on output of a pseudorandom number generator seeded with one or more numbers from the pseudorandom sequence; and moving the selected subset of the data blocks based on the determined new storage locations.
26. The method of claim 21, further comprising generating the pseudorandom sequence by seeding a pseudorandom number generator with a number derived from an object name corresponding to the data blocks.
27. The method of claim 21, further comprising providing continuous media data to clients, said providing continuous media data comprising said accessing the data blocks.
28. A machine-readable medium embodying information indicative of instructions for causing one or more machines to perform operations comprising: distributing data blocks over multiple storage devices according to a reproducible pseudorandom sequence; in response to initiation of a storage scaling operation, pseudorandomly redistributing a selected subset of the data blocks and saving information describing the storage scaling operation; determining current storage locations based on the pseudorandom sequence and the saved scaling operation information; and accessing the data blocks according to the determined current storage locations.
29. The machine-readable medium of claim 28, wherein said pseudorandomly redistributing comprises: seeding a pseudorandom number generator, used to generate the pseudorandom sequence, with one or more numbers from the pseudorandom sequence; and determining one or more new storage locations based on output of the pseudorandom number generator.
30. The machine-readable medium of claim 28, wherein pseudorandomly redistributing comprises pseudorandomly redistributing the selected subset of the data blocks while transmitting data comprising at least one of the data blocks being accessed according to the determined current storage locations.
31. The machine-readable medium of claim 28, wherein the storage scaling operation comprises addition of one or more storage devices, and pseudorandomly redistributing the selected subset of the data blocks comprises: determining new storage locations for the data blocks based on the addition of the one or more storage devices and based on output of a pseudorandom number generator seeded with one or more numbers from the pseudorandom sequence; and moving the selected subset of the data blocks that have determined new storage locations on the one or more added storage devices.
32. The machine-readable medium of claim 28, wherein the storage scaling operation comprises removal of one or more storage devices, and pseudorandomly redistributing the selected subset of the data blocks comprises: determining new storage locations for the selected subset of the data blocks that reside on the one or more storage devices based on the removal of the one or more storage devices and based on output of a pseudorandom number generator seeded with one or more numbers from the pseudorandom sequence; and moving the selected subset of the data blocks based on the determined new storage locations.
33. The machine-readable medium of claim 28, wherein the operations further comprise generating the pseudorandom sequence by seeding a pseudorandom number generator with a number derived from an object name corresponding to the data blocks.
34. The machine-readable medium of claim 28, wherein the operations further comprise providing continuous media data to clients, said providing continuous media data comprising said accessing the data blocks.
35. A system comprising: one or more storage devices; and one or more controllers configured to pseudorandomly place data blocks on the one or more storage devices, to perform pseudorandom scaling of the one or more storage devices, and to access the data blocks based on information describing prior pseudorandom scaling.
36. The system of claim 35, further comprising a continuous media server comprising the one or more controllers.
37. The system of claim 36, wherein the one or more storage devices comprise two or more hard drives.
38. A system comprising: means for randomized storing of data blocks without maintaining a directory system identifying locations of all the data blocks; and means for randomized redistributing of the data blocks such that block movement is minimized.
39. The system of claim 38, further comprising means for transmitting data comprising the data blocks, wherein said means for transmitting and said means for randomized redistributing operate simultaneously.
40. The system of claim 39, wherein the means for randomized storing and the means for randomized redistributing use a single pseudorandom number generator.
PCT/US2003/002282 2002-01-24 2003-01-24 Pseudorandom data storage WO2003063423A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US35165602P 2002-01-24 2002-01-24
US60/351,656 2002-01-24
US35199802P 2002-01-25 2002-01-25
US35207102P 2002-01-25 2002-01-25
US60/352,071 2002-01-25
US60/351,998 2002-01-25

Publications (2)

Publication Number Publication Date
WO2003063423A1 true WO2003063423A1 (en) 2003-07-31
WO2003063423A9 WO2003063423A9 (en) 2004-05-21

Family

ID=27617587

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/002282 WO2003063423A1 (en) 2002-01-24 2003-01-24 Pseudorandom data storage

Country Status (1)

Country Link
WO (1) WO2003063423A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005022840A1 (en) * 2003-08-22 2005-03-10 Siemens Aktiengesellschaft Method for adapting the data processing priority of several logical transmission channels, associated communication terminal and radio network
US7328228B2 (en) 2003-09-02 2008-02-05 Sap Aktiengesellschaft Mapping pseudo-random numbers to predefined number ranges
WO2015187576A1 (en) 2014-06-02 2015-12-10 Micron Technology, Inc Systems and methods for throttling packet transmission in a scalable memory system protocol
CN108139869A (en) * 2015-10-08 2018-06-08 罗伯托焦里有限公司 The backup method and system of DYNAMIC DISTRIBUTION
JP2021022241A (en) * 2019-07-29 2021-02-18 株式会社日立製作所 Storage system and node management method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6266753B1 (en) * 1997-07-10 2001-07-24 Cirrus Logic, Inc. Memory manager for multi-media apparatus and method therefor
US6237063B1 (en) * 1997-10-06 2001-05-22 Emc Corporation Load balancing method for exchanging data in different physical disk storage devices in a disk array storage device independently of data processing system operation

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005022840A1 (en) * 2003-08-22 2005-03-10 Siemens Aktiengesellschaft Method for adapting the data processing priority of several logical transmission channels, associated communication terminal and radio network
US7328228B2 (en) 2003-09-02 2008-02-05 Sap Aktiengesellschaft Mapping pseudo-random numbers to predefined number ranges
WO2015187576A1 (en) 2014-06-02 2015-12-10 Micron Technology, Inc Systems and methods for throttling packet transmission in a scalable memory system protocol
EP3149599A4 (en) * 2014-06-02 2018-01-03 Micron Technology, Inc. Systems and methods for throttling packet transmission in a scalable memory system protocol
CN108139869A (en) * 2015-10-08 2018-06-08 罗伯托焦里有限公司 The backup method and system of DYNAMIC DISTRIBUTION
JP2021022241A (en) * 2019-07-29 2021-02-18 株式会社日立製作所 Storage system and node management method
US11237746B2 (en) 2019-07-29 2022-02-01 Hitachi, Ltd. Storage system and node management method
JP7050034B2 (en) 2019-07-29 2022-04-07 株式会社日立製作所 Storage system and node management method

Also Published As

Publication number Publication date
WO2003063423A9 (en) 2004-05-21

Similar Documents

Publication Publication Date Title
US7742504B2 (en) Continuous media system
Shahabi et al. Yima: a second-generation continuous media server
US7039784B1 (en) Video distribution system using dynamic disk load balancing with variable sub-segmenting
AU2011226633B2 (en) Parallel streaming
US7143433B1 (en) Video distribution system using dynamic segmenting of video data files
US8522290B2 (en) Video on demand system and methods thereof
KR101484900B1 (en) Audio splitting with codec-enforced frame sizes
KR20150042191A (en) Methods and devices for bandwidth allocation in adaptive bitrate streaming
US10728304B2 (en) Redundancy control in streaming content packager pools
Ding et al. Peer-to-peer video-on-demand with scalable video coding
US20130064305A1 (en) Localized redundancy for fragment processing
US20130064286A1 (en) Weighted encoder fragment scheduling
Ghandeharizadeh et al. Design and implementation of scalable continuous media servers
Zimmermann et al. Hydra: high-performance data recording architecture for streaming media
WO2003063423A1 (en) Pseudorandom data storage
Zimmerman et al. Retransmission-based error control in a many-to-many client-server environment
Chang et al. Medic: A memory and disk cache for multimedia clients
Ng et al. A multi-server video-on-demand system with arbitrary-rate playback support
Qiang et al. A Hyper-converged video management system based on object storage
Jin et al. Clustered multimedia servers: architectures and storage systems
Song et al. Replication and retrieval strategies for resource-effective admission control in multi-resolution video servers
Liao et al. A new distributed storage scheme for cluster video server
Zimmermann et al. Scalability evaluation of the Yima streaming media architecture
Du et al. Building video‐on‐demand servers
Fu et al. Yima: A Second-Generation Continuous Media Server

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
COP Corrected version of pamphlet

Free format text: PAGES 1/16-16/16, DRAWINGS, REPLACED BY NEW PAGES 1/15-15/15

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP