WO2011002451A1 - Optimizing file block communications in a virtual distributed file system - Google Patents

Optimizing file block communications in a virtual distributed file system

Info

Publication number
WO2011002451A1
Authority
WO
WIPO (PCT)
Prior art keywords
file blocks
file
data nodes
constraint
blocks
Application number
PCT/US2009/049256
Other languages
French (fr)
Inventor
Michael Rhodes
Russell Perry
Eduardo Ceballos
Original Assignee
Hewlett-Packard Development Company, L.P.
Application filed by Hewlett-Packard Development Company, L.P.
Priority to PCT/US2009/049256
Publication of WO2011002451A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers

Definitions

  • a virtual distributed file system may be defined as a file system that is overlaid on a set of other file systems belonging to different file system storage nodes (hereinafter referred to as "nodes").
  • nodes: file system storage nodes.
  • files are split into blocks and the blocks are then distributed across the file systems of participating nodes.
  • a metadata server then records where the blocks are located and is responsible for directing where file blocks are written.
  • An example of a virtual distributed file system is the Hadoop Distributed File System (www.hadoop.apache.org/core/docs/current/hdfs_design.html).
  • Multiple replicas of each block are typically written to separate nodes for redundancy and can be used to improve read rates from the hosts by utilizing parallelism.
  • the read rates are often improved because when a client reads the output file(s), the client typically reads files in parallel from several different nodes, which may include, for instance, computer servers.
  • in reading the output file(s) from the multiple nodes, the client often uses a number of threads of execution, or readers, and a certain number of file blocks are buffered in memory. Buffering of the file blocks enables the file blocks to be read out of order, while allowing them to still be written to an output stream in the correct order. To limit the size of the buffer, blocks cannot be read in just any order. It is preferable to read a block a short time before it will need to be written to the output stream, to thus minimize the time that the block occupies space in the buffer.
  • the client is able to read different blocks from the same file at the same time from different nodes.
  • This simultaneous reading of the file blocks allows a higher speed output stream to be generated even in situations where the nodes have relatively slow performance disks.
  • This arrangement effectively enables multiple low speed storage reads to be multiplexed into a single high speed output stream by the client.
  • the maximum speed-up of the output stream rate over a single read is upper-bounded by the number of reader threads that the client uses.
  • reading the output files from a single location to form a single data stream for further processing has been found to be problematic. This has been found to occur either when a file's blocks need to be combined or when an input file has been first split into several smaller files for parallel processing, but after processing the separate output files need to be recombined prior to creating the single output stream.
  • FIG. 1 depicts a simplified block diagram of a virtual distributed file system, according to an embodiment of the invention
  • FIG. 2 depicts a flow diagram of a method for optimizing a scheduling of file block communications from a plurality of data nodes to a client in a virtual distributed file system, according to an embodiment of the invention
  • FIG. 3 depicts a diagram of a time period during which a particular file block may be communicated from a data node, according to an embodiment of the invention
  • FIG. 4 depicts a chart of an example file block communications schedule, according to an embodiment of the invention.
  • FIG. 5 shows a diagram of a set of decision variables arranged along a time slot, according to an embodiment of the invention
  • FIG. 6 shows a chart of an example file block communications schedule in which contingency periods are provided, according to an embodiment of the invention.
  • FIG. 7 illustrates a computer system, which may be employed to perform various functions of the client and/or the data nodes depicted in FIGS. 1A and 1B in performing some or all of the steps contained in the diagram depicted in FIG. 2, according to an embodiment of the invention.
  • Disclosed herein is a method for optimizing a scheduling of file block communications for a client from a plurality of data nodes in a virtual distributed file system, in which the data nodes contain redundant copies of the file blocks.
  • the optimized scheduling of the file block communications is framed as an optimization-constraint problem, the solution to which provides the optimized scheduling of file block communications subject to the constraints.
  • the scheduling of the file block communications may be considered to be optimized when the amount of time required to communicate the file blocks is minimized subject to one or more constraints.
  • multiple file blocks may substantially concurrently be communicated from multiple data nodes to a client.
  • the client may process the received file blocks to ensure that they are arranged in order, and may output an output stream containing the arranged file blocks.
  • the scheduling of the file block communications may also be considered to be optimized when the output stream is able to be outputted without interruption and with the file blocks being stored in a buffer for a relatively short period of time.
  • One potential application of the method disclosed herein is in high speed printer operations, which are able to consume large amounts of printing data at fast rates.
  • Another potential application of the method disclosed herein is in multi-media applications, such as, streaming video applications.
  • FIG. 1 With reference first to FIG. 1, there is shown a simplified block diagram of a virtual distributed file system 100, according to an example. It should be understood that the virtual distributed file system 100 may include additional components and that one or more of the components described herein may be removed and/or modified without departing from a scope of the virtual distributed file system 100.
  • the virtual distributed file system 100 includes a plurality of data nodes 102a-102n, where n is a value greater than 1.
  • the data nodes 102a-102n, which may comprise computer servers, are depicted as each including a respective hard drive 104 and an optional writer 170.
  • the writer 170 is considered optional for reasons discussed below.
  • the hard drives 104 are depicted as storing various file blocks (B1-BN) of a first file (F1) and file blocks (B1-BN) of a second file (F2).
  • as also shown in FIG. 1, both of the hard drives 104 in a first data node 102a and in a last data node 102n contain at least some of the same file blocks.
  • each file block of each file may be stored in at least two of the data nodes 102a-102n for redundancy and improved read rate purposes.
  • the hard drives 104 may comprise separate devices from the data nodes 102a-102n.
  • the number of files (F) and file blocks (B) shown in the data nodes 102a-102n are for purposes of illustration only and are thus not intended to limit the data nodes 102a-102n in any respect.
  • one or more of the hard drives 104 may store less than all of the file blocks of a particular file.
  • the system 100 is also depicted as including a name node 110, which may comprise a computer server, for instance.
  • the name node 110 is depicted as including a processor 112 and a hard drive 114.
  • the hard drive 114 is depicted as having stored thereon information pertaining to the locations (data node) and identifications (file and file block) of the files (F) and file blocks (B) stored on the hard drives 104 of the data nodes 102a-102n.
  • the processor 112 may query the data nodes 102a-102n at various intervals of time to determine the identifications and locations of the file blocks (B).
  • the data nodes 102a-102n may be configured to submit the identity and location information of the file blocks (B) to the name node 110.
  • the name node 110 may receive the identity and location information of the file blocks (B) concurrently with the storage of the file blocks (B) into the hard drives 104 of the data nodes 102a-102n.
  • the system 100 is also depicted as including a client 120, which may comprise, for instance, a personal computer, a laptop computer, a personal digital assistant (PDA), a mobile telephone, a printing device, etc.
  • the client 120 may also form part of a personal computer, laptop computer, a PDA, a mobile telephone, a printing device, etc.
  • the client 120 is depicted as including a processor 122, an input/output module 124, a plurality of readers 126a- 126n, a buffer 128, and a data store 130.
  • the client 120 may include additional components depending upon the device type of the client 120. For instance, in instances where the client 120 comprises a printing device, the client 120 may include one or more printing components.
  • the system 100 may include any number of clients 120 that may operate individually or concurrently with respect to each other.
  • the system 100 is still further depicted as including an output device 140.
  • the output device 140 may also comprise, for instance, a personal computer, a laptop computer, a personal digital assistant (PDA), a mobile telephone, a printing device, etc.
  • the client 120 is configured to process the file blocks (B) and to output the processed file blocks (B) to the output device 140, as discussed in greater detail herein below.
  • the output device 140 may be optional because in various instances, the client 120 may perform all of the functions of the output device 140.
  • the processor 122 of the client 120 is configured to implement the input/output module 124 to communicate data to and from the name node 110 and, in certain embodiments, to communicate an output stream formed of correctly ordered file blocks (B) to the output device 140.
  • the input/output module 124 may thus include hardware and/or software to enable the processor 122 to connect to the network 150 through a wired or wireless connection.
  • some of the data received through the input/output module 124 may be stored in the data store 130, which may comprise volatile and/or non-volatile memory, such as DRAM, EEPROM, MRAM, flash memory, and the like.
  • the data store 130 may comprise a device configured to read from and write to a removable media, such as, a floppy disk, a CD-ROM, a DVD-ROM, or other optical or magnetic media.
  • the processor 122 is configured to implement two or more of the readers 126a-126n, which may comprise threads of execution, or separate co-ordinated processes, to substantially concurrently read file blocks (B) from the data nodes 102a-102n over one or more time periods.
  • the writers 170 are considered to be optional because the readers 126a-126n are configured to read the file blocks (B).
  • the processor 122 is configured to send a trigger to selected ones of the data nodes 102a-102n for the optional writers 170 to substantially concurrently push the file blocks (B) to the client 120.
  • the readers 126a-126n may be considered to be optional because the writers 170 may directly write the file blocks (B) into the buffer 128.
  • the communicated file blocks (B), which are either written or read as discussed above, are configured to be stored in the buffer 128 and the processor 122 is configured to process the stored file blocks (B) to cause the stored file blocks (B) to be compiled back into a correct order of the file (F) from which the file blocks (B) were derived.
  • the client 120 may output a stream containing the correctly ordered file blocks (B).
  • the stream may comprise a multimedia stream and the output may comprise an audio output and a visual output of the client 120 or the output device 140.
  • the stream may comprise data to be used for a printing operation on the client 120 or the output device 140.
  • the processor 122 is configured to optimize the scheduling of the file block (B) communications from the plurality of data nodes 102a-102n subject to one or more constraints. More particularly, for instance, the processor 122 is configured to derive a schedule for communicating the file blocks (B) from the plurality of data nodes 102a-102n that substantially minimizes the amount of time required to communicate all of the file blocks (B) of one or more files (F) from the plurality of data nodes 102a-102n.
  • the processor 122 is configured to determine a schedule that defines from which data nodes 102a-102n which of the file blocks (B) are to be communicated at each time interval, in which the schedule requires the least amount of time, while complying with or satisfying the one or more constraints.
  • the processor 122 is configured to implement or execute an algorithm containing computer-readable instructions for optimizing the scheduling of the file block communications.
  • the algorithm may be stored on a computer readable storage medium that is either integrated with or external to the client 120, such as the data store 130.
  • the algorithm may be stored on a hardware device, such as an electronic circuit component, that the processor 122 is configured to implement.
  • Various components contained in the system 100 are configured to communicate data to and from each other over a first sub-network 150 and a second sub-network 160, each of which comprises a wired or wireless communication link between the system 100 components.
  • the first sub-network 150 and the second sub-network 160 may each comprise a local area network, a wide area network, the Internet, or a combination thereof.
  • the first sub-network 150 links the client 120 to the data nodes 102a-102n through, for instance, a switch (not shown).
  • the connections between the data nodes 102a-102n and the switch may be relatively slower connections as compared with the connection between the switch and the client 120.
  • the second subnetwork 160 is depicted as linking the client 120 to the output device 140 such that an output stream may be written to the output device 140 through the second subnetwork 160.
  • bandwidth need not be shared during the communication of file blocks from the data nodes 102a-102n and the outputting of the output stream to the output device 140.
  • each of the data nodes 102a-102n, the name node 110, the client 120, and the output device 140 may include hardware and/or software for communicating and receiving data over the network 150.
  • the output device 140 may be connected directly to the client 120, for instance, through a serial or USB connection.
  • the client 120 may comprise a computing device and the output device 140 may comprise a printer attached to the client 120.
  • the client 120 may derive a schedule for communicating the file blocks of one or more files substantially concurrently from multiple ones of the plurality of data nodes 102a-102n in the distributed file system 100.
  • the schedule may be derived such that it enables multiple file blocks to be substantially concurrently communicated while minimizing the amount of time required to communicate the file blocks.
  • the schedule may include communication of the file blocks in a round-robin manner, such that, the file blocks are read in order from available data nodes 102a-102n.
  • the selection of which of the available data nodes 102a-102n from which the file blocks are to be communicated may be based upon, for instance, the access rates for the data nodes 102a-102n, current loading on the data nodes 102a-102n, etc.
  • the schedule may include communication of the file blocks in a "shortsighted" manner, in which a decision on the best file block and data node 102a-102n from which to communicate the file block is made at each decision point.
  • the schedule may include a relatively longer sighted view of the communication of the file blocks that results in the file blocks of each file being communicated in order.
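  • purely as an illustrative sketch (not code from the patent), the round-robin style of schedule described above might look as follows; the helper name round_robin_schedule and the block_locations structure are assumptions made for the example:

```python
def round_robin_schedule(block_locations, num_readers):
    """Assign file blocks, in order, to available data nodes one time slot at a time.

    block_locations: list where entry i is the set of data nodes holding block i.
    num_readers: maximum number of concurrent communications (reader threads) per slot.
    Returns a list of slots; each slot maps a data node to the block index it serves.
    """
    schedule, next_block = [], 0
    while next_block < len(block_locations):
        slot = {}
        while next_block < len(block_locations) and len(slot) < num_readers:
            # Prefer any replica held by a node not already busy in this slot.
            candidates = [n for n in sorted(block_locations[next_block]) if n not in slot]
            if not candidates:
                break  # every node holding this block is busy; defer it to the next slot
            slot[candidates[0]] = next_block
            next_block += 1
        if not slot:
            raise ValueError(f"block {next_block} is not stored on any data node")
        schedule.append(slot)
    return schedule

# Example: 6 blocks, each replicated on 2 of 3 nodes, read with 2 reader threads.
locations = [{0, 1}, {1, 2}, {0, 2}, {0, 1}, {1, 2}, {0, 2}]
for t, slot in enumerate(round_robin_schedule(locations, num_readers=2), start=1):
    print(f"time slot {t}: {slot}")
```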
  • with reference to FIG. 2, there is shown a flow diagram of a method 200 for optimizing a scheduling of file block communications from a plurality of data nodes 102a-102n to a client 120 in a virtual distributed file system 100, according to an example. It should be understood that the method 200 may include additional steps and that one or more of the steps described herein may be removed and/or modified without departing from a scope of the method 200.
  • the description of the method 200 is made with reference to the virtual distributed file system 100 depicted in FIG. 1 and thus makes particular reference to the elements contained in the system 100. It should, however, be understood that the method 200 may be implemented in a system that differs from the system 100 without departing from a scope of the method 200.
  • discrete time periods for communicating the file blocks (B) of one or more files (F) may be set.
  • An example of a manner in which the discrete time periods may be set is provided with respect to FIG. 3, which depicts a diagram 300 of a time period 302. More particularly, each time period 302 may be composed of a time frame that is equivalent to an average time spent communicating a file block (B) 304 and a tolerated variance 306. The average time spent communicating a file block (B) 304 may be compiled through an analysis of historical communications of file blocks (B) from the data nodes 102a-102n.
  • the tolerated variance 306 may comprise a length of time to allow for some relatively small tolerance in the variation in communicating times, set to a small fraction of the average time spent communicating a file block.
  • the time period 302 over which a file block (B) is communicated, as depicted in FIG. 3, is for purposes of developing a model as discussed in greater detail herein below. As such, during the actual communication of the file blocks (B), communication of a subsequent file block may begin immediately following completion of the communication of a prior file block.
  • the processor 122 may receive information regarding the identifications and the locations of the file blocks (B) from the name node 110.
  • the processor 122 may receive an instruction, for instance, from a user or a computing device, for the one or more particular files (F1, F2), which may be considered as a set of file blocks treated as a single file in totality, to be communicated to the client 120, prior to step 204.
  • the processor 122 may query and receive the information regarding the identifications and the locations of the file blocks (B) from the name node 110.
  • the hard drive 114 of the name node 110 may contain this information.
  • a scheduling optimization problem subject to one or more constraints is developed based upon the information received from the name node 110.
  • the scheduling optimization problem includes choosing an ordered communication of file blocks from the data nodes 102a-102n to minimize the length of time required to communicate the entire file and thus achieve the highest possible output stream rate.
  • the scheduling optimization problem is also subject to one or more constraints. More particularly, the processor 122 may determine for each time period 302 from which of the data nodes 102a-102n to communicate a particular file block (B).
  • a chart 400 depicting an example communication schedule is provided in FIG. 4.
  • a "1" indicates a scheduled communication of a file block (B) from a data node (N) at a time period (T) and a "0" indicates that a communication of a file block (B) is not scheduled for that data node (N) at that time period (T).
  • a communication of a first file block (B) is scheduled at the first time period (T) from the first data node (N1) and a communication of a second file block (B) is also scheduled at the first time period (T) from the third data node (N3).
  • a communication of a fourth file block (B) is scheduled from the first data node (N1) and a communication of a fifth file block (B) is scheduled from the second data node (N2).
  • the communication of the file blocks (B) may be scheduled for a total number of time periods (NT) until all of the file blocks (B) of one or more desired files (F) are scheduled to be communicated.
  • the maximum number of communications that may be performed at any time period may be limited by the total number of threads of execution or readers 126a-126n that are available to the client 120 for the communications. Moreover, or alternatively, the maximum number of communications that may be performed at any time period may be limited by the bandwidth available to the client 120 for communicating the file blocks (B) from the data nodes 102a-102n.
  • the processor 122 is configured to develop the scheduling optimization problem at step 206 as a cost function minimization problem, which is subject to a set of constraints.
  • the cost function (C) is defined as:
  • b_itn is a binary-valued decision variable that represents a decision to communicate, or not communicate, a replica of the ith file block from data node n at time period t
  • T_max is the maximum number of time slots
  • M is the total number of data nodes
  • B is the total number of file blocks.
  • bufferCost(t, i) = NA where i - t < 0 (communicating the block that late would break the communicate-by constraint)
  • bufferCost(t, i) = (ceil(i/R) - t)^2 where ceil(i/R) - t ≥ W + 1, and bufferCost(t, i) = 0 otherwise; in which W represents a window size within which no additional cost is applied and R is the number of readers.
  • the function ceil(i/R) is used instead of the actual block index i to ensure that R blocks can be communicated per time slot without triggering a penalty for communicating a block too far in advance.
  • W may be varied according to the amount of buffering that is to be used. For instance, if more buffering is available, then W may be increased.
  • An intent of the buffer cost is to encourage blocks to be read a relatively short time before they are needed, and to penalize communication of the blocks far in advance of when they are needed.
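  • a minimal sketch of the bufferCost behaviour described above, assuming blocks are indexed from 1 and representing the "NA" case as an infinite cost; the exact coefficients of Equation (1) are not reproduced in the text, so this is an approximation:

```python
import math

def buffer_cost(t, i, num_readers, window):
    """Buffer-holding penalty for communicating block i during time slot t.

    The streaming deadline of block i is approximated as ceil(i / num_readers),
    following the discussion of Equation (1). No penalty applies within `window`
    slots of the deadline, a quadratic penalty applies further in advance, and
    reading after the deadline is marked infeasible (communicate-by violation).
    """
    slack = math.ceil(i / num_readers) - t
    if slack < 0:
        return float("inf")      # too late: would break the communicate-by constraint
    if slack <= window:
        return 0.0               # read shortly before it is needed: no buffering penalty
    return float(slack ** 2)     # read far in advance: quadratic buffering penalty
```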
  • the cost function (C) may be modified to suit a particular set of system resources.
  • the decision variables are weighted by time to encourage communications to occur as soon as possible.
  • the cost function in Equation (1) is linear in the set of decision variables with each decision variable weighted by a scalar term.
  • the cost function may thus be represented as an inner product:
  • in Equation (2), which expresses the cost as the inner product C = c^T x_b, x_b is the vector of decision variables b_itn and c is a vector of constant coefficients computed according to the cost function equation in Equation (1).
  • the set of decision variables are arranged in the vector x b as illustrated in the diagram 500 in FIG. 5.
  • the decision variables b_itn are logically ordered first by file block number, then data node number and finally by time slot (only one full time slot has been shown for convenience).
  • 8 file blocks are illustrated for purposes of simplicity only and should thus not be considered as limiting the present invention in any respect.
  • the entry representing b_itn in x_b may be given by the index: i + (n-1)*N + (t-1)*N*M, where N is the number of data blocks to be communicated and M is the number of data nodes 102a-102n.
  • the solution to the integer programming problem is the vector x b , which defines the set of decision variables and thus the schedule for communicating the file blocks.
  • the term b_in is a binary variable that indicates whether the ith block is on node n (value is 1) or not (value is 0).
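  • the vector ordering and index formula above can be made concrete as follows; the additive time weighting is an assumption based on the statement that the decision variables are weighted by time, and the sketch reuses buffer_cost from the previous example:

```python
import numpy as np

def flat_index(i, n, t, num_blocks, num_nodes):
    """Zero-based position of decision variable b_itn in the vector x_b.

    Mirrors the ordering described above: block index varies fastest, then data
    node, then time slot. Arguments i, n and t are 1-based, as in the text.
    """
    return (i - 1) + (n - 1) * num_blocks + (t - 1) * num_blocks * num_nodes

def cost_vector(num_blocks, num_nodes, t_max, num_readers, window):
    """Coefficient vector c of Equation (2): time weighting plus buffer cost."""
    c = np.zeros(num_blocks * num_nodes * t_max)
    for t in range(1, t_max + 1):
        for n in range(1, num_nodes + 1):
            for i in range(1, num_blocks + 1):
                c[flat_index(i, n, t, num_blocks, num_nodes)] = (
                    t + buffer_cost(t, i, num_readers, window)
                )
    return c
```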
  • the scheduling optimization problem is subject to one or more constraints.
  • the schedule is optimized while satisfying the one or more constraints.
  • constraints to which the scheduling optimization problem may be subjected to are discussed below.
  • the optimization may be subject to a communicate once constraint as, for instance, defined by the following equation:
  • in Equation (3) and in subsequent equations, the notation ∀i means "for all i", which means that for each value of the variable i, the constraint equation must hold.
  • Equation (3) thus strictly represents a set of constraint equations, one per each value of i.
  • as indicated by Equation (3), over all time, the ith block is communicated only once. Referring to FIG. 4, a third axis may be imagined coming out of the page which represents the index, i, of the file block being read from a data node 102a-102n in a given time slot. Equation (3) amounts to ensuring that there is only a single non-zero entry in the whole time-node plane of FIG. 4, for each block index i. In FIG. 4, each time-node plane for each block index has been effectively projected, or flattened, onto a single plane, to better illustrate the example schedule.
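  • a sketch of how the communicate-once rows might be assembled for a solver, reusing the flat_index helper above (an illustration, not the patent's own implementation):

```python
import numpy as np

def communicate_once_rows(num_blocks, num_nodes, t_max):
    """One equality row per file block: summed over every data node and time
    slot, each block is communicated exactly once (Equation (3) as described)."""
    num_vars = num_blocks * num_nodes * t_max
    a_eq = np.zeros((num_blocks, num_vars))
    for i in range(1, num_blocks + 1):
        for n in range(1, num_nodes + 1):
            for t in range(1, t_max + 1):
                a_eq[i - 1, flat_index(i, n, t, num_blocks, num_nodes)] = 1.0
    return a_eq, np.ones(num_blocks)
```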
  • the optimization may be subject to a collision constraint as, for instance, defined by the following equation:
  • Equation (4): Σ_i b_itn ≤ 1, ∀ t, n where b_in ≠ 0.
  • as indicated by Equation (4), at most one file block is communicated at each time period from each data node 102a-102n.
  • the collision constraint of Equation (4) may be implemented because, if multiple communications from the same data node were allowed during the same time period, the communication operations would interfere with each other: the communications would keep alternately moving the disk reading head between different parts of the disk when reading each of the different file blocks.
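  • the collision constraint can likewise be sketched as one inequality row per (time slot, data node) pair, again reusing the flat_index helper:

```python
import numpy as np

def collision_rows(num_blocks, num_nodes, t_max):
    """At most one block is communicated from a given data node in a given
    time slot (Equation (4)): one row per (slot, node) pair, upper bound 1."""
    num_vars = num_blocks * num_nodes * t_max
    rows = np.zeros((t_max * num_nodes, num_vars))
    for t in range(1, t_max + 1):
        for n in range(1, num_nodes + 1):
            row = (t - 1) * num_nodes + (n - 1)
            for i in range(1, num_blocks + 1):
                rows[row, flat_index(i, n, t, num_blocks, num_nodes)] = 1.0
    return rows, np.ones(t_max * num_nodes)
```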
  • the optimization may be subject to a modified communicate once constraint as, for instance, defined by the following equation:
  • Equation (5): Σ_i b_itn - γ_nt + 2·γ_n(t-1) ≤ 1, ∀ t, n where b_in ≠ 0.
  • in Equation (5), γ_nt is a binary-valued variable for each data node n and time slot t. If γ_nt is set to one, this can allow two values of b_itn to be nonzero (that is, two blocks from the same data node may be read at the same time) provided γ_n(t-1) is zero. If γ_n(t-1) is equal to one, then no more communications are allowed because the data node is still communicating from the previous time slot.
  • Equation (5) indicates that a collision may be allowed, for instance, to allow support for a two-on-one collision, that is, two readers reading a file block from the same data node.
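  • a small check of the relaxed rule, using the reconstructed form of Equation (5) given above (the reconstruction itself is an assumption based on the surrounding description):

```python
def two_on_one_allowed(blocks_read, gamma_now, gamma_prev):
    """Return True if reading `blocks_read` blocks from one data node in one
    slot satisfies the relaxed collision rule: two reads are permitted when
    gamma_now is 1 and gamma_prev is 0, and none are permitted when gamma_prev is 1."""
    return blocks_read - gamma_now + 2 * gamma_prev <= 1

assert two_on_one_allowed(2, gamma_now=1, gamma_prev=0)      # two-on-one collision allowed
assert not two_on_one_allowed(1, gamma_now=0, gamma_prev=1)  # node still busy from last slot
```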
  • the optimization may be subject to a completion constraint as, for instance, defined by the following equation:
  • in Equation (6), the number of time slots is limited to T_max, and the value of T_max impacts the number of decision variables and is thus selected to be as small as possible for reasons of efficiency, while still permitting a feasible solution to be found. If the condition set forth in Equation (6) cannot be met, then there will be no feasible solution to the scheduling optimization problem. Generally, the cost function will drive a solution to minimize the time to complete the schedule, and thus it may be acceptable to choose T_max to be a little larger than would be anticipated, to guarantee that a feasible solution can be found without expanding the problem size excessively.
  • the optimization may be subject to a communicate-by constraint.
  • the ith file block should be communicated up to and including the ith time period. This is the deadline for communicating the ith file block because that file block should be streamed during time period i+1, so it must be in the buffer at the end of the ith time period.
  • the optimization may be subject to a buffer constraint.
  • This constraint enforces limits on the maximum number of file blocks that may be stored in the buffer 128. This may be approximated by modifying the communicate-by constraint to enforce that the ith block may only be communicated within m time periods before the ith deadline, that is, 0 ≤ b_itn ≤ 1 only for time periods t with i - m ≤ t ≤ i.
  • the cost function (C) discussed above with respect to Equation (1 ) allows for the cost of buffering to be included. This is because the bufferCost function assigns no cost to storing a file block in memory if it is communicated 1 or 2 time slots before it is required to be serialized to the output. However, if the file block is buffered more than, for instance, 2 time slots before it is required, a higher penalty cost is assigned, given by a quadratic expression.
  • the optimization may be subject to a reader constraint.
  • the number of reader threads 126a-126n executed at the client 120 is limited to a maximum R.
  • the number of readers 126a-126n may be limited by the processor 122 or networking resources available to the client.
  • An example of the reader constraint may be defined according to the following equation:
  • Equation (7): Σ_i Σ_n b_itn ≤ R, ∀ t.
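  • the reader constraint can be sketched in the same style as the earlier constraint builders, one row per time slot, with R as the upper bound:

```python
import numpy as np

def reader_rows(num_blocks, num_nodes, t_max, num_readers):
    """No more than `num_readers` blocks are communicated in any one time slot
    (Equation (7)): one row per slot, summing every variable for that slot."""
    num_vars = num_blocks * num_nodes * t_max
    rows = np.zeros((t_max, num_vars))
    for t in range(1, t_max + 1):
        for n in range(1, num_nodes + 1):
            for i in range(1, num_blocks + 1):
                rows[t - 1, flat_index(i, n, t, num_blocks, num_nodes)] = 1.0
    return rows, np.full(t_max, float(num_readers))
```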
  • the optimization may be subject to a failure constraint.
  • a sufficient number of spare time slots are reserved to perform a communication from a secondary data node to provide a contingency in the event of a communication failure by a primary data node.
  • the second communication from the secondary data node is required to occur later than the first communication and from a different data node than the primary data node.
  • the maximum number of time slots held in reserve may be set to, for instance, a few more than the maximum number of file blocks stored on any given data node 102a-102n to substantially ensure that if a data node 102a-102n completely fails, then all of the file blocks on that data node 102a-102n will be communicated from one or more of the remaining nodes 102a-102n.
  • the failure constraint may be addressed by decomposing the communications schedule into multiple parts, where between the completion of one schedule and the start of the next, a whole time period is allocated to allow for communication of the data blocks that could not be communicated at the allotted time.
  • the content at the output may be streamed at 4 times the speed that the content is read from any given data node 102a-102n.
  • the output stream should be written at a slower rate than the maximum speed at which the content may be streamed to allow for some contingency to re-communicate file blocks from one or more other data nodes 102a-102n, in the event that one or more of the data nodes 102a-102n fail.
  • the re-communication of the file blocks should be scheduled such that the failed node is avoided. To meet these requirements, for instance, every 4th time slot may be left free to provide sufficient time for re-communications to be performed.
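  • a trivial sketch of how the reserved contingency slots of the FIG. 6 example might be enumerated (the spacing of 4 matches that example and is otherwise an assumption):

```python
def contingency_slots(t_max, spacing=4):
    """Time slots held in reserve for re-communications: every `spacing`-th
    slot, as in the FIG. 6 example where every 4th slot is left free."""
    return [t for t in range(1, t_max + 1) if t % spacing == 0]

print(contingency_slots(16))   # [4, 8, 12, 16]
```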
  • FIG. 6 An example of a communications schedule in which contingency periods are provided is depicted in the diagram 600 in FIG. 6.
  • a relatively large number of file blocks, for instance, 12 file blocks, are buffered at startup prior to writing any of the data blocks to the output stream.
  • FIG. 6 assumes that the blocks are readable in order from the data nodes 102a-102n, however, this may not typically be the case in practice.
  • file block 1 is communicated from data node 1 at time slot 1, etc.
  • the fourth time slot is a first planned contingency period. However, because each of the file blocks 1-12 have been communicated, none of the file blocks 1-12 are required to be communicated during the first planned contingency period.
  • data node 3 (N3) is depicted as having failed in the 6th and 7th time slots, which are shown as shaded squares. As such, file blocks 19 and 23 have been communicated in error or have not been communicated at all.
  • the file blocks 19 and 23 are re-communicated from the second data node 2 (N2) and the fourth data node 4 (N4), respectively, which are depicted with underlines.
  • the file blocks 19 and 23 are re-communicated during the second contingency period from data nodes (N2 and N4) other than the failed data node N3.
  • all communications from data node N3 now need to be re-scheduled.
  • file blocks 27, 31, 35, etc. need to be communicated from data nodes other than the data node N3 during future contingency periods.
  • the contingency periods are scheduled to enable sufficient opportunity to re-communicate all of the file blocks scheduled to be communicated from any one data node 102a-102n, while still maintaining one file block per data node during the contingency periods.
  • additional constraints are applied while adjusting other constraints. For instance, a contingency period must be provided at least as often as there are numbers of readers. As another example, the contingency period may be provided more often than there are numbers of readers to ensure that a re- communication of a file block may occur fairly soon after the file block's first scheduled communication in order to minimize the number of data blocks held in the buffer 128 and to satisfy the communicate-by constraint discussed above.
  • t_c is defined to be a contingency time, which may take one of the values from the set of contingency time periods denoted S_c.
  • the contingency constraints in this particular example illustrated in FIG. 6 are of the form:
  • Equation (10) is of the form b - Σ b ≥ 0.
  • in Equations (8)-(10), there are three time periods between each contingency period, hence three separate sets of equations.
  • Equations (8)-(10) may be adapted to an arbitrary frequency of contingency periods.
  • Equations (8)-(10) model the failure of each data node.
  • the constraints allow for a failure of either Data Node 1, Data Node 2, Data Node 3, or Data Node 4.
  • in Equation (11), the right-hand limit has been increased from 1 to 3, which is one less than the contingency period, because the node will itself be modelled as failed in one set of the equations above. The communicate-once constraint will also need to be modified so that it is not applied at a time which is one of the contingency time period slots.
  • the optimization may be subject to an initial conditions constraint.
  • the distribution of the file blocks is required to be known.
  • the distribution of the file blocks may be represented by the variables b_in. These variables are either 0 or 1, indicating whether the ith block is stored on the nth data node or not.
  • a solution to the scheduling optimization problem subject to the one or more constraints is determined to derive a schedule for optimized communication of the file blocks (B) from two or more of the data nodes 102a-102n.
  • the scheduling optimization problem may be solved using any of several different techniques.
  • Example techniques include integer programming (branch and bound), genetic algorithms, random sampling guided by heuristics, etc.
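  • as one possible realization of the integer programming (branch and bound) option, the sketches above can be handed to SciPy's mixed-integer solver; the choice of SciPy, the 1e6 stand-in for the "NA" cost, and the small example sizes are all assumptions made for illustration:

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

def solve_schedule(num_blocks, num_nodes, t_max, num_readers, window):
    """Solve the 0-1 scheduling problem with SciPy's branch-and-bound MILP
    solver, using the cost_vector and constraint-row sketches defined earlier."""
    c = cost_vector(num_blocks, num_nodes, t_max, num_readers, window)
    c = np.where(np.isinf(c), 1e6, c)   # large penalty in place of the "NA" entries

    a_once, b_once = communicate_once_rows(num_blocks, num_nodes, t_max)
    a_coll, u_coll = collision_rows(num_blocks, num_nodes, t_max)
    a_read, u_read = reader_rows(num_blocks, num_nodes, t_max, num_readers)

    constraints = [
        LinearConstraint(a_once, b_once, b_once),   # communicate each block exactly once
        LinearConstraint(a_coll, 0, u_coll),        # one block per data node per slot
        LinearConstraint(a_read, 0, u_read),        # at most R readers per slot
    ]
    return milp(c, constraints=constraints,
                integrality=np.ones_like(c), bounds=Bounds(0, 1))

result = solve_schedule(num_blocks=8, num_nodes=4, t_max=4, num_readers=2, window=1)
print(result.success, result.fun)
```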
  • the number of decision variables may substantially be reduced by decomposing the scheduling optimization problem into a plurality of sub-problems.
  • a schedule for communicating a first set of P file blocks and a schedule for communicating a second set of P file blocks and so on may separately be derived.
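  • a sketch of the decomposition into sub-problems of P consecutive blocks each (the chunking rule is an assumption; the text only states that the sets are scheduled separately):

```python
def decompose_blocks(num_blocks, p):
    """Split 1-based block indices into consecutive chunks of at most p blocks,
    so each chunk can be scheduled as a smaller, independent sub-problem."""
    return [list(range(start, min(start + p - 1, num_blocks) + 1))
            for start in range(1, num_blocks + 1, p)]

print(decompose_blocks(10, 4))   # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```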
  • the file blocks (B) are communicated from two or more of the data nodes 102a-102n according to the schedule derived at step 208 and stored in the buffer 128.
  • the processor 122 processes the communicated and stored file blocks to compile them in the correct order.
  • the processor 122 may output the compiled file blocks as an output stream to be used by the client 120 or to be outputted to the output device 140.
  • the method 200 may be modified to also provide an optimized communications schedule when more than one file is to be concurrently communicated by multiple clients 120, as discussed in greater detail herein below. This situation may arise, for instance, when there are multiple printers that are concurrently printing multiple files stored on the distributed file system 100.
  • the method 200 may also be modified such that at least one of multiple clients 120 writes file blocks to the data nodes 102a-102n, while at least one of the multiple clients 120 communicates file blocks from the data nodes 102a-102n.
  • the method 200 may further be modified to cause the file blocks to be allocated to the data nodes in a manner that ensures that an efficient communications ordering of the file blocks is possible, subject to the one or more constraints.
  • the optimized scheduling problem may be approached in one of two ways.
  • the solution to the first client problem may be determined, for instance, under some possibly stricter constraints to avoid crowding out the second client, and the solution may be used as an additional set of constraints to apply to the solution to the second client problem.
  • both the first client problem and the second client problem may be solved concurrently. In the former case, if data node n is occupied for some set of times ⁇ 1 ....t p ⁇ by the first client, then that data node should not be communicated from during those times by the second client. This condition may be expressed in equation form as:
  • Equation (12): b_itn = 0, ∀ i, for data node n and t ∈ {t_1, ..., t_p}.
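  • Equation (12) could be imposed on the second client's problem by zeroing the upper bounds of the affected decision variables, reusing the flat_index sketch above (illustrative only; the occupied set is an assumed input):

```python
import numpy as np

def occupancy_upper_bounds(occupied, num_blocks, num_nodes, t_max):
    """Upper bounds for the second client's variables: b_itn is forced to 0
    whenever data node n is occupied by the first client at slot t; every
    other variable keeps an upper bound of 1."""
    upper = np.ones(num_blocks * num_nodes * t_max)
    for n, t in occupied:                       # occupied: iterable of (node, slot) pairs
        for i in range(1, num_blocks + 1):
            upper[flat_index(i, n, t, num_blocks, num_nodes)] = 0.0
    return upper
```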
  • the constraint equations described above may be combined into a larger optimized scheduling problem.
  • a decision to combine equations may be made on a case by case basis.
  • the constraint equations discussed above have been represented in matrix form for solution, for instance, using an engineering program such as MATLAB.
  • the first client problem and the second client problem initially form two independent problems, namely to minimize the sum of the cost functions in Equation (1) for each problem subject to:
  • Equation (13): A_1·x_1 ≤ k_1 and A_2·x_2 ≤ k_2.
  • each set of constraint coefficients may be taken from the corresponding matrix A_x.
  • Equation (14): A_c·x_c ≤ k_c, where forming x_c = [b_1 b_2] will require combining the vectors from the sub-problems in different ways according to the type of constraint they represent.
  • all such constraint equations from the sub-problems may be paired in a manner that effectively zero pads each row of matrix A_c.
  • Some or all of the operations set forth in the method 200 may be contained as one or more utilities, programs, or subprograms, in any desired computer accessible or readable storage medium.
  • the method 200 may be embodied by a computer program, which may exist in a variety of forms both active and inactive. For example, it can exist as software program(s) comprised of program instructions in source code, object code, executable code or other formats. Any of the above can be embodied on a computer readable storage medium.
  • Exemplary computer readable storage devices or media include conventional computer system RAM, ROM, EPROM, EEPROM, and magnetic or optical disks or tapes. Concrete examples of the foregoing include distribution of the programs on a CD ROM or via Internet download. It is therefore to be understood that any electronic device capable of executing the above-described functions may perform those functions enumerated above.
  • FIG. 7 illustrates a computer system 700, which may be employed to perform the various functions of the client 120 described herein above, according to an example.
  • the computer system 700 may be used as a platform for executing one or more of the functions described hereinabove with respect to the client 120.
  • the computer system 700 includes a processor 702, which may be used to execute some or all of the steps described in the method 200. Commands and data from the processor 702 are communicated over a communication bus 704.
  • the computer system 700 also includes a main memory 706, such as a random access memory (RAM), where the program code may be executed during runtime, and a secondary memory 708.
  • the secondary memory 708 includes, for example, one or more hard disk drives 710 and/or a removable storage drive 712, representing a floppy diskette drive, a magnetic tape drive, a compact disk drive, etc.
  • the removable storage drive 712 reads from and/or writes to a removable storage unit 714 in a well-known manner.
  • User input and output devices may include a keyboard 716, a mouse 718, and a display 720.
  • a display adaptor 722 may interface with the communication bus 704 and the display 720 and may receive display data from the processor 702 and convert the display data into display commands for the display 720.
  • the processor 702 may communicate over a network, for instance, the Internet, LAN, etc., through a network adaptor 724.
  • the computer system 700 may include a system board or blade used in a rack in a data center, a conventional "white box" server or computing device, etc. Also, one or more of the components in FIG. 7 may be optional (for instance, user input devices, secondary memory, etc.).
  • the terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Those skilled in the art will recognize that many variations are possible within the scope of the invention, which is intended to be defined by the following claims, and their equivalents, in which all terms are meant in their broadest reasonable sense unless otherwise indicated.

Abstract

In a method for optimizing a scheduling of file block communications for a client from a plurality of data nodes in a virtual distributed file system, in which the plurality of data nodes contain redundant copies of the file blocks, a schedule for communicating the file blocks substantially concurrently from multiple ones of the plurality of data nodes to the client that minimizes the amount of time required to communicate the file blocks is derived.

Description

OPTIMIZING FILE BLOCK COMMUNICATIONS IN A VIRTUAL DISTRIBUTED
FILE SYSTEM
BACKGROUND
[0001] A virtual distributed file system may be defined as a file system that is overlaid on a set of other file systems belonging to different file system storage nodes (hereinafter referred to as "nodes"). In a virtual distributed file system, files are split into blocks and the blocks are then distributed across the file systems of participating nodes. A metadata server then records where the blocks are located and is responsible for directing where file blocks are written. An example of a virtual distributed file system is the Hadoop Distributed File System (www.hadoop.apache.org/core/docs/current/hdfs_design.html).
[0002] Multiple replicas of each block are typically written to separate nodes for redundancy and can be used to improve read rates from the hosts by utilizing parallelism. The read rates are often improved because when a client reads the output file(s), the client typically reads files in parallel from several different nodes, which may include, for instance, computer servers. In reading the output file(s) from the multiple nodes, the client often uses a number of threads of execution, or readers, and a certain number of file blocks are buffered in memory. Buffering of the file blocks enables the file blocks to be read out of order, while allowing them to still be written to an output stream in the correct order. To limit the size of the buffer, blocks cannot be read in just any order. It is preferable to read a block a short time before it will need to be written to the output stream, to thus minimize the time that the block occupies space in the buffer.
[0003] Because the file blocks are redundantly replicated across multiple nodes, the client is able to read different blocks from the same file at the same time from different nodes. This simultaneous reading of the file blocks allows a higher speed output stream to be generated even in situations where the nodes have relatively slow performance disks. This arrangement effectively enables multiple low speed storage reads to be multiplexed into a single high speed output stream by the client. In addition, the maximum speed-up of the output stream rate over a single read is upper-bounded by the number of reader threads that the client uses.
[0004] However, reading the output files from a single location to form a single data stream for further processing has been found to be problematic. This has been found to occur either when a file's blocks need to be combined or when an input file has been first split into several smaller files for parallel processing, but after processing the separate output files need to be recombined prior to creating the single output stream.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] Features of the present invention will become apparent to those skilled in the art from the following description with reference to the figures, in which:
[0006] FIG. 1 depicts a simplified block diagram of a virtual distributed file system, according to an embodiment of the invention;
[0007] FIG. 2 depicts a flow diagram of a method for optimizing a scheduling of file block communications from a plurality of data nodes to a client in a virtual distributed file system, according to an embodiment of the invention;
[0008] FIG. 3 depicts a diagram of a time period during which a particular file block may be communicated from a data node, according to an embodiment of the invention;
[0009] FIG. 4 depicts a chart of an example file block communications schedule, according to an embodiment of the invention;
[0010] FIG. 5 shows a diagram of a set of decision variables arranged along a time slot, according to an embodiment of the invention;
[0011] FIG. 6 shows a chart of an example file block communications schedule in which contingency periods are provided, according to an embodiment of the invention; and
[0012] FIG. 7 illustrates a computer system, which may be employed to perform various functions of the client and/or the data nodes depicted in FIGS. 1A and 1B in performing some or all of the steps contained in the diagram depicted in FIG. 2, according to an embodiment of the invention.
DETAILED DESCRIPTION
[0013] For simplicity and illustrative purposes, the present invention is described by referring mainly to an exemplary embodiment thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent however, to one of ordinary skill in the art, that the present invention may be practiced without limitation to these specific details. In other instances, well known methods and structures have not been described in detail so as not to unnecessarily obscure the present invention.
[0014] Disclosed herein is a method for optimizing a scheduling of file block communications for a client from a plurality of data nodes in a virtual distributed file system, in which the data nodes contain redundant copies of the file blocks. The optimized scheduling of the file block communications is framed as an optimization-constraint problem, the solution to which provides the optimized scheduling of file block communications subject to the constraints. In addition, the scheduling of the file block communications may be considered to be optimized when the amount of time required to communicate the file blocks is minimized subject to one or more constraints.
[0015] Through implementation of the method disclosed herein, multiple file blocks may substantially concurrently be communicated from multiple data nodes to a client. In addition, the client may process the received file blocks to ensure that they are arranged in order, and may output an output stream containing the arranged file blocks. The scheduling of the file block communications may also be considered to be optimized when the output stream is able to be outputted without interruption and with the file blocks being stored in a buffer for a relatively short period of time. One potential application of the method disclosed herein is in high speed printer operations, which are able to consume large amounts of printing data at fast rates. Another potential application of the method disclosed herein is in multi-media applications, such as, streaming video applications.
[0016] With reference first to FIG. 1, there is shown a simplified block diagram of a virtual distributed file system 100, according to an example. It should be understood that the virtual distributed file system 100 may include additional components and that one or more of the components described herein may be removed and/or modified without departing from a scope of the virtual distributed file system 100.
[0017] As shown, the virtual distributed file system 100 includes a plurality of data nodes 102a-102n, where n is a value greater than 1. The data nodes 102a-102n, which may comprise computer servers, are depicted as each including a respective hard drive 104 and an optional writer 170. The writer 170 is considered optional for reasons discussed below. The hard drives 104 are depicted as storing various file blocks (B1-BN) of a first file (F1) and file blocks (B1-BN) of a second file (F2). As also shown in FIG. 1, both of the hard drives 104 in a first data node 102a and in a last data node 102n contain at least some of the same file blocks. In addition, each file block of each file may be stored in at least two of the data nodes 102a-102n for redundancy and improved read rate purposes.
[0018] Although the hard drives 104 have been depicted as being integrated with respective data nodes 102a-102n, the hard drives 104 may comprise separate devices from the data nodes 102a-102n. In addition, the number of files (F) and file blocks (B) shown in the data nodes 102a-102n are for purposes of illustration only and are thus not intended to limit the data nodes 102a-102n in any respect. Moreover, one or more of the hard drives 104 may store less than all of the file blocks of a particular file.
[0019] The system 100 is also depicted as including a name node 110, which may comprise a computer server, for instance. The name node 110 is depicted as including a processor 112 and a hard drive 114. In addition, the hard drive 114 is depicted as having stored thereon information pertaining to the locations (data node) and identifications (file and file block) of the files (F) and file blocks (B) stored on the hard drives 104 of the data nodes 102a-102n. Thus, for instance, the processor 112 may query the data nodes 102a-102n at various intervals of time to determine the identifications and locations of the file blocks (B). In addition, or alternatively, the data nodes 102a-102n may be configured to submit the identity and location information of the file blocks (B) to the name node 110. As a yet further alternative, the name node 110 may receive the identity and location information of the file blocks (B) concurrently with the storage of the file blocks (B) into the hard drives 104 of the data nodes 102a-102n.
[0020] The system 100 is also depicted as including a client 120, which may comprise, for instance, a personal computer, a laptop computer, a personal digital assistant (PDA), a mobile telephone, a printing device, etc. The client 120 may also form part of a personal computer, laptop computer, a PDA, a mobile telephone, a printing device, etc. In any regard, the client 120 is depicted as including a processor 122, an input/output module 124, a plurality of readers 126a- 126n, a buffer 128, and a data store 130. The client 120 may include additional components depending upon the device type of the client 120. For instance, in instances where the client 120 comprises a printing device, the client 120 may include one or more printing components. In addition, although a single client 120 has been depicted in FIG. 1 , it should be understood that the system 100 may include any number of clients 120 that may operate individually or concurrently with respect to each other.
[0021] The system 100 is still further depicted as including an output device 140. The output device 140 may also comprise, for instance, a personal computer, a laptop computer, a personal digital assistant (PDA), a mobile telephone, a printing device, etc. In one regard, the client 120 is configured to process the file blocks (B) and to output the processed file blocks (B) to the output device 140, as discussed in greater detail herein below. The output device 140 may be optional because in various instances, the client 120 may perform all of the functions of the output device 140.
[0022] The processor 122 of the client 120 is configured to implement the input/output module 124 to communicate data to and from the name node 110 and, in certain embodiments, to communicate an output stream formed of correctly ordered file blocks (B) to the output device 140. The input/output module 124 may thus include hardware and/or software to enable the processor 122 to connect to the network 150 through a wired or wireless connection. In addition, some of the data received through the input/output module 124 may be stored in the data store 130, which may comprise volatile and/or non-volatile memory, such as DRAM, EEPROM, MRAM, flash memory, and the like. In addition, or alternatively, the data store 130 may comprise a device configured to read from and write to a removable media, such as, a floppy disk, a CD-ROM, a DVD-ROM, or other optical or magnetic media.
[0023] In any regard, according to an embodiment, the processor 122 is configured to implement two or more of the readers 126a-126n, which may comprise threads of execution, or separate co-ordinated processes, to substantially concurrently read file blocks (B) from the data nodes 102a-102n over one or more time periods. In this embodiment, the writers 170 are considered to be optional because the readers 126a-126n are configured to read the file blocks (B). According to another embodiment, the processor 122 is configured to send a trigger to selected ones of the data nodes 102a-102n for the optional writers 170 to substantially concurrently push the file blocks (B) to the client 120. In this embodiment, the readers 126a-126n may be considered to be optional because the writers 170 may directly write the file blocks (B) into the buffer 128.
[0024] In any regard, the communicated file blocks (B), which are either written or read as discussed above, are configured to be stored in the buffer 128 and the processor 122 is configured to process the stored file blocks (B) to cause the stored file blocks (B) to be compiled back into a correct order of the file (F) from which the file blocks (B) were derived. In addition, the client 120 may output a stream containing the correctly ordered file blocks (B). According to an example, the stream may comprise a multimedia stream and the output may comprise an audio output and a visual output of the client 120 or the output device 140. According to another example, the stream may comprise data to be used for a printing operation on the client 120 or the output device 140.
[0025] In any regard, the processor 122 is configured to optimize the scheduling of the file block (B) communications from the plurality of data nodes 102a-102n subject to one or more constraints. More particularly, for instance, the processor 122 is configured to derive a schedule for communicating the file blocks (B) from the plurality of data nodes 102a-102n that substantially minimizes the amount of time required to communicate all of the file blocks (B) of one or more files (F) from the plurality of data nodes 102a-102n. That is, the processor 122 is configured to determine a schedule that defines from which data nodes 102a-102n which of the file blocks (B) are to be communicated at each time interval, in which the schedule requires the least amount of time, while complying with or satisfying the one or more constraints. Various manners in which the processor 122 may operate as well as the one or more constraints themselves are discussed in greater detail herein below.
[0026] According to an example, the processor 122 is configured to implement or execute an algorithm containing computer-readable instructions for optimizing the scheduling of the file block communications. The algorithm may be stored on a computer readable storage medium that is either integrated with or external to the client 120, such as the data store 130. In another example, the algorithm may be stored on a hardware device, such as an electronic circuit component, that the processor 122 is configured to implement.
[0027] Various components contained in the system 100 are configured to communicate data to and from each other over a first sub-network 150 and a second sub-network 160, each of which comprises a wired or wireless communication link between the system 100 components. The first sub-network 150 and the second sub-network 160 may each comprise a local area network, a wide area network, the Internet, or a combination thereof. As shown in FIG. 1, the first sub-network 150 links the client 120 to the data nodes 102a-102n through, for instance, a switch (not shown). The connections between the data nodes 102a-102n and the switch may be relatively slower connections as compared with the connection between the switch and the client 120. In addition, the second sub-network 160 is depicted as linking the client 120 to the output device 140 such that an output stream may be written to the output device 140 through the second sub-network 160. Through use of the first and second sub-networks 150 and 160, bandwidth need not be shared during the communication of file blocks from the data nodes 102a-102n and the outputting of the output stream to the output device 140. Although not explicitly shown in FIG. 1, each of the data nodes 102a-102n, the name node 110, the client 120, and the output device 140 may include hardware and/or software for communicating and receiving data over the network 150. [0028] Alternatively, however, some of the components may be configured to communicate directly with each other instead of over the network 150 or 160. By way of example, the output device 140 may be connected directly to the client 120, for instance, through a serial or USB connection. In this example, the client 120 may comprise a computing device and the output device 140 may comprise a printer attached to the client 120.
[0029] Generally speaking, the client 120 may derive a schedule for communicating the file blocks of one or more files substantially concurrently from multiple ones of the plurality of data nodes 102a-102n in the distributed file system 100. In addition, the schedule may be derived such that it enables multiple file blocks to be substantially concurrently communicated while minimizing the amount of time required to communicate the file blocks. In one example, the schedule may include communication of the file blocks in a round-robin manner, such that the file blocks are read in order from available data nodes 102a-102n. The selection of which of the available data nodes 102a-102n from which the file blocks are to be communicated may be based upon, for instance, the access rates for the data nodes 102a-102n, current loading on the data nodes 102a-102n, etc. In another example, the schedule may include communication of the file blocks in a "shortsighted" manner, in which a decision on the best file block and data node 102a-102n from which to communicate the file block is made at each decision point. In a further example, the schedule may include a relatively longer sighted view of the communication of the file blocks that results in the file blocks of each file being communicated in order.
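By way of a non-limiting illustration only, the round-robin and "shortsighted" (greedy) selection strategies described above may be sketched as follows. This sketch is not part of the original disclosure; the block-to-node replica map, the per-node load estimates, and all function names are illustrative assumptions.

```python
# Illustrative sketch only (not part of the original disclosure): a
# "shortsighted" (greedy) schedule that, at each time slot, assigns the next
# blocks needed for the output stream to distinct, least-loaded data nodes
# holding a replica of each block.

def greedy_schedule(replicas, num_readers, node_load):
    """replicas: dict block_index -> set of node ids holding a replica.
    num_readers: maximum concurrent communications per time slot (R).
    node_load: dict node id -> relative load estimate (lower is better).
    Returns a list of time slots, each a list of (block, node) pairs."""
    pending = sorted(replicas)          # blocks still to be communicated, in output order
    schedule = []
    while pending:
        used_nodes = set()              # collision constraint: one block per node per slot
        slot = []
        for block in list(pending):
            if len(slot) == num_readers:        # reader constraint: at most R per slot
                break
            candidates = [n for n in replicas[block] if n not in used_nodes]
            if not candidates:
                continue                        # no free replica holder in this slot
            node = min(candidates, key=lambda n: node_load.get(n, 0.0))
            slot.append((block, node))
            used_nodes.add(node)
            pending.remove(block)
        schedule.append(slot)
    return schedule

# Example: 6 blocks, 2 replicas each, 3 data nodes, 2 readers.
replicas = {1: {1, 2}, 2: {2, 3}, 3: {1, 3}, 4: {1, 2}, 5: {2, 3}, 6: {1, 3}}
print(greedy_schedule(replicas, num_readers=2, node_load={1: 0.2, 2: 0.5, 3: 0.3}))
```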
[0030] Turning now to FIG. 2, there is shown a flow diagram of a method
200 for optimizing a scheduling of file block communications from a plurality of data nodes 102a-102n to a client 120 in a virtual distributed file system 100, according to an example. It should be understood that the method 200 may include additional steps and that one or more of the steps described herein may be removed and/or modified without departing from a scope of the method 200.
[0031] The description of the method 200 is made with reference to the virtual distributed file system 100 depicted in FIG. 1 and thus makes particular reference to the elements contained in the system 100. It should, however, be understood that the method 200 may be implemented in a system that differs from the system 100 without departing from a scope of the method 200.
[0032] As shown in FIG. 2, at step 202, discrete time periods for communicating the file blocks (B) of one or more files (F) may be set. An example of a manner in which the discrete time periods may be set is provided with respect to FIG. 3, which depicts a diagram 300 of a time period 302. More particularly, each time period 302 may be composed of a time frame that is equivalent to an average time spent communicating a file block (B) 304 and a tolerated variance 306. The average time spent communicating a file block (B) 304 may be compiled through an analysis of historical communications of file blocks (B) from the data nodes 102a-102n. In addition, the tolerated variance 306 may comprise a length of time that allows for some relatively small tolerance in the variation of communicating times, set to a small fraction of the average time spent communicating a file block. The time period 302 over which a file block (B) is communicated, as depicted in FIG. 3, is for purposes of developing a model as discussed in greater detail herein below. As such, during the actual communication of the file blocks (B), communication of a subsequent file block may begin immediately following completion of the communication of a prior file block.
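For purposes of illustration only, the time period length described above may be computed as in the following minimal sketch; the fraction used for the tolerated variance is an assumption and is not specified numerically in the description.

```python
# Illustrative sketch only: a time period is the historical average block
# communication time plus a small tolerated variance, here assumed to be a
# fixed fraction of that average.

def time_period_length(historical_times_s, variance_fraction=0.1):
    """historical_times_s: observed per-block communication times in seconds."""
    average = sum(historical_times_s) / len(historical_times_s)
    tolerated_variance = variance_fraction * average
    return average + tolerated_variance

print(time_period_length([0.95, 1.05, 1.00, 1.10]))  # approximately 1.13 s
```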
[0033] With reference back to FIG. 2, at step 204, the processor 122 may receive information regarding the identifications and the locations of the file blocks (B) from the name node 110. Although not shown, the processor 122 may, prior to step 204, receive an instruction, for instance, from a user or a computing device, identifying the one or more particular files (F1, F2), each of which may be considered as a set of file blocks forming a single file in totality, to be communicated to the client 120. Thus, for instance, the processor 122 may query and receive the information regarding the identifications and the locations of the file blocks (B) from the name node 110. As discussed above, the hard drive 114 of the name node 110 may contain this information.
[0034] At step 206, a scheduling optimization problem subject to one or more constraints is developed based upon the information received from the name node 110. The scheduling optimization problem includes choosing an ordered communication of file blocks from the data nodes 102a-102n to minimize the length of time required to communicate the entire file and thus achieve the highest possible output stream rate. The scheduling optimization problem is also subject to one or more constraints. More particularly, the processor 122 may determine for each time period 302 from which of the data nodes 102a-102n to communicate a particular file block (B). A chart 400 depicting an example communication schedule is provided in FIG. 4. In the chart 400, a "1" indicates a scheduled communication of a file block (B) from a data node (N) at a time period (T) and a "0" indicates that a communication of a file block (B) is not scheduled for that data node (N) at that time period (T).
[0035] As shown in FIG. 4, a communication of a first file block (B) is scheduled at the first time period (T) from the first data node (N1) and a communication of a second file block (B) is also scheduled at the first time period (T) from the third data node (N3). In addition, at the second time period (2T), a communication of a fourth file block (B) is scheduled from the first data node (N1) and a communication of a fifth file block (B) is scheduled from the second data node (N2). The communication of the file blocks (B) may be scheduled for a total number of time periods (NT) until all of the file blocks (B) of one or more desired files (F) are scheduled to be communicated.
[0036] In addition, the maximum number of communications that may be performed at any time period may be limited by the total number of threads of execution or readers 126a-126n that are available to the client 120 for the communications. Moreover, or alternatively, the maximum number of communications that may be performed at any time period may be limited by the bandwidth available to the client 120 for communicating the file blocks (B) from the data nodes 102a-102n.
[0037] According to an example, the processor 122 is configured to develop the scheduling optimization problem at step 206 as a cost function minimization problem, which is subject to a set of constraints. By way of particular example, the cost function (C) is defined as:
[0038] Equation (1): $C = \sum_{t=1}^{T_{max}} \sum_{n=1}^{M} \sum_{i=1}^{B} \left[ t + \mathrm{bufferCost}(t,i) \right] b_{i,t,n}$

[0039] In Equation (1), $b_{i,t,n}$ is a binary valued decision variable that represents a decision to communicate or not communicate a replica of the ith file block from a data node (n) at time period t, $T_{max}$ is the maximum number of time slots, M is the total number of nodes and B is the total number of blocks. In addition:

$\mathrm{bufferCost}(t,i) = \mathrm{NA}$ where $i - t < 0$ (would break the communicate-by constraint);
$\mathrm{bufferCost}(t,i) = \mathrm{ceil}(i/R) - t$ where $\mathrm{ceil}(i/R) - t = 0, 1, 2, \ldots, W$;
$\mathrm{bufferCost}(t,i) = (\mathrm{ceil}(i/R) - t)^2$ where $\mathrm{ceil}(i/R) - t \ge W + 1$;

in which W represents a window size that effectively allows the cost function to have a linear additional cost and R is the number of readers. The function $\mathrm{ceil}(i/R)$ is used instead of the actual block index i to ensure that R blocks can be communicated per time slot without triggering a penalty for communicating a block too far in advance. W may be varied according to the amount of buffering that is to be used. For instance, if more buffering is available, then W may be increased. An intent of the buffer cost is to encourage blocks to be read a relatively short time before they are needed, and to penalize communication of the blocks far in advance of when they are needed.
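By way of a non-limiting illustration, the bufferCost function and the per-variable cost coefficients of Equation (1) may be sketched in Python as follows. The sketch is an assumption-laden reading of the description: the "NA" case is represented as infinity, and the infeasibility test is made against the $\mathrm{ceil}(i/R)$ deadline rather than the raw block index, which is an interpretation rather than a statement of the original disclosure.

```python
import math

# Illustrative sketch only: the bufferCost term of Equation (1).  "NA" is
# represented here as infinity so that such assignments are never selected;
# this representation, and testing against the ceil(i/R) deadline, are
# assumptions of the sketch.

def buffer_cost(t, i, R, W):
    """t: time slot, i: block index (1-based), R: readers, W: window size."""
    slack = math.ceil(i / R) - t
    if slack < 0:
        return float('inf')          # block would arrive after its deadline
    if slack <= W:
        return slack                 # linear cost inside the window
    return slack ** 2                # quadratic penalty for reading far too early

def cost_coefficient(t, i, R, W):
    """Scalar weight of the decision variable b_{i,t,n} in Equation (1)."""
    return t + buffer_cost(t, i, R, W)

# Example: 4 readers, window of 2 slots.
print(cost_coefficient(t=1, i=3, R=4, W=2))   # slack 0  -> coefficient 1
print(cost_coefficient(t=1, i=12, R=4, W=2))  # slack 2  -> coefficient 3
print(cost_coefficient(t=1, i=20, R=4, W=2))  # slack 4  -> coefficient 17
```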
[0040] According to an example, the cost function (C) may be modified to suit a particular set of system resources. In addition, the decision variables are weighted by time to encourage communications to occur as soon as possible. Moreover, the cost function in Equation (1 ) is linear in the set of decision variables with each decision variable weighted by a scalar term. The cost function may thus be represented as an inner product:
[0041] Equation (2): $C = c \cdot x_b$.

[0042] In Equation (2), $x_b$ is the vector of decision variables $b_{i,t,n}$ and $c$ is a vector of constant coefficients computed according to the cost function equation in Equation (1). The set of decision variables is arranged in the vector $x_b$ as illustrated in the diagram 500 in FIG. 5. As shown therein, the decision variables $b_{i,t,n}$ are logically ordered first by file block number, then by data node number, and finally by time slot (only one full time slot has been shown for convenience). In the example depicted in FIG. 5, 8 file blocks are illustrated for purposes of simplicity only and should thus not be considered as limiting the present invention in any respect.

[0043] Thus, the entry representing $b_{i,t,n}$ in $x_b$ may be given by the index $i + (n-1) \cdot N + (t-1) \cdot N \cdot M$, where N is the number of data blocks to be communicated and M is the number of data nodes 102a-102n. The solution to the integer programming problem is the vector $x_b$, which defines the set of decision variables and thus the schedule for communicating the file blocks. The term $b_{i,n}$ is a binary variable that indicates whether the ith block is on node n (value is 1) or not (value is 0).
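For illustration only, the index mapping into the flattened vector $x_b$ described above may be sketched as follows; the 1-based indexing mirrors the description, and the function name is hypothetical.

```python
# Illustrative sketch only: position of the decision variable b_{i,t,n} within
# the flattened vector x_b, using the ordering described above (first by block
# number i, then by data node n, then by time slot t).

def decision_variable_index(i, n, t, num_blocks, num_nodes):
    """i, n, t are 1-based; returns the 1-based index into x_b."""
    return i + (n - 1) * num_blocks + (t - 1) * num_blocks * num_nodes

# Example: 100 blocks, 32 data nodes.
print(decision_variable_index(i=1, n=1, t=1, num_blocks=100, num_nodes=32))  # 1
print(decision_variable_index(i=5, n=2, t=3, num_blocks=100, num_nodes=32))  # 6505
```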
[0044] As discussed above, the scheduling optimization problem is subject to one or more constraints. In other words, the schedule is optimized while satisfying the one or more constraints. Various examples of constraints to which the scheduling optimization problem may be subjected to are discussed below.
[0045] In a first example, the optimization may be subject to a communicate once constraint as, for instance, defined by the following equation:
[0046] Equation (3): $\sum_{t=1}^{T_{max}} \sum_{n=1}^{M} b_{i,t,n} = 1, \quad \forall i$

In Equation (3), and in subsequent equations, the notation $\forall i$ means "for all i", that is, the constraint equation must hold for each value of the variable i. Thus, Equation (3) strictly represents a set of constraint equations, one per value of i.

[0047] As indicated by Equation (3), over all time, the ith block is communicated only once. Referring to FIG. 4, a third axis may be imagined coming out of the page which represents the index, i, of the file block being read from a data node 102a-102n in a given time slot. Equation (3) amounts to ensuring that there is only a single non-zero entry in the whole time-node plane of FIG. 4 for each block index i. In FIG. 4, each time-node plane for each block index has been effectively projected, or flattened, onto a single plane, to better illustrate the example schedule.
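As a non-limiting sketch, the communicate-once constraint of Equation (3) may be assembled as rows of an equality constraint matrix over the flattened vector $x_b$; the index helper and problem sizes below are illustrative assumptions rather than part of the original description.

```python
import numpy as np

# Illustrative sketch only: build the equality rows A_eq x_b = 1 of the
# communicate-once constraint (Equation (3)) -- one row per block index i,
# with a 1 in every position corresponding to b_{i,t,n}.

def index(i, n, t, B, M):
    """1-based (i, n, t) -> 0-based position in x_b (ordering: block, node, time)."""
    return (i - 1) + (n - 1) * B + (t - 1) * B * M

def communicate_once_rows(B, M, T):
    A_eq = np.zeros((B, B * M * T))
    for i in range(1, B + 1):
        for n in range(1, M + 1):
            for t in range(1, T + 1):
                A_eq[i - 1, index(i, n, t, B, M)] = 1.0
    b_eq = np.ones(B)
    return A_eq, b_eq

A_eq, b_eq = communicate_once_rows(B=4, M=2, T=3)
print(A_eq.shape)          # (4, 24): one row per block, one column per b_{i,t,n}
print(A_eq.sum(axis=1))    # each row has M*T = 6 ones
```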
[0048] In another example, the optimization may be subject to a collision constraint as, for instance, defined by the following equation:
[0049] Equation (4): $\sum_{i=1}^{N} b_{i,t,n} \le 1, \quad \forall t, n \text{ where } b_{i,n} \neq 0$.
[0050] As indicated by Equation (4), at most one file block is communicated at each time period from each data node 102a-102n. The collision constraint of Equation (4) may be implemented because, if multiple communications from the same data node were allowed during the same time period, the communication operations would interfere with each other: the communications would keep alternately moving the disk reading head between different parts of the disk when reading each of the different file blocks.
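As a further non-limiting sketch, the collision constraint of Equation (4) may be expressed as one inequality row per (time slot, data node) pair, restricted to blocks actually stored on that node; the `stored_on` map standing in for $b_{i,n}$ is an assumption of the sketch.

```python
import numpy as np

# Illustrative sketch only: inequality rows of the collision constraint
# (Equation (4)) -- for every (time slot t, data node n) pair, the sum of
# b_{i,t,n} over all blocks i stored on that node is at most 1.

def index(i, n, t, B, M):
    return (i - 1) + (n - 1) * B + (t - 1) * B * M

def collision_rows(B, M, T, stored_on):
    """stored_on[(i, n)] is True if block i has a replica on node n (b_{i,n} = 1)."""
    rows, bounds = [], []
    for t in range(1, T + 1):
        for n in range(1, M + 1):
            row = np.zeros(B * M * T)
            for i in range(1, B + 1):
                if stored_on.get((i, n), False):
                    row[index(i, n, t, B, M)] = 1.0
            rows.append(row)
            bounds.append(1.0)          # at most one block per node per slot
    return np.array(rows), np.array(bounds)

stored = {(1, 1): True, (2, 1): True, (2, 2): True, (3, 2): True, (4, 1): True, (4, 2): True}
A_ub, b_ub = collision_rows(B=4, M=2, T=3, stored_on=stored)
print(A_ub.shape)   # (6, 24): one row per (t, n) pair
```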
[0051] In a further example, the optimization may be subject to a modified communicate once constraint as, for instance, defined by the following equation:
[0052] Equation (5): $\sum_{i=1}^{N} b_{i,t,n} - \varepsilon_{n,t} + 2\varepsilon_{n,t-1} \le 1, \quad \forall t, n \text{ where } b_{i,n} \neq 0$.
[0053] In Equation (5), $\varepsilon_{n,t}$ is a binary valued variable defined for each data node n and time slot t. If $\varepsilon_{n,t}$ is set to one, two values of $b_{i,t,n}$ may be non-zero (that is, two blocks from the same data node may be read at the same time) provided $\varepsilon_{n,t-1}$ is zero. If $\varepsilon_{n,t-1}$ is equal to one, then no more communications are allowed because the data node is still busy with the communication from the previous time slot. Thus, Equation (5) indicates that a collision may be allowed, for instance, to support a two-on-one collision, that is, two readers reading file blocks from the same data node. In general, this will, on average, more than double the time to communicate the file blocks, and therefore the same data node cannot be communicated from in the next time slot. [0054] In a yet further example, the optimization may be subject to a completion constraint as, for instance, defined by the following equation:
[0055] Equation (6): $b_{i,t,n} = 0 \text{ for } t > T_{max}$.

[0056] In Equation (6), the number of time slots is limited to $T_{max}$, and the value of $T_{max}$ impacts the number of decision variables and is thus selected to be as small as possible for reasons of efficiency while still permitting a feasible solution to be found. If the condition set forth in Equation (6) cannot be met, then there will be no feasible solution to the scheduling optimization problem. Generally, the cost function will drive a solution to minimize the time to complete the schedule, and thus it may be acceptable to choose $T_{max}$ to be a little larger than would be anticipated, to guarantee that a feasible solution can be found without expanding the problem size excessively.
[0057] In a yet further example, the optimization may be subject to a communicate-by constraint. In other words, the ith file block should be communicated by, that is, up to and including, the ith time period. This is the deadline for communicating the ith file block because that file block should be streamed during time period i+1, so it must be in the buffer at the end of the ith time period. According to an example, for each data node n storing file block i,

$0 \le b_{i,t,n} \le 1, \quad \forall\, t \le i$
$b_{i,t,n} = 0, \quad \forall\, t > i$

where the second part may be expressed in a single constraint formed from the sum:

$\sum_{t=i+1}^{T_{max}} \sum_{n=1}^{M} b_{i,t,n} = 0, \quad \forall i \text{ where } b_{i,n} = 1$.
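As a non-limiting sketch, the communicate-by constraint may also be checked against a candidate schedule; the schedule representation used below (a list of (block, node) pairs per time slot) is an assumption of the sketch rather than part of the original description.

```python
# Illustrative sketch only: verify that a candidate schedule satisfies the
# communicate-by constraint -- block i must be communicated in time period i
# or earlier, so that it is buffered before it is streamed in period i+1.

def satisfies_communicate_by(schedule):
    """schedule: list of time slots in order (slot 1 first); each slot is a
    list of (block_index, node_id) pairs scheduled for that slot."""
    for t, slot in enumerate(schedule, start=1):
        for block, _node in slot:
            if t > block:              # communicated after its deadline
                return False
    return True

ok_schedule = [[(1, 1), (2, 3)], [(3, 2), (4, 1)]]   # every block by its deadline
bad_schedule = [[(2, 3)], [(1, 1)]]                  # block 1 arrives in slot 2 > deadline 1
print(satisfies_communicate_by(ok_schedule))   # True
print(satisfies_communicate_by(bad_schedule))  # False
```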
[0058] In a yet further example, the optimization may be subject to a buffer constraint. This constraint enforces limits on the maximum number of file blocks that may be stored in the buffer 128. This may be approximated by modifying the communicate-by constraint to enforce that the ith block may only be communicated within m time periods before the ith deadline, as denoted by the following:

$0 \le b_{i,t,n} \le 1, \quad \forall\, i - m \le t \le i$
$b_{i,t,n} = 0, \quad \forall\, t < i - m \text{ and } \forall\, t > i$
[0059] According to an example, instead of a "hard" buffer constraint, the cost function (C) discussed above with respect to Equation (1 ) allows for the cost of buffering to be included. This is because the bufferCost function assigns no cost to storing a file block in memory if it is communicated 1 or 2 time slots before it is required to be serialized to the output. However, if the file block is buffered more than, for instance, 2 time slots before it is required, a higher penalty cost is assigned, given by a quadratic expression.
[0060] In a yet further example, the optimization may be subject to a reader constraint. In this constraint, the number of reader 126a-126n threads executed at the client 120 is limited to a maximum R. In addition, for each time period t, there may be at most R non-zero decision variables. By way of example, the number of readers 126a-126n may be limited by the processor 122 or networking resources available to the client. An example of the reader constraint may be defined according to the following equation:
[0061] Equation (7): $\sum_{n=1}^{M} \sum_{i=1}^{N} b_{i,t,n} \le R, \quad \forall t$,

which geometrically may be interpreted as allowing R non-zero values in each node-block index plane defined by each time slot.
[0062] In a yet further example, the optimization may be subject to a failure constraint. In this constraint, a sufficient number of spare time slots are reserved to perform a communication from a secondary data node to provide a contingency in the event of a communication failure by a primary data node. In addition, the second communication from the secondary data node is required to occur later than the first communication and from a different data node than the primary data node. Moreover, the maximum number of time slots held in reserve may be set to, for instance, a few more than the maximum number of file blocks stored on any given data node 102a-102n to substantially ensure that if a data node 102a-102n completely fails, then all of the file blocks on that data node 102a-102n will be communicated from one or more of the remaining nodes 102a-102n. [0063] The failure constraint may be addressed by decomposing the communications schedule into multiple parts, where between the completion of one schedule and the start of the next, a whole time period is allocated to allow for communication of the data blocks that could not be communicated at the allotted time. By way of particular example in which a client 120 has 4 reader threads, the content at the output may be streamed at 4 times the speed that the content is read from any given data node 102a-102n. In this example, the output stream should be written at a slower rate than the maximum speed at which the content may be streamed to allow for some contingency to re-communicate file blocks from one or more other data nodes 102a-102n, in the event that one or more of the data nodes 102a-102n fail. In addition, the re-communication of the file blocks should be scheduled such that the failed node is avoided. To meet these requirements, for instance, every 4th timeslot may be left free to provide sufficient time for re-communications to be performed.
[0064] An example of a communications schedule in which contingency periods are provided is depicted in the diagram 600 in FIG. 6. As shown therein, a relatively large number of file blocks, for instance, 12 file blocks, are buffered at startup prior to writing any of the data blocks to the output stream. For convenience of illustration, FIG. 6 assumes that the blocks are readable in order from the data nodes 102a-102n; however, this may not typically be the case in practice. Thus, in the example depicted in FIG. 6, file block 1 is communicated from data node 1 at time slot 1, etc.
[0065] As also shown in FIG. 6, the fourth time slot is a first planned contingency period. However, because each of the file blocks 1-12 has been communicated, none of the file blocks 1-12 is required to be communicated during the first planned contingency period. Continuing along to the 5th-7th time slots in the diagram 600, the data node 3 (N3) is depicted as having failed in the 6th and 7th time slots, which are shown as shaded squares. As such, file blocks 19 and 23 have been communicated in error or have not been communicated at all. Thus, at the 8th time slot, which is the second planned contingency period, the file blocks 19 and 23 are re-communicated from the second data node (N2) and the fourth data node (N4), respectively, which are depicted with underlines. File blocks 19 and 23 are thus re-communicated during the second contingency period from data nodes other than the data node N3. In fact, all communications from data node N3 now need to be re-scheduled. Hence, file blocks 27, 31, 35, etc., need to be communicated from data nodes other than the data node N3 during future contingency periods.
[0066] As discussed above, the contingency periods are scheduled to enable sufficient opportunity to re-communicate all of the file blocks scheduled to be communicated from any one data node 102a-102n, while still maintaining one file block per data node during the contingency periods. To represent the contingency constraint, additional constraints are applied while adjusting other constraints. For instance, a contingency period must be provided at least as often as there are readers. As another example, the contingency period may be provided more often than there are readers, to ensure that a re-communication of a file block may occur fairly soon after the file block's first scheduled communication, in order to minimize the number of data blocks held in the buffer 128 and to satisfy the communicate-by constraint discussed above.
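For purposes of illustration only, the re-communication behaviour described above may be simulated as follows; the contingency period spacing (every fourth slot, matching the four-reader example of FIG. 6) and the data structures are assumptions of the sketch.

```python
# Illustrative sketch only: with a contingency slot every `period` slots
# (every 4th slot in the four-reader example), blocks whose communication
# from a failed node was lost are re-communicated in the next contingency
# slot from a different data node holding a replica.

def contingency_slots(total_slots, period=4):
    """Set of contingency time slots S_c, e.g. {4, 8, 12, ...}."""
    return set(range(period, total_slots + 1, period))

def reschedule_failed(failed_blocks, replicas, failed_node, contingency_slot):
    """failed_blocks: block indices lost due to the failure of failed_node.
    replicas: dict block -> set of nodes holding a replica.
    Returns (block, node, slot) triples, at most one block per surviving node."""
    used, plan = set(), []
    for block in failed_blocks:
        candidates = sorted(replicas[block] - {failed_node} - used)
        if candidates:                       # re-read from a different node
            node = candidates[0]
            plan.append((block, node, contingency_slot))
            used.add(node)
    return plan

print(contingency_slots(total_slots=12))                        # {4, 8, 12}
replicas = {19: {2, 3}, 23: {3, 4}}
print(reschedule_failed([19, 23], replicas, failed_node=3, contingency_slot=8))
# [(19, 2, 8), (23, 4, 8)] -- consistent with the FIG. 6 example
```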
[0067] For instance, $t_c$ is defined to be a contingency time, which may take one of the values from the set of contingency time periods denoted $S_c$. Assuming that one data node failure is allowed at any time, the contingency constraints in this particular example illustrated in FIG. 6 are of the form:

[0068] Equation (8): $b_{i,t_c-3,n} - \sum_{j=1, j \neq n}^{M} b_{i,t_c,j} = 0$,

[0069] Equation (9): $b_{i,t_c-2,n} - \sum_{j=1, j \neq n}^{M} b_{i,t_c,j} = 0$, and

[0070] Equation (10): $b_{i,t_c-1,n} - \sum_{j=1, j \neq n}^{M} b_{i,t_c,j} = 0$,

for $\forall n$, $\forall i$ and $\forall t_c$ where $b_{i,n} = 1$ and $t_c \in S_c$.
[0071] In effect, the constraints ensure that each block communicated in the preceding three time slots can be re-communicated in the following contingency slot. As noted by Equations (8)-(10), there are three time periods between each contingency period, hence the three separate sets of equations. The set of contingency times is, therefore, $S_c = \{4T, 8T, 12T, \ldots\}$. Equations (8)-(10) may be adapted to an arbitrary frequency of contingency periods. In addition, Equations (8)-(10) model the failure of each data node. The constraints allow for a failure of any one of Data Node 1, Data Node 2, Data Node 3, or Data Node 4. Under each condition, a schedule is to be derived for communicating the file blocks scheduled to be communicated from the failed data node from the remaining working nodes. If a solution to the scheduling optimization problem is sought using all of these constraints simultaneously, then this will conflict with the collision constraint, which must, therefore, be relaxed to:
[0072] Equation (11): $\sum_{i=1}^{N} b_{i,t,n} \le 3, \quad \forall n, \; t \in S_c \text{ and where } b_{i,n} \neq 0$.
[0073] In Equation (11), the right-hand limit has been increased from 1 to 3, which is one less than the contingency period, because the node itself will be modelled as failed in one set of the equations above. The communicate-once constraint will also need to be modified so that it is not applied at times that fall within the contingency time period slots.
[0074] In a yet further example, the optimization may be subject to an initial conditions constraint. In this constraint, the distribution of the file blocks is required to be known. The distribution of the file blocks may be represented by the variables $b_{i,n}$. These variables are either 0 or 1, indicating whether or not the ith block is stored on the nth data node.
[0075] At step 208, a solution to the scheduling optimization problem subject to the one or more constraints is determined to derive a schedule for optimized communication of the file blocks (B) from two or more of the data nodes 102a-102n. According to an example, the scheduling optimization problem may be solved using any of several different techniques. Example techniques include integer programming (branch and bound), genetic algorithms, random sampling guided by heuristics, etc. [0076] For large files that have been split into a relatively large number of file blocks, there exist a relatively large number of decision variables. For example, in a distributed file system 100 with 32 data nodes 102a-102n and 100 file blocks, there are 3200 decision variables per time period. In addition, assuming that a client 120 includes 4 readers 126a-126n, at best there will be 25 time periods before all of the file blocks may be written, which equates to approximately 80000 decision variables. According to an example, the number of decision variables may substantially be reduced by decomposing the scheduling optimization problem into a plurality of sub-problems. In this example, for instance, a schedule for communicating a first set of P file blocks, a schedule for communicating a second set of P file blocks, and so on may separately be derived.
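As a non-limiting sketch, and assuming SciPy 1.9 or later is available (which provides scipy.optimize.milp), a small instance of the scheduling optimization problem may be assembled and solved as an integer program; the problem sizes, replica placement, window parameters and the use of a large finite cost in place of the "NA" case are all assumptions of the sketch, and integer programming is only one of the solution techniques mentioned above.

```python
import math
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Illustrative sketch only: a tiny scheduling instance posed as an integer
# program.  B blocks, M data nodes, T time slots, R readers; `stored` plays
# the role of the block location indicators b_{i,n}.
B, M, T, R, W = 4, 2, 4, 2, 2
stored = {1: {1}, 2: {1, 2}, 3: {2}, 4: {1, 2}}     # assumed replica placement

def idx(i, n, t):                                   # 0-based position of b_{i,t,n}
    return (i - 1) + (n - 1) * B + (t - 1) * B * M

def coeff(t, i):                                    # weight derived from Equation (1)
    slack = math.ceil(i / R) - t
    if slack < 0:
        return 1e6                                  # effectively forbids late reads
    return t + (slack if slack <= W else slack ** 2)

nvar = B * M * T
c = np.full(nvar, 1e6)                              # large cost where no replica exists
for t in range(1, T + 1):
    for i in range(1, B + 1):
        for n in stored[i]:
            c[idx(i, n, t)] = coeff(t, i)

# Communicate-once (equality) rows: each block exactly once over all slots/nodes.
A_eq = np.zeros((B, nvar))
for i in range(1, B + 1):
    for n in stored[i]:
        for t in range(1, T + 1):
            A_eq[i - 1, idx(i, n, t)] = 1
# Collision and reader (inequality) rows.
rows, limits = [], []
for t in range(1, T + 1):
    for n in range(1, M + 1):                       # at most one block per node per slot
        row = np.zeros(nvar)
        row[[idx(i, n, t) for i in range(1, B + 1)]] = 1
        rows.append(row)
        limits.append(1)
    slot = np.zeros(nvar)                           # at most R communications per slot
    slot[(t - 1) * B * M: t * B * M] = 1
    rows.append(slot)
    limits.append(R)

res = milp(c=c,
           constraints=[LinearConstraint(A_eq, 1, 1),
                        LinearConstraint(np.array(rows), -np.inf, np.array(limits))],
           integrality=np.ones(nvar),
           bounds=Bounds(0, 1))
for p in np.flatnonzero(np.round(res.x) == 1):
    t, rem = divmod(p, B * M)
    n, i = divmod(rem, B)
    print(f"slot {t + 1}: block {i + 1} from node {n + 1}")
```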
[0077] At step 210, the file blocks (B) are communicated from two or more of the data nodes 102a-102n according to the schedule derived at step 208 and are stored in the buffer 128. In addition, at step 212, the processor 122 processes the communicated and stored file blocks to compile them in the correct order. Moreover, at step 214, the processor 122 may output the compiled file blocks as an output stream to be used by the client 120 or to be outputted to the output device 140.
[0078] According to an example, the method 200 may be modified to also provide an optimized communications schedule when more than one file is to be concurrently communicated by multiple clients 120, as discussed in greater detail herein below. This situation may arise, for instance, when there are multiple printers that are concurrently printing multiple files stored on the distributed file system 100. The method 200 may also be modified such that at least one of the multiple clients 120 writes file blocks to the data nodes 102a-102n, while at least one of the multiple clients 120 communicates file blocks from the data nodes 102a-102n. The method 200 may further be modified to cause the file blocks to be allocated to the data nodes in a manner that ensures that an efficient communications ordering of the file blocks is possible, subject to the one or more constraints.
[0079] In instances where there are multiple concurrent client communications from the distributed file system 100, two situations may arise. In a first situation, the readers are executing on the same client 120 and, in a second situation, the readers are executing on separate clients 120. This distinction only affects the constraint on the number of readers: either the readers have to be split between the different files, or else there are twice as many readers available in total, with one half of the total allocated to read each file.
[0080] To allow for multiple concurrent clients 120, the optimized scheduling problem may be approached in one of two ways. In a first approach, the solution to the first client problem may be determined, for instance, under some possibly stricter constraints to avoid crowding out the second client, and the solution may be used as an additional set of constraints to apply to the solution of the second client problem. In a second approach, both the first client problem and the second client problem may be solved concurrently. In the former case, if data node n is occupied for some set of times $\{t_1, \ldots, t_p\}$ by the first client, then that data node should not be communicated from during those times by the second client. This condition may be expressed in equation form as:

[0081] Equation (12): $\sum_{i=1}^{N} b_{i,t,n} = 0$ for each such occupied data node n and $t \in \{t_1, \ldots, t_p\}$.
[0082] In the latter case, the constraint equations described above may be combined into a larger optimized scheduling problem. In developing the larger optimized scheduling problem, a decision to combine equations may be made on a case by case basis. By way of example, consider that the constraint equations discussed above have been represented in matrix form for solution, for instance, using an engineering program such as MATLAB. The first client problem and the second client problem thus initially form two independent problems, each minimizing the cost function in Equation (1) subject to:

[0083] Equation (13): $A_1 x_{b_1} \le k_1$ and $A_2 x_{b_2} \le k_2$.

[0084] Each set of constraint coefficients may be taken from the matrices $A_1$ and $A_2$ (each constraint represented by a row of the matrix), in turn yielding sets of vectors $a_j^1$ and $a_j^2$ from the first and second problems respectively, where j denotes the jth row of the matrix. The combined constraints may be represented as:

[0085] Equation (14): $A_c x_c \le k_c$, where $x_c = [x_{b_1}; x_{b_2}]$ is the stacked vector of decision variables from the two sub-problems. Forming the rows of $A_c$ will require combining the vectors from the sub-problems in different ways according to the type of constraint they represent. For the communicate-once and block location constraints, each such constraint equation from the sub-problems is zero padded, that is, each corresponding row of the matrix $A_c$ takes one of the following forms:

$[a_j^1 \; 0] \, x_c \le 1$
$[0 \; a_j^2] \, x_c \le 1$

according to whether the constraint arose from the first or the second sub-problem respectively, and where 0 is an appropriately dimensioned vector of zeros. For the reader and communicate-by constraints, the constraints may be combined into a single constraint as indicated in the following equation:

[0086] Equation (15): $[a_j^1 \; a_j^2] \, x_c \le X$, where X is either 1 or the number of readers, depending on the original constraint.
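As a non-limiting sketch, the zero padding and row combination described above may be expressed as follows; the per-client rows are small placeholders rather than the full scheduling constraint matrices, and the function names are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: combine constraint rows of two independent client
# sub-problems into one problem over the stacked vector x_c = [x_b1; x_b2].
# Per-client rows (communicate-once, block location) are zero padded; shared
# rows (reader, communicate-by) are concatenated into a single row.

def pad_first(a1, n2):
    return np.concatenate([a1, np.zeros(n2)])      # [a_j^1  0] x_c <= k

def pad_second(a2, n1):
    return np.concatenate([np.zeros(n1), a2])      # [0  a_j^2] x_c <= k

def combine_shared(a1, a2):
    return np.concatenate([a1, a2])                # [a_j^1  a_j^2] x_c <= X

# Placeholder rows from two sub-problems with 6 decision variables each.
a1 = np.array([1., 0., 1., 0., 0., 0.])
a2 = np.array([0., 1., 0., 0., 1., 0.])
print(pad_first(a1, a2.size))       # acts only on the first client's variables
print(pad_second(a2, a1.size))      # acts only on the second client's variables
print(combine_shared(a1, a2))       # shared limit, e.g. the number of readers R
```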
[0087] Some or all of the operations set forth in the method 200 may be contained as one or more utilities, programs, or subprograms, in any desired computer accessible or readable storage medium. In addition, the method 200 may be embodied by a computer program, which may exist in a variety of forms both active and inactive. For example, it can exist as software program(s) comprised of program instructions in source code, object code, executable code or other formats. Any of the above can be embodied on a computer readable storage medium.
[0088] Exemplary computer readable storage devices or media include conventional computer system RAM, ROM, EPROM, EEPROM, and magnetic or optical disks or tapes. Concrete examples of the foregoing include distribution of the programs on a CD ROM or via Internet download. It is therefore to be understood that any electronic device capable of executing the above-described functions may perform those functions enumerated above.
[0089] FIG. 7 illustrates a computer system 700, which may be employed to perform the various functions of the client 120 described herein above, according to an example. In this respect, the computer system 700 may be used as a platform for executing one or more of the functions described hereinabove with respect to the client 120.
[0090] The computer system 700 includes a processor 702, which may be used to execute some or all of the steps described in the method 200. Commands and data from the processor 702 are communicated over a communication bus 704. The computer system 700 also includes a main memory 706, such as a random access memory (RAM), where the program code may be executed during runtime, and a secondary memory 708. The secondary memory 708 includes, for example, one or more hard disk drives 710 and/or a removable storage drive 712, representing a floppy diskette drive, a magnetic tape drive, a compact disk drive, etc.
[0091] The removable storage drive 712 reads from and/or writes to a removable storage unit 714 in a well-known manner. User input and output devices may include a keyboard 716, a mouse 718, and a display 720. A display adaptor 722 may interface with the communication bus 704 and the display 720 and may receive display data from the processor 702 and convert the display data into display commands for the display 720. In addition, the processor 702 may communicate over a network, for instance, the Internet, LAN, etc., through a network adaptor 724.
[0092] It will be apparent to one of ordinary skill in the art that other known electronic components may be added or substituted in the computer system 700. In addition, the computer system 700 may include a system board or blade used in a rack in a data center, a conventional "white box" server or computing device, etc. Also, one or more of the components in FIG. 7 may be optional (for instance, user input devices, secondary memory, etc.). [0093] What has been described and illustrated herein is a preferred embodiment of the invention along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Those skilled in the art will recognize that many variations are possible within the scope of the invention, which is intended to be defined by the following claims— and their equivalents— in which all terms are meant in their broadest reasonable sense unless otherwise indicated.

Claims

What is claimed is:
1. A method for optimizing a scheduling of file block communications for a client from a plurality of data nodes in a virtual distributed file system, wherein the plurality of data nodes contain redundant copies of the file blocks, said method comprising steps performed by a processor of:
deriving a schedule for communicating the file blocks substantially concurrently from multiple ones of the plurality of data nodes that minimizes the amount of time required to communicate the file blocks from the multiple ones of the plurality of data nodes to the client.
2. The method according to claim 1, further comprising steps performed by a processor of:
setting discrete time periods for communicating the file blocks;
setting one or more constraints associated with communicating the file blocks; and
wherein deriving the schedule further comprises deriving the schedule for communicating the file blocks substantially concurrently from multiple ones of the plurality of data nodes in the discrete time periods that minimizes the amount of time required to communicate the file blocks from the multiple ones of the plurality of data nodes to the client subject to the one or more constraints.
3. The method according to claim 2, wherein deriving the schedule further comprises deriving the schedule by minimizing a cost function associated with communicating the file blocks from the multiple ones of the plurality of data nodes subject to the one or more constraints.
4. The method according to claim 3, wherein minimizing the cost function further comprises minimizing the cost function (C) through solving for:
$C = \sum_{t=1}^{T_{max}} \sum_{n=1}^{M} \sum_{i=1}^{N} \left[ t + \mathrm{bufferCost}(t,i) \right] b_{i,t,n}$,

wherein $b_{i,t,n}$ is a binary valued decision variable that represents a decision to communicate or not communicate a replica of the ith file block from a data node (n) at time period t, and wherein

$\mathrm{bufferCost}(t,i) = \mathrm{NA}$ where $i - t < 0$ (would break the communicate-by constraint);
$\mathrm{bufferCost}(t,i) = \mathrm{ceil}(i/R) - t$ where $\mathrm{ceil}(i/R) - t = 0, 1, 2, \ldots, W$;
$\mathrm{bufferCost}(t,i) = (\mathrm{ceil}(i/R) - t)^2$ where $\mathrm{ceil}(i/R) - t \ge W + 1$;

wherein W represents a window size that effectively allows linear cost and R is the number of readers.
5. The method according to any of claims 2-4, further comprising: developing a scheduling optimization problem with one or more constraints; and
wherein deriving the schedule further comprises solving the scheduling optimization problem subject to the one or more constraints, wherein the solution to the scheduling optimization problem results in the schedule.
6. The method according to claim 5, further comprising:
decomposing the scheduling optimization problem into a plurality of sub- problems, and solving for the plurality of sub-problems to derive a plurality of schedules for communicating the file blocks.
7. The method according to any of claims 2-6, wherein setting the one or more constraints further comprises setting the one or more constraints to include at least one of a constraint that each of the file blocks is communicated only once, a constraint that at most one file block during each time period is communicated from each data node, a constraint that all of the file blocks are communicated in a maximum allowed time, a constraint that an ith block is communicated by the end of the ith time period, a constraint that limits a maximum number of file blocks that are storable in a buffer based upon a maximum size of the buffer, a constraint that limits a number of reader threads executed at the client at each time period, a constraint that there are sufficient spare time periods reserved to perform a communication from a secondary data node in the event of a communication failure from a primary data node, and a constraint that the distribution of the file blocks in the plurality of data nodes is known prior to deriving the schedule.
8. The method according to any of claims 2-7, wherein setting the discrete time periods for communicating the file blocks further comprises setting at least one contingency period between the discrete time periods, wherein the contingency period is positioned within the discrete time periods to enable sufficient opportunity for file blocks that have been mis-communicated to be re-communicated from the data nodes.
9. The method according to any of claims 1 -8, further comprising: communicating the file blocks from the plurality of data nodes over a plurality of the discrete time periods;
storing the communicated file blocks in a buffer;
compiling the stored file blocks to arrange the file blocks in a correct order; and
outputting the compiled file blocks as an output stream.
10. An apparatus for optimizing a scheduling of file block communications for a client from a plurality of data nodes in a virtual distributed file system, wherein the plurality of data nodes contain redundant copies of the file blocks, said apparatus comprising:
a buffer for storing at least one of the communicated file blocks at a time; and
a processor for deriving a schedule for communicating the file blocks substantially concurrently from multiple ones of the plurality of data nodes, wherein the schedule minimizes the amount of time required to communicate the file blocks from the multiple ones of the plurality of data nodes to the apparatus.
11. The apparatus according to claim 10, wherein the processor is further configured to set discrete time periods for communicating the file blocks, set one or more constraints associated with communicating the file blocks, and derive the schedule for communicating the file blocks substantially concurrently from multiple ones of the plurality of data nodes in the discrete time periods that minimizes the amount of time required to communicate the file blocks from the multiple ones of the plurality of data nodes to the apparatus subject to the one or more constraints.
12. The apparatus according to claim 11 , wherein the processor is further configured to derive the schedule by minimizing a cost function associated with communicating the file blocks from the multiple ones of the plurality of data nodes subject to the one or more constraints.
13. The apparatus according to any of claims 11 and 12, wherein the processor is further configured to develop a scheduling optimization problem with one or more constraints and to solve the scheduling optimization problem subject to the one or more constraints, wherein the solution to the scheduling optimization problem results in the schedule.
14. The apparatus according to any of claims 11-13, wherein the processor is further configured to set the one or more constraints to include at least one of a constraint that each of the file blocks is communicated only once, a constraint that at most one file block during each time period is communicated from each data node, a constraint that all of the file blocks are communicated in a maximum allowed time, a constraint that an ith block is communicated by the end of the ith time period, a constraint that limits a maximum number of file blocks that are storable in a buffer based upon a maximum size of the buffer, a constraint that limits a number of reader threads executed at the client at each time period, a constraint that there are sufficient spare time periods reserved to perform a communication from a secondary data node in the event of a communication failure from a primary data node, and a constraint that the distribution of the file blocks in the plurality of data nodes is known prior to deriving the schedule.
15. A computer readable storage medium on which is embedded one or more computer programs, said one or more computer programs implementing a method for optimizing a scheduling of file block communications for a client from a plurality of data nodes in a virtual distributed file system, wherein the plurality of data nodes contain redundant copies of the file blocks, said one or more computer programs comprising computer code for:
setting discrete time periods for communicating file blocks from the plurality of data nodes; setting one or more constraints associated with communicating the file blocks; and
deriving a schedule for communicating the file blocks substantially concurrently from multiple ones of the plurality of data nodes in the discrete time periods that minimizes the amount of time required to communicate the file blocks from the multiple ones of the plurality of data nodes to the client subject to the one or more constraints.