WO2014088624A1 - A method and apparatus for code length adaptation for access to key-value based cloud storage systems - Google Patents

A method and apparatus for code length adaptation for access to key-value based cloud storage systems

Info

Publication number
WO2014088624A1
WO2014088624A1 (PCT application PCT/US2013/030926)
Authority
WO
WIPO (PCT)
Prior art keywords
request
blocks
key
fec
portions
Prior art date
Application number
PCT/US2013/030926
Other languages
French (fr)
Inventor
Guanfeng LIANG
Ulas C. Kozat
Original Assignee
Ntt Docomo, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ntt Docomo, Inc. filed Critical Ntt Docomo, Inc.
Priority to US14/649,530 priority Critical patent/US20150309874A1/en
Publication of WO2014088624A1 publication Critical patent/WO2014088624A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/37Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35
    • H03M13/3761Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35 using code combining, i.e. using combining of codeword portions which may have been transmitted separately, e.g. Digital Fountain codes, Raptor codes or Luby Transform [LT] codes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • Embodiments of the present invention relate to the field of storage systems; more particularly, embodiments of the present invention relate to the use of forward error correction (FEC) in the storage and retrieval of objects in storage systems.
  • The delay for a single read or write operation on small objects can be hundreds of milliseconds, while for medium-size objects (e.g., > 1 Mbyte) delays can reach the order of seconds at the 99th and 99.9th percentiles.
  • these delays can be unacceptably high.
  • For video content that consists of many megabytes, how to use S3-type storage as the video archive while attaining small startup delays and no pauses during video playback has also become a critical issue.
  • Figure 1 illustrates the system proposed in these papers for read-only scenario. Every file to be read is first divided into K equal-sized blocks and encoded into N coded blocks with a (N,K) FEC code. There are N servers and every server stores one different coded block for each file. To serve a file-read request, a request dispatcher issues K read operations to the first K servers that become idle for the K coded blocks stored on these servers. These K servers are kept active until all read operations for all K coded blocks have completed, and then they become idle again and can serve other file-read requests.
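  • The (N,K) scheme of Figure 1 can be sketched as follows. This is an illustrative toy, not the papers' actual code: it uses a systematic (3, 2) XOR parity code, so K = 2 source blocks become N = 3 coded blocks (one per server), and any K of the N blocks suffice to recover the file.

```python
# Toy (N, K) = (3, 2) systematic XOR code: illustration only, not the
# scheme's real erasure code (which would be a general (N, K) MDS code).

K, N = 2, 3

def encode(data: bytes) -> list:
    """Split data into K equal-sized source blocks and append N-K parity blocks."""
    assert len(data) % K == 0
    size = len(data) // K
    blocks = [data[i * size:(i + 1) * size] for i in range(K)]
    parity = bytes(a ^ b for a, b in zip(blocks[0], blocks[1]))
    return blocks + [parity]          # one coded block per server

def decode(received: dict) -> bytes:
    """Recover the file from any K of the N coded blocks.
    `received` maps block index -> block bytes."""
    assert len(received) >= K
    if 0 in received and 1 in received:
        b0, b1 = received[0], received[1]
    elif 0 in received:               # rebuild block 1 from block 0 XOR parity
        b0 = received[0]
        b1 = bytes(a ^ b for a, b in zip(b0, received[2]))
    else:                             # rebuild block 0 from block 1 XOR parity
        b1 = received[1]
        b0 = bytes(a ^ b for a, b in zip(b1, received[2]))
    return b0 + b1

data = b"abcdefgh"
coded = encode(data)
# Read from the first K servers that become idle -- e.g., servers 1 and 2:
assert decode({1: coded[1], 2: coded[2]}) == data
```

Any other K-subset of servers works equally well, which is what lets the dispatcher pick whichever K servers become idle first.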
  • the dispatcher then performs FEC decoding to recover the original file from the K coded blocks.
  • every file is coded with a fixed (N,K) FEC code.
  • every request is served by exactly K parallel read operations (from K servers), i.e., the minimum number, so zero overhead is introduced.
  • In the second paper, if a request is directed to a coded chunk stored on a hot (heavily loaded) node, extra data is read in parallel from other servers to reconstruct the chunk stored on the hot node and provide it to the client.
  • FEC is used for storage durability/availability purposes while still trying to minimize the amount of data to be read.
  • the utility is usually modeled as a concave function of the throughput received by each session, which is in turn determined by how much link capacity is allocated for that session on every link in the communication network.
  • the system designer has to allocate link capacities for each session so that the overall utility is maximized, and control the unicast/multicast/broadcast rate for each session so that the amount of traffic injected conforms to the allocated link capacities.
  • throughput is the only concern, and although coding is also used, it is used merely to achieve multicasting capacity for each session given the link capacity allocation. As a result, there is zero redundancy when using network coding.
  • A method and apparatus are disclosed for code length adaptation for access to key-value based storage systems.
  • the method comprises receiving a data object and a request; dividing the data object into K portions, where K is an integer; selecting an FEC coding rate based on backlog associated with at least one queue; applying FEC coding based on the FEC rate set to the K portions to create N FEC coded data blocks, where N is an integer greater than or equal to K; and sending the N FEC coded data blocks to the storage system.
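  • The claimed write path can be sketched as below. The backlog thresholds and the repeated-XOR "FEC" are illustrative assumptions only (a real implementation would use a proper erasure code with distinct parity blocks and thresholds derived from measured delays):

```python
# Hedged sketch of the claimed write path: divide the object into K portions,
# select a rate N from queue backlog, encode to N blocks. The thresholds and
# the toy parity code below are assumptions, not the patent's own values.

def select_n(k: int, backlog: int, max_extra: int = 4) -> int:
    """Pick N from queue backlog: light load -> more redundancy,
    heavy load -> less (down to no coding, N == K)."""
    if backlog < 10:
        return k + max_extra
    if backlog < 50:
        return k + max_extra // 2
    return k                                      # system busy: zero redundancy

def fec_encode(portions: list, n: int) -> list:
    """Append n - K XOR parity blocks (toy code; real FEC parities differ)."""
    k = len(portions)
    parity = portions[0]
    for p in portions[1:]:
        parity = bytes(a ^ b for a, b in zip(parity, p))
    return portions + [parity] * (n - k)

def handle_write(obj: bytes, k: int, backlog: int) -> list:
    size = -(-len(obj) // k)                      # ceil division
    obj = obj.ljust(size * k, b"\0")              # pad to K equal portions
    portions = [obj[i * size:(i + 1) * size] for i in range(k)]
    n = select_n(k, backlog)
    return fec_encode(portions, n)                # N blocks sent to storage

blocks = handle_write(b"hello world!", k=4, backlog=3)
assert len(blocks) == 8                           # K=4 source + 4 parity blocks
```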
  • Figure 1 illustrates a prior art storage arrangement that uses FEC to reduce queuing delay in data centers.
  • Figure 2 illustrates an example of multiple multicasting sessions competing for network capacity.
  • Figure 3 is a block diagram of one embodiment of a storage system.
  • Figure 4 is a block diagram of one embodiment of an application executed by a store client.
  • Figure 5 is a flow diagram of one embodiment of a process for request handling performed by a classifier and a scheduler.
  • Figure 6 is a flow diagram of one embodiment of a process for read request handling by the request handler.
  • Figure 7 is a flow diagram of one embodiment of a process for write request handling by the request handler.
  • Figure 8 illustrates parallel threads that execute read/write (R/W) tasks and obtain new tasks from a task queue when they finish servicing a current task.
  • Figure 9 illustrates an example of thresholding Fc() functions, with two categories R (read) and W (write).
  • Figure 10 is a flow diagram of one embodiment of a process for determining Nc given a set of thresholds.
  • Figure 11 illustrates how raw delay-performance data for different cloud locations is stored in a database.
  • Figure 12 is a flow diagram of one embodiment of a process for computing thresholds for a category C.
  • Figure 13 is a flow diagram of one embodiment of a process for estimating Ac and Δ in an online fashion.
  • Figure 14 is a flow diagram of one embodiment of a process performed by a storage controller such as a store client.
  • Figure 15 depicts a block diagram of a storage gateway or a client device.
  • Embodiments of the present invention include methods and apparatuses that adaptively adjust the level of code redundancy to provide robust throughput-delay tradeoff when using FEC code for delay improvement in storing and retrieving data objects including videos, images, documents, meta-data, etc. in public cloud-based storage such as, for example, Amazon S3 or in private cloud-based storage systems (potentially of different size and different delay requirements).
  • FEC is beneficial because the time a request spends being served is significantly reduced by parallelism.
  • using FEC is detrimental because it creates redundant write or read requests, which further increases system utilization and causes requests to spend significantly more time waiting to be served.
  • Embodiments of the present invention adapt the FEC rate (including no coding) used by different categories of requests according to the backlog size, so that the overall delay performance is improved.
  • the techniques described herein can be used by the host devices where data is produced and/or consumed, as well as by proxy nodes that sit between the host devices and the public storage facility.
  • a public storage facility is accessed using an API that opens connections between the API client (host or proxy nodes) and the API server (residing in the storage facility).
  • clients can issue put, get, delete, copy, list, etc. requests where appropriate providing security credentials, local keys and global names to uniquely identify the objects, byte strings that represent the object, etc.
  • Although clients are agnostic to how their requests are operationally carried out within the public cloud, they are sensitive to the end-to-end delays incurred in resolving their requests.
  • Measurement studies indicate that even when there is enough network bandwidth and the clients are very close to the storage nodes, there are substantial long tails in the delay distributions, with the worst 1% and 0.1% of requests observing much worse delays than the average. Measurement studies also indicate that the delay performances of parallel requests on different keys are weakly correlated.
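  • That weak correlation is why redundant parallel reads help: waiting for the fastest K of N responses truncates the tail. The sketch below demonstrates this with synthetic, independent exponential delays (an assumed model, not the measured distributions):

```python
# Why redundant parallel reads help: with independent per-request delays,
# collecting the fastest K of N responses truncates the delay tail.
# Synthetic exponential delays (mean 20 ms) stand in for measured data.
import random

random.seed(0)

def fetch_time(n_parallel: int, k_needed: int) -> float:
    """Delay to collect k_needed responses out of n_parallel parallel reads."""
    samples = sorted(random.expovariate(1 / 20) for _ in range(n_parallel))
    return samples[k_needed - 1]      # k-th fastest response completes the read

trials = 5000
no_fec = sorted(fetch_time(4, 4) for _ in range(trials))   # need all 4 of 4
fec    = sorted(fetch_time(6, 4) for _ in range(trials))   # need any 4 of 6

p99 = int(trials * 0.99)
assert fec[p99] < no_fec[p99]   # redundancy shrinks the 99th-percentile delay
```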
  • Embodiments of the present invention use multiple categories of request and each category may use different FEC codes with different code dimension Kc.
  • Requests of category C can be served by (Nc-Kc) redundant read or write operations, in addition to the minimum Kc ones.
  • Nc is a time varying parameter updated on a per-request basis.
  • a feature of embodiments of the present invention is that it allows multiple categories of requests with various ranges of object sizes and delay distributions, and the amount of extra overhead for each category (governed by Nc) is adapted independently based on the system backlog.
  • embodiments of the present invention select an appropriate category for that client's requests. The number of redundant read or write operations for requests of different categories is then adjusted independently to deliver the desired performance.
  • the present invention also relates to apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
  • a machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer).
  • a machine-readable medium includes read-only memory ("ROM"); random access memory ("RAM"); magnetic disk storage media; optical storage media; flash memory devices; etc.
  • Embodiments of the present invention make use of erasure coding techniques to eliminate the long delay tails in key-value based storage systems.
  • requests arriving into the system are classified into different categories, where each category C is specified by a four-tuple <object size Sc, block size Bc, redundancy parameter Mc, type write/read>, depending on the object size (e.g., in bytes), Quality of Service (QoS) delay requirement, and whether it is a put/write request or a get/read request.
  • all requests belonging to the same category C have identical object size Sc (possibly after padding) and require the same type of operation (write or read). In one embodiment, they share similar QoS delay requirements as well.
  • Sc and Bc are fixed and hence Kc is fixed, but different categories may have different values of Kc.
  • the objects, starting from the smallest index value to the largest, are given as input blocks to an erasure encoder, where Kc is referred to as the dimension of the code.
  • the encoder then generates (Nc-Kc) output parity blocks of the same fixed size.
  • Mc + 1 is the maximum number of extra parity coded blocks allowed for category-C objects (i.e., Nc ≤ Kc + Mc + 1).
  • Nc is updated every time a new request of category C arrives. Adaptation of Nc for different categories is done independently.
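  • The per-request adaptation of Nc can be sketched with backlog thresholds, in the spirit of the thresholding Fc() functions of Figures 9 and 10. The threshold values below are assumed for illustration; the patent derives them from delay statistics:

```python
# Hedged sketch of per-category Nc adaptation: Nc starts at Kc + Mc + 1
# under light load and drops one step per backlog threshold crossed,
# down to Nc == Kc (no redundancy). Threshold values are assumptions.

def update_nc(kc: int, mc: int, backlog: int, thresholds: list) -> int:
    """Map the current queue backlog Q to Nc, with Kc <= Nc <= Kc + Mc + 1."""
    extra = mc + 1
    for t in thresholds:              # thresholds sorted ascending
        if backlog >= t:
            extra -= 1                # heavier backlog -> less redundancy
    return kc + max(0, extra)

# Categories adapt independently, each with its own thresholds:
thresholds_r = [5, 20, 80]            # hypothetical read-category thresholds
thresholds_w = [10, 40]               # hypothetical write-category thresholds

assert update_nc(kc=4, mc=2, backlog=0,   thresholds=thresholds_r) == 7  # Kc+Mc+1
assert update_nc(kc=4, mc=2, backlog=25,  thresholds=thresholds_r) == 5
assert update_nc(kc=4, mc=2, backlog=999, thresholds=thresholds_r) == 4  # no coding
```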
  • the store client stores the original Kc source blocks and (Nc-Kc) parity blocks separately using Nc ordered unique keys in a storage facility (e.g., public storage facility, private storage facility, etc.).
  • When a store client needs to put/write or get/read the large object, it sends Nc parallel put/write or get/read requests using unique keys for a subset of Nc source blocks and/or parity blocks associated with the large object.
  • Once the store client receives Kc valid responses to any subset of these Nc requests, it considers the operation completed. If it was a get/read request, the store client reconstructs the original Kc smaller objects through erasure decoding.
  • the order of the keys is used to determine the order of source blocks and parity blocks in the code word generated by the erasure encoder.
  • the erasure coding in the system is not used to increase storage reliability or to handle packet losses, but to improve the delay performance at low storage and communication overhead.
  • the value Nc represents the amount of redundancy from using erasure codes and it is used to maintain a robust balance between system throughput and potential delay improvement by using erasure codes.
  • the store client issues a minimal number of new put/write or get/read requests for a subset of Nc keys that are sufficient to recover all the objects in the originally requested set.
  • Figure 3 is a block diagram of one embodiment of a storage system. Referring to Figure 3, in one embodiment, there are three main components to the architecture: an application 301, a key-value store client 302, and a distributed key-value store 303.
  • Application 301 is the consumer of the storage system.
  • Application 301 generates data to be stored in the backend storage (e.g., distributed key-value store 303) and downloads the data stored in the backend storage.
  • the backend storage e.g., distributed key-value store 303
  • Key-value store client 302 interfaces application 301 with the backend storage, namely distributed key-value store 303.
  • key-value store client 302 provides an API to application 301 to receive and respond back to the requests of application 301. These requests include read and write requests and responses.
  • the read request specifies a filename and the write request specifies a filename and the data object being stored.
  • the read response specifies a read response and the data object that was requested, and the write response specifies a response indicating that the data object has or has not been successfully stored in the backend storage.
  • key-value store client 302 uses APIs provided by the backend storage to issue subsequent requests to the backend storage in order to resolve requests from application 301 before responding back to application 301.
  • the read requests to key-value store 303 take the form Read<Key-1> and the write requests to key-value store 303 take the form Write<Key-1, value, metadata>, where Key-1 specifies the location in key-value store 303, "value" specifies the data object being written, and "metadata" specifies metadata associated with the data object being stored.
  • the read responses from key-value store 303 take the form Read<response, value> and the write responses from key-value store 303 take the form Write<response>, where "response" specifies whether the operation was successfully performed, and "value" specifies the data object being read from key-value store 303.
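  • These request/response shapes can be sketched as plain Python structures. The in-memory dict below stands in for the real backend (e.g., Amazon S3 accessed over its own API), and the `"obj42:1"` key-naming scheme is an assumption for illustration:

```python
# Sketch of the backend request/response shapes described above. The dict
# `store` is a stand-in for distributed key-value store 303; the key names
# used here are hypothetical.

store = {}

def write(key: str, value: bytes, metadata: dict) -> dict:
    """Write<Key, value, metadata> -> Write<response>."""
    store[key] = (value, metadata)
    return {"response": "ok"}

def read(key: str) -> dict:
    """Read<Key> -> Read<response, value>."""
    if key in store:
        return {"response": "ok", "value": store[key][0]}
    return {"response": "not_found", "value": None}

# The first K keys hold the uncoded blocks, keys K+1..N the parity blocks:
assert write("obj42:1", b"block-1", {"k": 2, "n": 3})["response"] == "ok"
assert read("obj42:1")["value"] == b"block-1"
assert read("obj42:9")["response"] == "not_found"
```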
  • the value corresponds to the encoded version of part of the data object, e.g., one of the N coded blocks.
  • the first K keys correspond to the uncoded sequence of K blocks of a data object and the (K+1)th to Nth keys correspond to parity blocks associated with a data object.
  • the metadata is only read if it is not stored locally in memory or on disk at key-value store client 302. As will be described in greater detail below, key-value store client 302 returns a response to application 301 after receiving only K successful read/write replies.
  • key-value store client 302 has its own local disk and in-memory cache to store data of application 301 and to resolve requests of application 301.
  • key-value store client 302 also models the cumulative distribution function of delays for different packet ranges with and without applying FEC.
  • key-value store client 302 is also responsible for parallelization of read/write requests with the distributed storage backend.
  • Distributed key-value store 303 is the distributed storage backend.
  • key-value store 303 provides persistent, highly available and durable storage. To accomplish this, key-value store 303 uses replication, where multiple copies of the same object are stored in and accessed from different physical locations. In one embodiment, for increased durability with more storage efficiency, key-value store 303 uses FEC protection within (i.e., in conjunction with data striping) or across the data objects. Such features are transparent to application 301 as well as to key-value store client 302.
  • the processes performed by application 301 and key-value store client 302 run on the same physical machine. In another embodiment, they can be run on different physical machines and communicate directly or over a network.
  • Classifier 310, scheduler 320 and cloud performance monitor 330 are parts of key- value store client 302 and are used to specify how different categories of requests are handled and how to decide what FEC code (or number of parallel read/write tasks) is used for different requests to accommodate different arrival rates as well as different requests compositions.
  • Figure 6 is a flow diagram of one embodiment of a process for read request handling by the request handler
  • Figure 7 is a flow diagram of one embodiment of a process for write request handling by the request handler.
  • the processes are performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), firmware or a combination of two or more of them.
  • Request handler 400 extracts the unique object ID of the requested object ObjlD(Oi) from the incoming message itself.
  • the incoming message is stored in request queue 300 of Figure 3.
  • this first operation corresponds to processing block 600.
  • request handler 400 issues a mapping service using interface 421 to location mapper 410 with the unique ID of the requested object. If the object is locally stored (e.g., in-memory cache or local disk), then request handler 400 retrieves the data from local storage (not shown in Figure 4) and sends the object to user application 301. These operations correspond to processing blocks 610 and 691 in Figure 6.
  • request handler 400 retrieves a tuple in the form of <i, N, K> received from scheduler 320.
  • Location mapper 410 returns an ordered set of keys (Key1, ..., KeyK+Mc+1) corresponding to the object (Oi) and which data store to be used per key (processing block 621 in Figure 6). This ordered set of keys points to the source blocks and parity blocks for the requested object.
  • Request handler 400 selects any subset S of N keys in {Key1, ..., KeyK+Mc+1}. This corresponds to processing block 630 in Figure 6.
  • request handler 400 prepares parallel read tasks (processing block 640 in Figure 6), where each task is for one unique key corresponding to a source or parity block. In one embodiment, each task is self-descriptive in the sense that which cloud location should be used for the job is included in its job description. All the parallel read tasks corresponding to the same object are passed as a batch to task queue 440. In one embodiment, the jobs in a batch are not interleaved with the jobs that belong to other batches.
  • interface 404 serves two purposes: (i) passing the actual tasks and (ii) passing tasks or batch attributes.
  • request handler 400 can cancel an individual task or all the tasks of a given batch by changing the task or batch attribute to "cancelled". If the task is still in its queue, task queue 440 deletes the task. Otherwise, task queue 440 issues an abort command to the thread processing the task.
  • Each of worker threads 450 and 460 serve one task at a time and when their current task finishes (handed to request handler 400 to execute processing block 650 in Figure 6), they ask for a new task from task queue 440.
  • Task queue 440 hands the head of line task to the requesting worker thread.
  • worker threads 450 and 460 can access different cloud locations using the task description. The APIs to access these different clouds can be different, and thus the location information dictates which cloud instance should be used and which API call has to be issued. If an abort command is received for an ongoing task, then the worker thread can cancel its current task, return the task status to request handler 400, and request a new task.
  • request handler 400 passes the source blocks and parity blocks (K in total) of a given batch to FEC decoder 430 as they are returned by the worker threads. If a returned block is a source block, it is also kept by request handler 400. As it recovers any missing source blocks (those not yet received), FEC decoder 430 passes the recovered source blocks to request handler 400 (processing blocks 660 and 670 in Figure 6). Once it receives all the source blocks of the requested object (processing blocks 680 and 690), request handler 400 sends the object Oi back to user application 301 (processing block 691) using interface 302.
  • request handler 400 issues a cancellation request to task queue 440 for the remaining jobs of the same batch (processing block 690), thereby cancelling the remaining jobs.
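  • The read flow above (issue N parallel read tasks, finish as soon as K succeed, cancel the rest of the batch) can be sketched with a thread pool. The simulated fetch below stands in for worker threads 450/460 hitting the cloud; note that `Future.cancel()` is best-effort (it only stops queued tasks), whereas the patent's worker threads honor an explicit abort command:

```python
# Sketch of processing blocks 630-691: N parallel reads, return after the
# first K responses, best-effort cancellation of the remaining batch.
import concurrent.futures as cf
import random
import time

random.seed(3)

def fetch_block(key: str) -> bytes:
    time.sleep(random.uniform(0.0, 0.05))   # simulated cloud read latency
    return b"block:" + key.encode()

def parallel_read(keys: list, k: int) -> list:
    """Return the first k completed blocks out of len(keys) parallel reads."""
    results = []
    with cf.ThreadPoolExecutor(max_workers=len(keys)) as pool:
        futures = [pool.submit(fetch_block, key) for key in keys]
        for fut in cf.as_completed(futures):
            results.append(fut.result())
            if len(results) == k:           # K responses are enough...
                for f in futures:
                    f.cancel()              # ...cancel what has not started yet
                break
    return results

blocks = parallel_read(["o1:%d" % i for i in range(6)], k=4)
assert len(blocks) == 4
```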
  • Request handler 400 extracts the unique ID i of the object (processing block 700 in Figure 7).
  • request handler 400 checks whether the object is locally cached/stored.
  • request handler 400 retrieves a tuple in the form of <i, N, K> received from the scheduler (processing block 710 in Figure 7).
  • request handler 400 divides the object into K source blocks and asks FEC encoder 430 to generate N-K parity blocks
  • Request handler 400 generates an ordered set S of unique keys (Key1, ..., KeyN) to label each source block and parity block to be written as part of the same write operation.
  • this metadata is persistently stored locally as well as tagged to the write tasks (i.e., the public cloud will store the metadata as well).
  • Request handler 400 caches the excess parity blocks and creates a new batch of tasks, where each task is a write request for a unique key in the newly generated ordered set. This batch of tasks is passed to task queue 440. These operations correspond to processing block 770 in Figure 7.
  • the jobs in a batch are not interleaved with the jobs that belong to other batches. The jobs can be interleaved if they are demoted to "background job" status.
  • request handler 400 demotes an individual job or all the jobs of a given batch by changing the job or batch attribute to "background job", and then, higher priority jobs can be moved in front of these background jobs. Jobs of different batches that are demoted to background traffic are processed on a first come first serve basis. The change of attribute is done through the interface 404.
  • worker threads 450 and 460 serve one task at a time and when their current task finishes, they ask for a new task from task queue 440.
  • Task queue 440 hands the head of line task to the requesting worker thread.
  • Worker threads 450 and 460 can access different cloud locations using the task description. The APIs to access these different clouds can be different; thus, the location information dictates which cloud instance should be used and which API call has to be issued.
  • Request handler 400 listens for successful write responses (e.g., ACKs or Acknowledgements) from worker threads 450 and 460 of a given batch. After receiving a sufficient number (K) of successful write responses (i.e., ACKs), request handler 400 sends a success response (e.g., ACK) back to user application 301 that originally issued the write request.
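  • The acknowledgement rule above (ACK the application once K of the N parallel block writes have succeeded) can be sketched as follows. The simulated write latency is an assumption; in the patent, the remaining writes are demoted to background jobs rather than aborted, which the sketch mirrors by letting them finish:

```python
# Sketch of the write acknowledgement rule: the client ACKs the application
# once K of the N parallel block writes report success; remaining writes
# complete in the background. Latencies below are simulated.
import concurrent.futures as cf
import random
import time

random.seed(7)

def write_block(key: str, value: bytes) -> str:
    time.sleep(random.uniform(0.0, 0.05))   # simulated cloud write latency
    return "ACK"

def parallel_write(blocks: dict, k: int) -> str:
    """blocks maps key -> bytes; return success once k write ACKs arrive."""
    acks = 0
    with cf.ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        futures = [pool.submit(write_block, key, val)
                   for key, val in blocks.items()]
        for fut in cf.as_completed(futures):
            if fut.result() == "ACK":
                acks += 1
            if acks == k:
                # remaining writes keep running like demoted background jobs
                return "ACK"
    return "FAIL"

blocks = {"o7:%d" % i: b"data" for i in range(6)}
assert parallel_write(blocks, k=4) == "ACK"
```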
  • User application 301 is the consumer of the storage system. It generates data to be stored in the backend storage and it downloads the data stored in the backend storage.
  • request queue 300 is the module that buffers read or write requests that have been received from the application but not yet processed by the key-value store client.
  • Request queue 300 provides an API to classifier 310 to pull meta information of requests, such as, for example, the type of operation (get/read or put/write), object/file size, QoS requirement, etc.
  • Request queue 300 also provides API to scheduler 320 to pull system backlog information.
  • the system backlog information includes the instantaneous number of requests waiting in request queue 300, or a moving average of that number taken over the 10 most recent request arrivals.
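  • The moving-average form of the backlog can be sketched directly from that description (window of the 10 most recent arrivals; the class name is mine):

```python
# Minimal sketch of the moving-average backlog Q over the most recent
# request arrivals (default window of 10, per the description above).
from collections import deque

class BacklogMonitor:
    def __init__(self, window: int = 10):
        self.samples = deque(maxlen=window)   # queue length at recent arrivals

    def on_arrival(self, queue_len: int) -> float:
        """Record the queue length at a request arrival; return the average Q."""
        self.samples.append(queue_len)
        return sum(self.samples) / len(self.samples)

mon = BacklogMonitor(window=3)
assert mon.on_arrival(4) == 4.0
assert mon.on_arrival(8) == 6.0
assert mon.on_arrival(0) == 4.0
assert mon.on_arrival(6) == 14 / 3            # oldest sample (4) rolled off
```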
  • classifier 310 is the module that assigns requests to categories.
  • each category C is specified by a four-tuple <object size Sc, block size Bc, redundancy parameter Mc, type write/read>.
  • Mc captures the amount of extra storage and communication cost allowed for category-C requests: Mc + 1 is the maximum number of extra parity blocks available for category-C objects. The larger Mc is, the more parallel tasks can be created for a request, and hence the more likely the request is to experience lower delay. For this reason, in one embodiment, requests with lower delay requirements are assigned to categories with a larger value of Mc.
  • scheduler 320 is the module that determines the value of Nc for each request based on (1) the category C the request belongs to, (2) backlog information provided by request queue 300, and (3) delay statistics provided by cloud performance monitor 330. In one embodiment, scheduler 320 then passes the tuple <Nc, Kc> to the request handler.
  • the backlog information of request queue 300, denoted as Q, is the instantaneous backlog size, i.e., the number of requests waiting in request queue 300. In another embodiment, Q is a moving average of the backlog size.
  • cloud performance monitor (CPM) 330 is a process that monitors delay performance.
  • CPM 330 collects logs of delays for tasks belonging to different categories through an API provided by the key-value store client. CPM 330 processes these logs to provide statistics for the delay performance of different task types and object sizes, which are used during the procedure of determining Nc. In one embodiment, the statistics CPM 330 provides are the mean and standard deviation of the collected delays of tasks of each category. In another embodiment, the delay of each category is associated with a statistical model determined in advance. This may be done in a manner well-known in the art. One way to construct such a model is to gather delay measurements and look at the CDF; a delay model may then be selected manually. For example, the delay can be modeled as a non-negative fixed constant Δ plus an exponential random variable X. Another way to generate the model may be to use data mining techniques.
  • CPM 330 computes model parameters that fit the collected delays best (possibly using machine learning techniques) and provides these model parameters to scheduler 320 as delay statistics. In another embodiment, no statistical model is given a priori and instead data mining techniques are used to discover structure in the logged delays and provide the models on-the-fly.
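  • For the shifted-exponential model mentioned above (a fixed constant Δ plus an exponential X), the model parameters can be fit from logged delays by the method of moments: mean = Δ + 1/λ and stddev = 1/λ, so Δ = mean − stddev and λ = 1/stddev. The sketch below is illustrative, not CPM 330's actual fitting code, and uses synthetic delays with an assumed ground truth:

```python
# Method-of-moments fit of the shifted-exponential delay model
# delay = Delta + X, X ~ Exp(rate). Synthetic data; illustrative only.
import random
import statistics

random.seed(5)
true_delta, true_rate = 30.0, 1 / 15          # ms; assumed ground truth
delays = [true_delta + random.expovariate(true_rate) for _ in range(20000)]

mean = statistics.fmean(delays)
std = statistics.pstdev(delays)
delta_hat = mean - std                        # estimate of the fixed constant
rate_hat = 1 / std                            # estimate of the exponential rate

assert abs(delta_hat - true_delta) < 2.0      # recovers Delta ~ 30 ms
assert abs(1 / rate_hat - 15.0) < 2.0         # recovers mean 1/lambda ~ 15 ms
```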
  • key-value store client 302 including request queue 300, classifier 310, scheduler 320, and CPM 330, interfaces user application 301 with the backend storage. In one embodiment, the key-value store client provides an API to user application 301 to receive and respond back to requests of user application 301.
  • key-value store client 302 creates subsequent read or write requests for smaller objects using FEC. For purposes herein, such requests are referred to as tasks in order to distinguish them from the original requests issued by user application 301. In one embodiment, tasks are served by worker threads 450 and 460 in order to resolve requests from user application 301; key-value store client 302 then responds back to user application 301. In one embodiment, key-value store client 302 has its own local disk and in-memory cache to store data of user application 301 and to resolve requests of user application 301. In one embodiment, key-value store client 302 is responsible for parallelization of read/write requests with the distributed storage backend. In one embodiment, key-value store client 302 is also responsible for collecting delays for tasks belonging to different categories and provides delay statistics to CPM 330 through the corresponding API.
  • worker threads 450 and 460 are the units that carry out read or write tasks with the backend storage using the provided APIs. In one embodiment, each thread can read or write one object from or to the backend storage system at a time.
  • a thread is active if it is in the process of serving a read or write operation with the backend storage and waiting for the response. Otherwise, the thread is not serving any task and is said to be idle. Also, if Kc tasks for the same category-C request have been completed, any ongoing task for the same request can be terminated preemptively and the corresponding thread becomes idle.
  • the thread passes the response to key-value store client 302.
  • task queue 440 is a module that buffers read or write tasks that have been generated by key-value store client 302 but not yet served by any of worker threads 450 and 460.
  • distributed key-value store 303 (e.g., private cloud 470, public cloud 490, etc.)
  • key-value store 303 provides persistent, highly available and durable storage. To accomplish this, it uses replication where multiple copies of the same object are stored in and accessed from different physical locations. For increased durability with more storage efficiency, the store itself can use FEC protection within (i.e., in conjunction with data striping) or across the data objects. Such features are transparent to the application as well as to key-value store client 302 disclosed in this invention.
  • all components except for distributed key-value store 303 run on the same physical machine. In another embodiment, they can be run on different physical machines and communicate over a network.
  • key-value store client 302 assigns categories to application requests and determines the FEC code used to serve each application request.
  • Figure 5 is a flow diagram of one embodiment of a process for request handling performed by a classifier and a scheduler. The process in Figure 5 is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), firmware, or a combination of them.
  • Request queue 300 extracts the unique request ID i, type of operation (read or write) and delay target D of the object from the incoming request. This information is then sent to classifier 310 through interface 402. If the type of operation is read, request queue 300 also extracts the unique object ID ObjID(Oi) and sends it to classifier 310. If the type of operation is write, request queue 300 also sends the size of the object size(Oi) to classifier 310. In Figure 5, these operations correspond to processing blocks 500 and 510. 2. Classifier 310 determines which category C the request for object Oi belongs to.
  • category C is chosen such that it matches the way Oi is stored in the system (i.e., matching Sc, Bc, Mc). If the request is to write the object, then classifier 310 picks a category C such that Sc ≥ size(Oi). Once the category C is decided, classifier 310 passes a tuple <request id i, category C> to scheduler 320 through interface 411. In Figure 5, this operation corresponds to processing block 520.
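By way of illustration, the write-side classification rule above (pick a category C with Sc ≥ size(Oi)) might be sketched as follows. The category table is hypothetical; the document does not fix concrete sizes:

```python
# Hypothetical category table; the text only requires Sc >= size(Oi) for writes.
CATEGORIES = {
    # name: (object size Sc in bytes, block size Bc in bytes, redundancy cap Mc)
    "W-small": (4 * 1024, 1024, 2),
    "W-medium": (64 * 1024, 8192, 3),
    "W-large": (1024 * 1024, 65536, 4),
}

def classify_write(obj_size):
    """Pick the smallest category C with Sc >= size(Oi); the object is
    later padded up to Sc.  Raises if no category is large enough."""
    feasible = [(sc, name) for name, (sc, _bc, _mc) in CATEGORIES.items()
                if sc >= obj_size]
    if not feasible:
        raise ValueError("object larger than any configured category")
    return min(feasible)[1]
```

Choosing the smallest feasible Sc minimizes padding overhead, which matches the intent of matching the category to the object size.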
  • upon receiving a tuple <request id i, category C>, scheduler 320 requests the latest backlog statistic Q from request queue 300.
  • Q can be the instantaneous backlog size, i.e. the number of requests buffered in request queue 300.
  • Q can be a moving average of the backlog size.
  • Q can be a vector specifying the number of read and write requests of different object sizes.
  • scheduler 320 passes a tuple <request id i, Nc, Kc> to request handler 400 using interface 441. These method steps correspond to processing blocks 530, 540 and 550 in Figure 5.
  • one implementation of Fc is thresholding: each category C is associated with a set of Mc+1 thresholds {Tc,0, Tc,1, ..., Tc,Mc} such that Tc,0 > Tc,1 > ... > Tc,Mc > 0.
  • Q is the backlog size (e.g., instantaneous or moving averaged). Then Fc(Q) is determined by which of the ranges defined by consecutive thresholds Q falls into.
  • Figure 9 illustrates an example of the thresholding Fc() functions, with two categories R (read) and W (write). NR is decided based on which range between the thresholds {TR,i} Q falls into. Similarly for NW with thresholds {TW,i}.
  • Figure 10 is a flow diagram of one embodiment of a process for deciding Nc given a set of Mc+1 thresholds as described above.
  • the process in Figure 10 is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
  • scheduler 320, upon receiving the information of a category-C request for object O from classifier 310 (processing block 1000), reads the set of thresholds {Tc,0, ..., Tc,Mc} and Mc associated with category C (processing block 1010). Scheduler 320 also reads the latest backlog statistic Q from request queue 300 (processing block 1020).
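The thresholding Fc() described above can be sketched as follows. The counting rule (one extra coded block for every threshold the backlog falls below) is an assumed reading of Figure 9; it is consistent with the decreasing thresholds and with the bound Nc ≤ Kc+Mc+1 stated later in the description:

```python
def choose_n(q, k_c, thresholds):
    """Thresholding Fc(): return Nc given backlog statistic Q.

    `thresholds` holds the Mc+1 values Tc,0 > Tc,1 > ... > Tc,Mc > 0.
    Assumed rule: one redundant coded block for every threshold that Q
    falls below, so a large backlog (Q >= Tc,0) yields Nc = Kc (no
    redundancy) and a small backlog (Q < Tc,Mc) yields the maximum
    Nc = Kc + Mc + 1."""
    assert all(a > b for a, b in zip(thresholds, thresholds[1:]))
    return k_c + sum(1 for t in thresholds if q < t)
```

For example, with thresholds {100, 50, 10} and Kc = 4, a backlog of 200 gives Nc = 4, while a backlog of 5 gives the maximum Nc = 7.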
  • Cloud Performance Monitor 330 provides information to scheduler 320 for determining and adjusting Fc() for each category according to delay statistics it collects from request handler 400.
  • worker threads 450 and 460 create a log for successfully completed tasks with information on object size, task type (read or write), sending time, cloud location, and round trip time delay (i.e., from the time the job is scheduled until the time a successful response is received).
  • Figure 11 shows how CPM 330 logs this information in a database.
  • CPM 330 processes these logs to provide statistics for delay performance of different task types and object sizes, which are used for determining Fc() functions.
  • the processing can be computing the mean and standard deviation of the delay for each task type and object size. This is in fact what was done in the example of Figure 13.
  • the thresholds for the thresholding Fc() functions described above are found in the following way.
  • the per-task round trip time delay for each category C is modeled by a random variable in the form of Ac+Xc, where Ac is a nonnegative constant and Xc is an exponentially distributed random variable with mean 1/μc. Suppose only requests of category C arrive and the arrivals follow a Poisson process of rate λ.
  • request handler 400 fetches a request from request queue 300 if and only if task queue 440 becomes empty and at least one of worker threads 450 and 460 is idle.
  • Di,queue: the expected time a request spends in request queue 300.
  • Di,service: the time between when a request is fetched by request handler 400 and when it is completed and responded to the application.
  • Figure 12 is a flow diagram of one embodiment for computing the set of thresholds for F c () according to the above description.
  • the process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
  • CPM 330 first reads the values of Kc and Mc from the specification of category C as well as estimates of Ac and μc computed from logs gathered from worker threads. This first step corresponds to block 1200 in Figure 12.
  • in order to determine the Fc() functions, CPM 330 requires knowledge of statistics of the delay performance of the cloud storage systems.
  • delay performance statistics are collected in offline measurements a priori.
  • Fc() functions are determined a priori and used statically throughout the execution of the system.
  • delay statistics of the cloud storage systems are collected online and get updated every time a worker thread finishes serving one task and produces a new log entry accordingly.
  • CPM 330 needs to recompute the Fc() functions periodically in order to keep track of variations in the performance of the cloud storage system. Given that the performance of the cloud storage system may change over time unpredictably, the online approach is preferred.
  • Figure 13 is a flow diagram of one embodiment of a process for performing online estimation of Ac and μc for the thresholding Fc() described earlier, using exponential moving averaging.
  • the process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
  • processing logic initializes two variables Ec and Vc to some nonnegative values. Processing logic also reads constants αc and βc from the system configuration. Both αc and βc are in the range [0, 1]. This initial operation corresponds to processing block 1300 in Figure 13.
  • processing logic reads the round trip time delay d of that entry. This operation corresponds to block 1310 in Figure 13.
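A sketch of the exponential moving averaging of blocks 1300-1310 is below. The exact update rule and the mapping from (Ec, Vc) to (Ac, μc) are not spelled out in this excerpt, so a TCP-RTT-style estimator and the shifted-exponential identities (mean = Ac + 1/μc, deviation = 1/μc) are assumed:

```python
def update_estimates(e, v, d, alpha, beta):
    """One EWMA step over a new round-trip delay sample d (a sketch of
    processing blocks 1300-1310; the exact update rule is not given in
    this excerpt, so a TCP-RTT-style estimator is assumed).

    e -- running mean-delay estimate Ec
    v -- running mean-absolute-deviation estimate Vc
    alpha, beta -- smoothing constants in [0, 1]"""
    v = (1 - beta) * v + beta * abs(d - e)   # deviation uses the old mean
    e = (1 - alpha) * e + alpha * d
    return e, v

def shifted_exponential_params(e, v):
    """Map (Ec, Vc) to the delay model Ac + Xc.  For an exponential Xc
    the mean and standard deviation coincide, so the deviation estimate
    is taken as 1/mu and the constant part as Ac = Ec - Vc (clamped to
    zero, since the model requires a nonnegative Ac)."""
    mu = 1.0 / v
    a = max(0.0, e - v)
    return a, mu
```

The deviation is updated before the mean so that it measures the sample against the previous estimate, the convention used in classic RTT estimation.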
  • request queue 300 comprises a first-in first-out (FIFO) queue, where the read or write requests are buffered.
  • request queue 300 can be implemented as a priority queue, in which requests with lower delay requirement are given strict priority and placed at the head of the request queue.
  • the head of the line request is removed from request queue 300 and transferred to request handler 400 through interface 401, when a "fetch request" message is received from request handler 400. It is up to request handler 400 to decide when to fetch a request from request queue 300.
  • the preference is to fetch when task queue 440 becomes empty and at least one worker thread (450 and 460) is idle.
  • request queue 300 transfers the head of the line request to request handler 400 using interface 401 and removes it from the queue.
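The priority-queue variant of request queue 300 described above might be sketched with a binary heap, ordering requests by their delay requirement and breaking ties in FIFO order:

```python
import heapq
import itertools

class RequestQueue:
    """Sketch of request queue 300 as a priority queue: requests with a
    lower delay target D are given strict priority; the running counter
    preserves FIFO order among requests with equal targets."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def push(self, delay_target, request):
        heapq.heappush(self._heap, (delay_target, next(self._seq), request))

    def fetch(self):
        """Hand the head-of-line request to request handler 400."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        # backlog statistic Q reported to scheduler 320
        return len(self._heap)
```

Setting all delay targets equal degenerates this into the plain FIFO queue of the first embodiment.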
  • request handler 400 looks up a tuple in the form of <i,N,K> received from scheduler 320. If the request is to read an object, this information specifies that the requested object has been divided into K source blocks and at least N-K parity blocks have been generated. Then request handler 400 creates N read tasks, each for reading one of the source or parity blocks corresponding to the requested object. These tasks are then inserted into task queue 440. If the request is to write an object, request handler 400 divides the object into K source blocks and generates N-K parity blocks.
  • request handler 400 creates N write tasks, each for writing one of the source or parity blocks to the cloud storage system. These tasks are then inserted into task queue 440. As soon as any K of these tasks have completed, the original request is considered completed and request handler 400 sends a success response (e.g., ACK) to the application using interface 302. In the case the request is to read, the response contains the requested object obtained from FEC decoder 430. Details of how request handler 400 serves read and write requests are given above.
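The write path just described (divide into K source blocks, generate N-K parity blocks, one task and one key per block) can be sketched as follows. The single-XOR-parity encoder and the key naming scheme are illustrative stand-ins; the actual FEC encoder and key format are not fixed by this excerpt:

```python
from functools import reduce

def xor_parity_encoder(sources, n):
    """Toy encoder supporting one parity block (n = k + 1): the parity
    is the bytewise XOR of the K source blocks, so any K of the N
    blocks recover the object.  A real deployment would use a
    Reed-Solomon code for general (N, K)."""
    assert n == len(sources) + 1
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*sources))
    return list(sources) + [parity]

def make_write_tasks(obj_id, data, n, k, encoder=xor_parity_encoder):
    """Turn one write request into N block-write tasks, one distinct
    key per block (the key naming scheme is hypothetical); any K
    completed writes finish the request."""
    block_size = -(-len(data) // k)              # ceil(len(data) / k)
    padded = data.ljust(block_size * k, b"\0")   # pad the object up to Sc
    sources = [padded[i * block_size:(i + 1) * block_size] for i in range(k)]
    blocks = encoder(sources, n)                 # K source + N-K parity blocks
    return [("%s.block%d" % (obj_id, j), blk) for j, blk in enumerate(blocks)]
```

Each returned (key, block) pair corresponds to one task inserted into task queue 440.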
  • task queue 440 comprises a first-in first-out (FIFO) queue, where the read or write tasks that belong to the same application request are put in one batch with no interleaving with jobs that belong to other FEC batches.
  • Individual worker threads serve one task at a time and when any thread becomes idle, it gets the task waiting at the head of the task queue.
  • Figure 8 depicts these parallel threads that execute read/write tasks and obtain new tasks from task queue 440 when they are not servicing the current task. When there is congestion, i.e., there are more tasks waiting in the task queue than the idle threads, the delay performance worsens.
  • requests with lower delay requirement are given strict priority and placed at the head of task queue 440.
  • some threads can be pooled together to serve only the high priority jobs or can be used in preemptive mode (i.e., low priority job is stopped or cancelled to serve the high priority job).
  • Figure 14 is a flow diagram of one embodiment of a process performed by a storage controller, such as a store client.
  • the process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
  • the process begins by computing a set of thresholds for each category into which a request can be classified (processing block 1401). In one embodiment, the thresholds are based on a model of delay statistics of different types of portions of data objects.
  • Processing logic classifies the request into a category (processing block 1403).
  • classifying the request into the category is based on whether the request is for a write operation or a read operation, file size of the object, and size of the K portions.
  • Processing logic also divides the data object into K portions, where K is an integer, and assigns a distinct key to each of the K portions (processing block 1404).
  • Processing logic selects an FEC coding rate based on backlog associated with at least one queue (processing block 1406).
  • the at least one queue comprises a first queue into which requests to the key-value based storage are received.
  • selecting the FEC coding rate is based on the object's size.
  • selecting an FEC coding rate is based on Quality of Service (QoS) requirements associated with the request.
  • selecting the FEC coding rate comprises selecting N based on the category. In one embodiment, selecting the FEC rate comprises comparing a backlog statistic to one or more thresholds in the set of thresholds to determine N.
  • N has been computed as a function of a backlog statistic.
  • the backlog statistic is based on a number of download and upload jobs waiting to be started for at least one category. In one embodiment, the backlog statistic is based on a total number of download and upload jobs waiting to be started for all categories into which requests can be classified.
  • After selecting the FEC rate, processing logic applies FEC coding based on the FEC rate to the K portions to create N FEC coded data blocks, where N is an integer greater than or equal to K. In one embodiment, processing logic assigns a distinct key to each of the N blocks of data resulting from applying the erasure coding to the K portions (processing block 1408) and orders the keys assigned to the K portions and the keys assigned to the N blocks (processing block 1409).
  • processing logic sends the N blocks of data using separate transfers to the storage system (processing block 1410).
  • sending the N blocks of data over distinct connections to the storage system comprises sending at least two of the N blocks in parallel over two of the distinct connections.
  • sending the N blocks of data using N separate transfers to the storage system comprises sending all N blocks in parallel on separate connections to the key-value store, including cancelling any of the N separate transfers that have not completed successfully after K of the N separate transfers have completed successfully.
  • when the object is requested, processing logic generates a plurality of individual requests, each for requesting one of the N blocks of data from storage (processing block 1411), applies erasure decoding as each of the N blocks is received (processing block 1412), cancels the N-K requests that remain outstanding after receiving K out of N blocks (processing block 1413), and returns the object to a requester (processing block 1414).
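Processing blocks 1411-1414 (issue N block reads, decode after any K arrive, cancel the stragglers) can be sketched with a thread pool. Here `fetch` and `decode` are placeholders for the store API and FEC decoder 430:

```python
import concurrent.futures as cf

def read_with_fec(keys, k, fetch, decode):
    """Issue one request per block key, decode as soon as any K blocks
    arrive, and cancel the requests still outstanding (a sketch of
    processing blocks 1411-1414).  `fetch(key)` and
    `decode(index_to_block)` are placeholders; note that
    Future.cancel() only stops tasks that have not yet started."""
    received = {}
    with cf.ThreadPoolExecutor(max_workers=len(keys)) as pool:
        futures = {pool.submit(fetch, key): i for i, key in enumerate(keys)}
        for fut in cf.as_completed(futures):
            received[futures[fut]] = fut.result()
            if len(received) >= k:       # any K blocks suffice to decode
                for other in futures:
                    other.cancel()       # drop the N-K stragglers
                break
    return decode(received)
```

With K = 1 and N identical replicas this degenerates into plain replication, the scheme the backend store itself uses for durability.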
  • Figure 15 depicts a block diagram of a storage gateway that may be used to access a backend storage system, such as a cloud-based storage system. Such access to the backend storage system may be over a network (e.g., wide-area network, local area network, internet, etc.).
  • as a storage gateway, the system can interface clients running user applications to backend storage systems. Such clients may be coupled directly to the storage gateway or may be coupled to it over a network (e.g., wide-area network, local area network, internet, etc.).
  • the system depicted in Figure 15 may also be a client device that performs the operations described above or interacts with a storage gateway to read or write data objects.
  • the storage gateway of Figure 15 executes and performs the operations associated with the application shown in Figure 4.
  • storage gateway 1510 includes a bus 1512 to interconnect subsystems of storage gateway 1510, such as a processor 1514, a system memory 1517 (e.g., RAM, ROM, etc.), an input/output controller 1518, an external device, such as a display screen 1524 via display adapter 1526, serial ports 1528 and 1530, a keyboard 1532 (interfaced with a keyboard controller 1533), a storage interface 1534, a floppy disk drive 1537 operative to receive a floppy disk 1538, a host bus adapter (HBA) interface card 1535A operative to connect with a Fibre Channel network 1590, a host bus adapter (HBA) interface card 1535B operative to connect to a SCSI bus 1539, and an optical disk drive 1540.
  • mouse 1546 or other point-and-click device, coupled to bus 1512 via serial port 1528
  • modem 1547 coupled to bus 1512 via serial port 1530
  • network interface 1548 coupled directly to bus 1512.
  • Bus 1512 allows data communication between central processor 1514 and system memory 1517.
  • system memory 1517 (e.g., RAM)
  • the ROM or flash memory can contain, among other code, the Basic Input-Output system (BIOS) which controls basic hardware operation such as the interaction with peripheral components.
  • Applications resident with computer system 1510 are generally stored on and accessed via a computer readable medium, such as a hard disk drive (e.g., fixed disk 1544), an optical drive (e.g., optical drive 1540), a floppy disk unit 1537, or other storage medium.
  • Storage interface 1534, as with the other storage interfaces of computer system 1510, can connect to a standard computer-readable medium for storage and/or retrieval of information, such as fixed disk drive 1544.
  • Fixed disk drive 1544 may be a part of computer system 1510 or may be separate and accessed through other interface systems.
  • Modem 1547 may provide a direct connection to a backend storage system or a client via a telephone link or to the Internet via an internet service provider (ISP).
  • Network interface 1548 may provide a direct connection to a backend storage system and/or a client.
  • Network interface 1548 may provide a direct connection to a backend storage system and/or a client via a direct network link to the Internet via a POP (point of presence).
  • Network interface 1548 may provide such connection using wireless techniques, including digital cellular telephone connection, a packet connection, digital satellite data connection or the like.
  • Code to implement the storage gateway operations described herein can be stored in computer-readable storage media such as one or more of system memory 1517, fixed disk 1544, optical disk 1542, or floppy disk 1538.
  • the operating system provided on computer system 1510 may be MS-DOS®, MS-WINDOWS®, OS/2®, UNIX®, Linux®, or another known operating system.


Abstract

A method and apparatus is disclosed herein for code length adaptation for access to key-value based storage systems. In one embodiment, the method comprises receiving a data object and a request; dividing the data object into K portions, where K is an integer; selecting an FEC coding rate based on backlog associated with at least one queue; applying FEC coding based on the FEC rate set to the K portions to create N FEC coded data blocks, where N is an integer greater than or equal to K; and sending the N FEC coded data blocks to the storage system.

Description

A METHOD AND APPARATUS FOR CODE LENGTH ADAPTATION FOR ACCESS TO KEY-VALUE BASED CLOUD STORAGE SYSTEMS
PRIORITY
[0001] The present patent application claims priority to and incorporates by reference the corresponding provisional patent application serial no. 61/733,339, titled, "A Method and
Apparatus for Code Length Adaptation for Low Delay Access to Key-Value Based Cloud Storage Systems Using FEC Coding Techniques," filed on December 4, 2012.
FIELD OF THE INVENTION
[0002] Embodiments of the present invention relate to the field of storage systems; more particularly, embodiments of the present invention relate to the use of forward error correction (FEC) in the storage and retrieval of objects in storage systems.
BACKGROUND OF THE INVENTION
[0003] In public clouds such as Amazon's S3, the delay for a single read or write operation for small objects (e.g., less than or equal to 1 Kbyte) can be hundreds of milliseconds, while for medium size objects (e.g., > 1 Mbyte) delays can be on the order of seconds at the 99th and 99.9th percentiles. For cascaded operations where one transaction needs many reads and writes to the same storage facility, these delays can be unacceptably high. For video content that consists of many megabytes, how to use S3-type storage as the video archive while attaining small startup delays and no pauses for video playback has also become a critical issue.
[0004] Recently, the work "Codes Can Reduce Queueing Delay in Data Centers" appeared in IEEE ISIT 2012, and the work "Erasure Coding in Windows Azure Storage" appeared in USENIX ATC 2012. Figure 1 illustrates the system proposed in these papers for a read-only scenario. Every file to be read is first divided into K equal-sized blocks and encoded into N coded blocks with an (N,K) FEC code. There are N servers and every server stores one different coded block for each file. To serve a file-read request, a request dispatcher issues K read operations to the first K servers that become idle for the K coded blocks stored on these servers. These K servers are kept active until the read operations for all K coded blocks have completed, and then they become idle again and can serve other file-read requests. The dispatcher then performs FEC decoding to recover the original file from the K coded blocks. In this system, every file is coded with a fixed (N,K) FEC code. In the first paper, every request is served by the minimum number of exactly K parallel read operations (from K servers), i.e., zero overhead is introduced. In the second paper, if a request is directed to read a coded chunk stored on a hot (heavily loaded) node, extra data is read from other servers in parallel to reconstruct the chunk stored on the hot node and provide it to the client. Thus, FEC is used for storage durability/availability purposes while still trying to minimize the amount of data read.
[0005] In content delivery systems with network coding, in which multiple unicasting, multicasting, or broadcasting sessions compete for network capacity, a common goal is to allocate network capacity among the different sessions such that a certain utility function (e.g., total throughput or weighted sum of logarithmic throughput) is maximized. A representative picture is shown in Figure 2. Referring to Figure 2, there are two multicasting sessions, S1 to {D1,D2} (curved arrows 201 and 202) and S2 to {D3,D4} (curved arrows 203 and 204), that compete for network capacity, in particular the capacity of link R1→R2. The utility is usually modeled as a concave function of the throughput received by each session, which is in turn determined by how much link capacity is allocated to that session on every link in the communication network. The system designer has to allocate link capacities for each session so that the overall utility is maximized, and control the unicast/multicast/broadcast rate for each session so that the amount of traffic injected conforms to the allocated link capacities. In such systems, throughput is the only concern, and although coding is also used, it is used merely to achieve the multicasting capacity of each session given the link capacity allocation. As a result, there is zero redundancy when using network coding.
SUMMARY OF THE INVENTION
[0006] A method and apparatus is disclosed herein for code length adaptation for access to key-value based storage systems. In one embodiment, the method comprises receiving a data object and a request; dividing the data object into K portions, where K is an integer; selecting an FEC coding rate based on backlog associated with at least one queue; applying FEC coding based on the FEC rate set to the K portions to create N FEC coded data blocks, where N is an integer greater than or equal to K; and sending the N FEC coded data blocks to the storage system.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The present invention will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the invention, which, however, should not be taken to limit the invention to the specific embodiments, but are for explanation and understanding only.
Figure 1 illustrates a prior art storage arrangement that uses FEC to reduce queuing delay in data centers.
Figure 2 illustrates an example of multiple multicasting sessions competing for network capacity.
Figure 3 is a block diagram of one embodiment of a storage system.
Figure 4 is a block diagram of one embodiment of an application executed by a store client.
Figure 5 is a flow diagram of one embodiment of a process for request handling performed by a classifier and a scheduler.
Figure 6 is a flow diagram of one embodiment of a process for read request handling by the request handler.
Figure 7 is a flow diagram of one embodiment of a process for write request handling by the request handler.
Figure 8 illustrates parallel threads that execute read/write (R/W) tasks and obtain new tasks from a task queue when they are done servicing a current task.
Figure 9 illustrates an example of thresholding Fc() functions, with two categories R (read) and W (write).
Figure 10 is a flow diagram of one embodiment of a process for determining Nc given a set of thresholds.
Figure 11 illustrates how raw data in terms of delay performance for different cloud locations are stored in a database.
Figure 12 is a flow diagram of one embodiment of a process for computing thresholds for a category C.
Figure 13 is a flow diagram of one embodiment of a process for estimating Ac and μc in an online fashion.
Figure 14 is a flow diagram of one embodiment of a process performed by a storage controller, such as a store client.
Figure 15 depicts a block diagram of a storage gateway or a client device.
DETAILED DESCRIPTION OF THE PRESENT INVENTION
[0008] Embodiments of the present invention include methods and apparatuses that adaptively adjust the level of code redundancy to provide a robust throughput-delay tradeoff when using FEC codes for delay improvement in storing and retrieving data objects, including videos, images, documents, meta-data, etc., in public cloud-based storage such as, for example, Amazon S3, or in private cloud-based storage systems (potentially of different size and different delay requirements). At a low system utilization level, using FEC is beneficial because the time a request spends being served is significantly reduced by parallelism. On the other hand, at a high system utilization level, using FEC is detrimental because it creates redundant write or read requests, which further increases system utilization and causes requests to spend significantly more time waiting to be served. Embodiments of the present invention adapt the FEC rate (including no coding) used by different categories of requests according to the backlog size, so that the overall delay
performance is optimized for all levels of system utilization as well as all possible compositions of request arrivals.
[0009] The techniques described herein can be used by the host devices where data is produced and/or consumed as well as by proxy nodes that sit between the host devices and a public storage facility. A public storage facility is accessed using its API, which opens connections between the API client (host or proxy nodes) and the API server (residing in the storage facility). Through the API, clients can issue put, get, delete, copy, list, etc. requests, where appropriate providing security credentials, local keys and global names to uniquely identify the objects, byte strings that represent the object, etc. Although clients are agnostic to how their requests are operationally carried out within the public cloud, they are sensitive to the end-to-end delays incurred in resolving their requests. Measurement studies indicate that even when there is enough network bandwidth and the clients are very close to the storage nodes, there are substantially long tails in delay performance distributions, with the bottom 1% and 0.1% delay profiles observing much worse delays than the average performance. Measurement studies also indicate that the delay performances of parallel requests on different keys are weakly correlated.
[0010] Embodiments of the present invention use multiple categories of request and each category may use different FEC codes with different code dimension Kc. Requests of category C can be served by (Nc-Kc) redundant read or write operations, in addition to the minimum Kc ones. Moreover, in one embodiment, even within the same category C, different requests may be served by a different number of read or write operations, since Nc is a time varying parameter updated on a per-request basis. A feature of embodiments of the present invention is that it allows multiple categories of requests with various ranges of object sizes and delay distributions, and the amount of extra overhead for each category (governed by Nc) is adapted independently based on the system backlog. When the user application targets a particular delay
performance (e.g., a video streaming application requiring low delay), embodiments of the present invention select an appropriate category for that client's requests. The number of redundant read or write operations for requests of different categories is then adjusted independently to deliver the performance.
[0011] In the following description, numerous details are set forth to provide a more thorough explanation of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
[0012] Some portions of the detailed descriptions which follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
[0013] It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
[0014] The present invention also relates to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus.
[0015] The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
[0016] A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory ("ROM"); random access memory ("RAM"); magnetic disk storage media; optical storage media; flash memory devices; etc.
Overview
[0017] Embodiments of the present invention make use of erasure coding techniques to eliminate the tail performers in key-value based storage systems. In one embodiment, requests arriving into the system are classified into different categories, where each category C is specified by a four-tuple <object size Sc, block size Bc, redundancy parameter Mc, type write/read>, depending on the object size (e.g., in bytes), Quality of Service (QoS) delay requirement, and whether it is a put/write request or a get/read request. In one embodiment, all requests belonging to the same category C have identical object size Sc (possibly after padding) and require the same type of operation (write or read). In one embodiment, they share similar QoS delay requirements as well.
[0018] Every object corresponding to a category-C request is divided into smaller objects of size Bc to create an ordered set of Kc=Sc/Bc smaller objects. In one embodiment, for a given category C, Sc and Bc are fixed and hence Kc is fixed, but different categories may have different values of Kc. The objects, starting from the smallest index value to the largest, are given as input blocks to an erasure encoder, where Kc is referred to as the dimension of the code. The encoder then generates (Nc-Kc) output parity blocks of the same fixed size. In one embodiment, Nc is a tunable parameter determined as a function Fc of the number of backlogged requests Q, i.e., Nc=Fc(Q). Mc+1 is the maximum number of extra parity coded blocks allowed for category-C objects (i.e., Nc ≤ Kc+Mc+1). In one embodiment, Nc is updated every time a new request of category C arrives. Adaptation of Nc for different categories is done independently.
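As an illustration of the block-splitting and encoding step above, the following is a minimal Python sketch. It assumes the simplest systematic erasure code, a single XOR parity block (the Nc = Kc + 1 case); a real implementation would typically use a general (Nc, Kc) code such as Reed-Solomon, and the function names here are hypothetical.

```python
import functools


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_object(data: bytes, kc: int) -> list:
    """Split `data` into kc equal-size source blocks (zero-padding the
    tail) and append one XOR parity block -- the Nc = kc + 1 case of a
    systematic erasure code.  Returns the ordered list of Nc blocks."""
    bc = -(-len(data) // kc)                # block size Bc, rounded up
    padded = data.ljust(kc * bc, b"\x00")   # pad object up to kc * bc
    blocks = [padded[i * bc:(i + 1) * bc] for i in range(kc)]
    parity = functools.reduce(xor_blocks, blocks)
    return blocks + [parity]
```

With this single-parity construction, the XOR of any kc of the kc + 1 blocks equals the remaining block, which is what allows one straggling or lost block to be tolerated.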
[0019] The store client stores the original Kc source blocks and (Nc-Kc) parity blocks separately using Nc ordered unique keys in a storage facility (e.g., public storage facility, private storage facility, etc.). When a store client needs to put/write or get/read the large object, it sends Nc parallel put/write or get/read requests using unique keys for a subset of Nc source blocks and/or parity blocks associated with the large object. When the store client receives Kc valid responses to any subset of these Nc requests, it considers the operation completed. If it was a get/read request, the store client reconstructs the original Kc smaller objects through erasure decoding. In reconstruction, the order of keys is used to determine the order of source blocks and parity blocks in the code word generated by the erasure encoder. The erasure coding in the system is used neither to increase storage reliability nor to handle packet losses, but to improve the delay performance at low storage and communication overhead. The value Nc represents the amount of redundancy from using erasure codes, and it is used to maintain a robust balance between system throughput and the potential delay improvement from erasure coding.
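The "first Kc valid responses out of Nc parallel requests" rule described above can be sketched as follows; `fetch` is a stand-in for the per-key cloud read or write call, and the cancellation of stragglers is best-effort only (already-running tasks are allowed to finish).

```python
import concurrent.futures


def first_k_of_n(fetch, keys, k):
    """Issue one parallel task per key and return as soon as any k of
    them complete; remaining queued tasks are cancelled best-effort."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(keys)) as pool:
        futures = {pool.submit(fetch, key): key for key in keys}
        for fut in concurrent.futures.as_completed(futures):
            results[futures[fut]] = fut.result()
            if len(results) >= k:
                for other in futures:
                    other.cancel()   # drops tasks still waiting in queue
                break
    return results
```

This mirrors the operation-completion rule: the caller proceeds with any Kc responses and never waits for the slowest Nc - Kc tasks.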
[0020] In one embodiment, when the earliest Kc responses get delayed over a dynamically or statically determined delay threshold, the store client issues a minimal number of new put/write or get/read requests for a subset of Nc keys that are sufficient to recover all the objects in the originally requested set.
[0021] Figure 3 is a block diagram of one embodiment of a storage system. Referring to
Figure 3, in one embodiment, there are three main components to the architecture: an application 301, a key-value store client 302, and a distributed key-value store 303.
[0022] Application 301 is the consumer of the storage system. Application 301 generates data to be stored in the backend storage (e.g., distributed key-value store 303) and downloads the data stored in the backend storage.
[0023] Key-value store client 302 interfaces application 301 with the backend storage, namely distributed key-value store 303. In one embodiment, key-value store client 302 provides an API to application 301 to receive and respond back to the requests of application 301. These requests include read and write requests and responses. In one embodiment, the read request specifies a filename and the write request specifies a filename and the data object being stored. In one embodiment, the read response specifies a read response and the data object that was requested, and the write response specifies a response indicating that the data object has or has not been successfully stored in the backend storage.
[0024] In one embodiment, key-value store client 302 uses APIs provided by the backend storage to issue subsequent requests to the backend storage in order to resolve requests from application 301 before responding back to application 301. In one embodiment, the read requests to key-value store 303 take the form Read<Key-1> and the write requests to key-value store 303 take the form Write<Key-1, value, metadata>, where Key-1 specifies the location in key-value store 303, "value" specifies the data object being written and "metadata" specifies metadata associated with the data object being stored. In one embodiment, the read responses from key-value store 303 take the form Read<response, value> and the write responses from key-value store 303 take the form Write<response>, where "response" specifies whether the operation was successfully performed, and "value" specifies the data object being read from key-value store 303. In the case of a "value" returned from or sent to key-value storage from the key-value store client, the value corresponds to the encoded version of part of the data object, e.g., one of the N coded blocks.
[0025] Note that in one embodiment, the first K keys correspond to the uncoded sequence of K blocks of a data object and the (K+1)th to Nth keys correspond to parity blocks associated with a data object. Also note that in one embodiment, the metadata is only read if it is not stored locally in memory or disk at key-value store client 302. As will be described in greater detail below, key-value store client 302 returns a response to application 301 after only receiving K successful read/write replies.
[0026] In one embodiment, key-value store client 302 has its own local disk and in-memory cache to store data of application 301 and to resolve requests of application 301. In one embodiment, key-value store client 302 also models the cumulative distribution function of delays for different packet ranges with and without applying FEC. In one embodiment, key-value store client 302 is also responsible for parallelization of read/write requests with the distributed storage backend.
[0027] Distributed key-value store 303 is the distributed storage backend that provides
APIs and/or libraries to the store client for operations such as writing, reading, deleting, copying objects (e.g., a sequence of opaque bytes). Typical examples of such storage backends include, but are not limited to, Amazon S3, Cassandra, DynamoDB, etc. In one embodiment, key-value store 303 provides persistent, highly available and durable storage. To accomplish this, key-value store 303 uses replication where multiple copies of the same object are stored in and accessed from different physical locations. In one embodiment, for increased durability with more storage efficiency, key-value store 303 uses FEC protection within (i.e., in conjunction with data striping) or across the data objects. Such features are transparent to application 301 as well as to key-value store client 302.
[0028] In one embodiment, the processes performed by application 301 and key-value store client 302 run on the same physical machine. In another embodiment, they can be run on different physical machines and communicate directly or over a network.
[0029] Classifier 310, scheduler 320 and cloud performance monitor 330 are parts of key-value store client 302 and are used to specify how different categories of requests are handled and how to decide what FEC code (or number of parallel read/write tasks) is used for different requests to accommodate different arrival rates as well as different request compositions.
[0030] Figure 6 is a flow diagram of one embodiment of a process for read request handling by the request handler, and Figure 7 is a flow diagram of one embodiment of a process for write request handling by the request handler. The processes are performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), firmware or a combination of two or more of them. The operations depicted in Figures 6 and 7 will be described in conjunction with Figures 3 and 4.
[0031] After fetching a read request through interface 401, under one set of conditions
(i.e., normal conditions where no errors have been reported by the underlying cloud API), the following operations are performed:
1. Request handler 400 extracts the unique object ID of the requested object ObjID(Oi) from the incoming message itself. The incoming message is stored in request queue 300 of Figure 3. In Figure 6, this first operation corresponds to processing block 600.
2. To determine where in the storage hierarchy the requested object is stored, request handler 400 issues a mapping service using interface 421 to location mapper 410 with the unique ID of the requested object. If the object is locally stored (e.g., in-memory cache or local disk), then request handler 400 retrieves the data from local storage (not shown in Figure 4) and sends the object to user application 301. These operations correspond to processing blocks 610 and 691 in Figure 6.
3. In one embodiment, if the requested object is not stored locally, request handler 400 retrieves a tuple in the form of {i,N,K} received from scheduler 320 (processing block 620 in Figure 6). Location mapper 410 returns an ordered set of keys (Key1, ..., KeyK+Mc+1) corresponding to the object (Oi) and which data store to be used per key (processing block 621 in Figure 6). This ordered set of keys points to the source blocks and parity blocks for the requested object.
4. Request handler 400 selects any subset S of N keys in {Key1, ..., KeyK+Mc+1}. This corresponds to processing block 630 in Figure 6.
5. In one embodiment, request handler 400 prepares parallel read tasks (processing block 640 in Figure 6), where each task is for one unique key corresponding to a source or parity block. In one embodiment, each task is self-descriptive in the sense that which cloud location should be used for the job is included in its job description. All the parallel read tasks corresponding to the same object are passed as a batch to task queue 440. In one embodiment, the jobs in a batch are not interleaved with the jobs that belong to other batches.
6. In one embodiment, interface 404 serves two purposes: (i) passing the actual tasks and (ii) passing task or batch attributes. In one embodiment, request handler 400 can cancel an individual task or all the tasks of a given batch by changing the task or batch attribute to "cancelled". If the task is still in its queue, task queue 440 deletes the task. Otherwise, task queue 440 issues an abort command to the thread processing the task.
7. Each of worker threads 450 and 460 serves one task at a time and when its current task finishes (handed to request handler 400 to execute processing block 650 in Figure 6), it asks for a new task from task queue 440. Task queue 440 hands the head-of-line task to the requesting worker thread. In one embodiment, worker threads 450 and 460 can access different cloud locations using the task description. The APIs to access these different clouds can be different, and thus the location information dictates which cloud instance should be used and which API call has to be issued. If an abort command is received for an ongoing task, then the worker thread can cancel its current task, return the task status to request handler 400, and request a new task.
8. If FEC is used, request handler 400 passes the source blocks and parity blocks (K in total) of a given batch to FEC decoder 430 as they are returned by the worker threads. If a returned block is a source block, it is also kept by request handler 400. Once it is able to recover any missing source blocks (those not yet received), FEC decoder 430 passes the recovered source blocks to request handler 400 (processing blocks 660 and 670 in Figure 6).
9. Once it receives all the source blocks of the requested object (processing blocks 680 and 690), request handler 400 sends the object Oi back to user application 301 (processing block 691) using interface 302.
10. Once all the source blocks are recovered for a given batch, request handler 400 issues a cancellation request to task queue 440 for the remaining jobs of the same batch (processing block 690), thereby cancelling the remaining jobs.
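The recovery step in blocks 8 and 9 above can be sketched for the single-parity case (Nc = Kc + 1, with the parity block equal to the XOR of all source blocks — an assumption made here for illustration; the embodiments allow general erasure codes). Any Kc of the Kc + 1 blocks suffice to rebuild the object:

```python
import functools


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def decode_object(received: dict, kc: int, size: int) -> bytes:
    """`received` maps block index (0..kc-1 for source blocks, kc for
    the parity block) to block bytes.  Assumes at most one source block
    is missing, which is all a single XOR parity can repair."""
    if all(i in received for i in range(kc)):
        source = [received[i] for i in range(kc)]
    else:
        missing = next(i for i in range(kc) if i not in received)
        # XOR of the kc available blocks (parity included) rebuilds it.
        others = [v for j, v in received.items() if j != missing]
        rebuilt = functools.reduce(xor_blocks, others)
        source = [received.get(i, rebuilt) for i in range(kc)]
    return b"".join(source)[:size]   # strip encoder padding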
[0032] After fetching a write request through interface 401, under a set of conditions
(e.g., normal conditions with no errors reported by the underlying cloud API), the following operations are performed:
1. Request handler 400 extracts the unique ID i of the object (processing block 700 in Figure 7). In one embodiment, the object is locally cached/stored.
2. In one embodiment, request handler 400 retrieves a tuple in the form of {i,N,K} received from scheduler 320 (processing block 710 in Figure 7).
3. If FEC is to be utilized (i.e., K > 1), request handler 400 divides the object into K source blocks and asks FEC encoder 430 to generate N-K parity blocks (processing blocks 720, 730 & 740 of Figure 7). If FEC is not used (i.e., K = 1), then by default a single unique key assignment (e.g., Key1) is made, a single write job is issued, and the success result is sent back when the write job completes successfully (processing blocks 722, 724, 726, 790 of Figure 7).
4. Request handler 400 generates an ordered set S of unique keys (Key1, ..., KeyN) to label each source block and parity block to be written as part of the same write operation. In one embodiment, this metadata is persistently stored locally as well as tagged to the write tasks (i.e., the public cloud will store the metadata as well). These operations are performed as part of processing blocks 750 and 760 in Figure 7.
5. Request handler 400 caches the excessive parity blocks and creates a new batch of tasks, where each task is a write request for a unique key in the newly generated ordered set. This batch of tasks is passed to task queue 440. These operations correspond to processing block 770 in Figure 7. In one embodiment, the jobs in a batch are not interleaved with the jobs that belong to other batches. The jobs can be interleaved if they are demoted to "background job" status.
6. In one embodiment, request handler 400 demotes an individual job or all the jobs of a given batch by changing the job or batch attribute to "background job", and then higher priority jobs can be moved in front of these background jobs. Jobs of different batches that are demoted to background traffic are processed on a first-come, first-served basis. The change of attribute is done through interface 404.
7. In one embodiment, worker threads 450 and 460 serve one task at a time and when their current task finishes, they ask for a new task from task queue 440. Task queue 440 hands the head-of-line task to the requesting worker thread. Worker threads 450 and 460 can access different cloud locations using the task description. The APIs to access these different clouds can be different, and thus the location information dictates which cloud instance should be used and which API call has to be issued.
8. Request handler 400 listens for successful write responses (e.g., ACKs or acknowledgements) from worker threads 450 and 460 of a given batch. After receiving a sufficient number (K) of successful write responses (i.e., ACKs), request handler 400 sends a success response (e.g., an ACK) back to user application 301 that originally issued the write request. These operations correspond to processing blocks 780 and 790 of Figure 7. In one embodiment, request handler 400 demotes the remaining jobs in the same batch to background status by changing the job attributes through interface 404.
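A minimal sketch of the ACK-counting behavior in step 8: issue all N writes in parallel and respond to the application after K successes, letting the remaining writes continue in the background rather than cancelling them (this sketch cannot reprioritize a running thread pool, so "demotion" is approximated by simply not waiting). The `put` callable is a stand-in for the cloud write API.

```python
import concurrent.futures


def write_batch(put, blocks_by_key, k):
    """Issue all writes in parallel; return success as soon as k ACKs
    arrive.  Stragglers keep running as background work -- a real
    client would also demote their priority in the task queue."""
    acks = 0
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(blocks_by_key))
    futures = [pool.submit(put, key, blk) for key, blk in blocks_by_key.items()]
    for fut in concurrent.futures.as_completed(futures):
        if fut.result():                 # True = successful write ACK
            acks += 1
        if acks >= k:
            pool.shutdown(wait=False)    # do not wait for stragglers
            return True
    pool.shutdown()
    return False                         # fewer than k writes succeeded
```

Note the asymmetry with reads: on the read path extra tasks are cancelled outright, while on the write path they are allowed to finish so that all N blocks eventually land in the store.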
[0033] In one embodiment, with respect to adapting the FEC coding, request queue 300, classifier 310, scheduler 320 and cloud performance monitor 330 are involved in the following process:
• User application 301 is the consumer of the storage system. It generates data to be stored in the backend storage and it downloads the data stored in the backend storage.
• In one embodiment, request queue 300 is the module that buffers read or write requests that have been received from the application but have not yet been processed by the key-value store client. Request queue 300 provides an API to classifier 310 to pull meta information of requests, such as, for example, the type of operation (get/read or put/write), object/file size, QoS requirement, etc. Request queue 300 also provides an API to scheduler 320 to pull system backlog information. In one embodiment, the system backlog information includes an instantaneous number of requests waiting in request queue 300, or a moving average of the number of requests waiting in request queue 300 upon the 10 most recent request arrivals.
• In one embodiment, classifier 310 is the module that assigns requests to corresponding categories, depending on the object size (e.g., in bytes), Quality of Service (QoS) delay requirement, and whether it is a put/write request or a get/read request. In one embodiment, each category C is specified by a four-tuple <object size Sc, block size Bc, redundancy parameter Mc, type write/read>. Once a request is assigned to category C, C and the corresponding parameter Kc are passed to scheduler 320. Parameter Mc captures the amount of extra storage and communication cost allowed for category-C requests: Mc+1 is the maximum number of extra parity blocks available for category-C objects. The larger Mc is, the more parallel tasks can be created for a request and hence the more likely the request is to receive lower delay. For this reason, in one embodiment, requests with lower delay requirements are assigned to categories with larger values of Mc.
• In one embodiment, scheduler 320 is the module that determines the value of Nc for each request based on (1) the category C the request belongs to, (2) backlog information provided by request queue 300, and (3) delay statistics provided by cloud performance monitor 330. In one embodiment, scheduler 320 then passes the tuple {Nc,Kc} to the request handler. In one embodiment, the backlog information of request queue 300, denoted as Q, is the instantaneous backlog size, i.e., the number of requests waiting in request queue 300. In another embodiment, Q is a moving average of the backlog size.
• In one embodiment, cloud performance monitor (CPM) 330 is a process that collects logs of delays for tasks belonging to different categories through an API provided by the key-value store client. CPM 330 processes these logs to provide statistics for delay performance of different task types and object sizes, which are used in determining Nc. In one embodiment, the statistics CPM 330 provides are the mean and standard deviation of the collected delays of tasks of each category. In another embodiment, the delay of each category is associated with a statistical model determined in advance. This may be done in a manner well-known in the art. One way to construct such a model is to gather delay measurements and look into the CDF. Then a delay model may be selected manually. For example, the delay can be modeled as a non-negative fixed constant Δ plus an exponential random variable X. Another way to generate the model may be to use data mining techniques. In one embodiment, CPM 330 computes model parameters that fit the collected delays best (possibly using machine learning techniques) and provides these model parameters to scheduler 320 as delay statistics. In another embodiment, no statistical model is given a priori and instead data mining techniques are used to discover structure in the logged delays and provide the models on the fly.
• In one embodiment, key-value store client 302, including request queue 300, classifier 310, scheduler 320, and CPM 330, interfaces user application 301 with the backend storage. In one embodiment, the key-value store client provides an API to user application 301 to receive and respond back to requests of user application 301. According to the tuple {Nc, Kc} received from scheduler 320, key-value store client 302 creates subsequent read or write requests for smaller objects using FEC. For purposes herein, such requests are referred to as tasks in order to distinguish them from the original requests issued by user application 301.
In one embodiment, tasks are served by worker threads 450 and 460 in order to resolve requests from user application 301. Then key-value store client 302 responds back to user application 301. In one embodiment, key-value store client 302 has its own local disk and in-memory cache to store data of user application 301 and to resolve requests of user application 301. In one embodiment, key-value store client 302 is responsible for parallelization of read/write requests with the distributed storage backend. In one embodiment, key-value store client 302 is also responsible for collecting delays for tasks belonging to different categories and provides delay statistics to CPM 330 through the corresponding API.
• In one embodiment, worker threads 450 and 460 are the units that carry out read or write tasks with the backend storage using the provided APIs. In one embodiment, each thread can read or write one object from or to the backend storage system at a time. A thread is active if it is in the process of serving a read or write operation with the backend storage and waiting for the response. Otherwise, the thread is not serving any task and is said to be idle. Also, if Kc tasks for the same category-C request have been completed, any on-going task for the same request can be terminated preemptively and the corresponding thread becomes idle. Once a read or write operation is completed, the thread passes the response to key-value store client 302.
• In one embodiment, task queue 440 is a module that buffers read or write tasks that have been generated by key-value store client 302 but have not yet been served by any of worker threads 450 and 460.
• In one embodiment, distributed key-value store 303 (e.g., private cloud 470, public cloud 490, etc.) is a distributed storage backend that provides APIs and/or libraries to the store client for operations such as writing, reading, deleting, copying objects (e.g., a sequence of opaque bytes). In one embodiment, key-value store 303 provides persistent, highly available and durable storage. To accomplish this, it uses replication where multiple copies of the same object are stored in and accessed from different physical locations. For increased durability with more storage efficiency, the store itself can use FEC protection within (i.e., in conjunction with data striping) or across the data objects. Such features are transparent to the application as well as to key-value store client 302 disclosed in this invention.
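The classifier's category-assignment rule for writes described above ("picks a category C such that Sc ≥ size(Oi)") might be sketched as below. Choosing the *smallest* such Sc is an assumption made here to minimize padding overhead; the text only requires that the object fit.

```python
def classify_write(categories, obj_size):
    """Pick the write category with the smallest Sc that still fits the
    object (the object is padded up to Sc before encoding).
    `categories` is a list of dicts carrying the four-tuple fields
    <Sc, Bc, Mc, type> from the text."""
    fitting = [c for c in categories
               if c["type"] == "write" and c["Sc"] >= obj_size]
    if not fitting:
        raise ValueError("no category large enough for this object")
    return min(fitting, key=lambda c: c["Sc"])
```

For reads, by contrast, the category is not chosen freely: it must match the category under which the object was originally stored (same Sc, Bc, Mc), so a lookup keyed on ObjID would replace the size comparison.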
[0034] In one embodiment, all components except for distributed key-value store 303 run on the same physical machine. In another embodiment, they can be run on different physical machines and communicate over a network.
[0035] In one embodiment, key-value store client 302 assigns categories to application requests and determines the FEC code used to serve each application request. Figure 5 is a flow diagram of one embodiment of a process for request handling performed by a classifier and a scheduler. The process in Figure 5 is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), firmware, or a combination of them.
[0036] After receiving a read (or write) request through interface 350, under one set of conditions (e.g., normal conditions with no error reported by the underlying cloud API), the following operations are performed:
1. Request queue 300 extracts the unique request ID i, type of operation (read or write) and delay target D of the object from the incoming request. This information is then sent to classifier 310 through interface 402. If the type of operation is read, request queue 300 also extracts the unique object ID ObjID(Oi) and sends it to classifier 310. If the type of operation is write, request queue 300 also sends the size of the object size(Oi) to classifier 310. In Figure 5, these operations correspond to processing blocks 500 and 510.
2. Classifier 310 determines which category C the request for object Oi belongs to. If the request is to read the object, Oi should have already been stored in the system with unique object ID ObjID(Oi), and category C is chosen such that it matches the way Oi is stored in the system (i.e., matching Sc, Bc, Mc). If the request is to write the object, then classifier 310 picks a category C such that Sc ≥ size(Oi). Once the category C is decided, classifier 310 passes a tuple {request id i, category C} to scheduler 320 through interface 411. In Figure 5, this operation corresponds to processing block 520.
3. Upon receiving a tuple {request id i, category C}, scheduler 320 requests the queue backlog statistic information Q from request queue 300 using interface 403. In one embodiment, Q can be the instantaneous backlog size, i.e., the number of requests buffered in request queue 300. In another embodiment, Q can be a moving average of the backlog size. In another embodiment, Q can be a vector specifying the number of read and write requests of different object sizes. Then scheduler 320 computes Nc = Fc(Q), where Fc() is a function specified for category C that maps Q into an integer between (and including) Kc and Kc+Mc+1.
4. Then scheduler 320 passes a tuple {request id i, Nc, Kc} to request handler 400 using interface 441. These method steps correspond to processing blocks 530, 540 and 550 in Figure 5.
[0037] In one embodiment, one implementation of Fc is thresholding: each category C is associated with a set of Mc+1 thresholds {Tc,0, Tc,1, ..., Tc,Mc} such that Tc,0 > Tc,1 > ... > Tc,Mc > 0. Q is the backlog size (e.g., instantaneous or moving averaged). Then Fc(Q) equals
- Kc, if Q > Tc,0;
- Kc+1, if Tc,0 ≥ Q > Tc,1;
- ...
- Kc+Mc, if Tc,Mc-1 ≥ Q > Tc,Mc;
- Kc+Mc+1, if Tc,Mc ≥ Q.
[0038] Figure 9 illustrates an example of the thresholding Fc() functions, with two categories R (read) and W (write). NR is decided based on which range between the thresholds {TR,i} Q falls into. Similarly for NW with thresholds {TW,i}.
[0039] Figure 10 is a flow diagram of one embodiment of a process for deciding Nc given a set of Mc+1 thresholds as described above. The process in Figure 10 is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
[0040] Referring to Figure 10, upon receiving the information of a category-C request for object O from classifier 310 (processing block 1000), scheduler 320 reads the set of thresholds {Tc,0, ..., Tc,Mc} and Mc associated with category C (processing block 1010). Scheduler 320 also reads the latest backlog statistic Q from request queue 300 (processing block 1020).
[0041] Scheduler 320 then starts to compare Q with the thresholds in increasing order of i = 0, ..., Mc+1 (processing blocks 1030, 1040, 1050 and 1060). As soon as the first i ≤ Mc such that Q > Tc,i is found, or i becomes Mc+1, it decides Nc = Kc+i (processing block 1070).
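The threshold scan of Figure 10 can be written compactly as below; `thresholds` holds Tc,0 > Tc,1 > ... > Tc,Mc as in the text, so a large backlog maps to little or no redundancy and an empty queue maps to the full Nc = Kc + Mc + 1.

```python
def decide_nc(kc, mc, thresholds, q):
    """Scan thresholds Tc,0 > Tc,1 > ... > Tc,Mc in increasing i and
    return Nc = Kc + i for the first i with Q > Tc,i; if Q is at or
    below every threshold, return the maximum Nc = Kc + Mc + 1."""
    for i in range(mc + 1):
        if q > thresholds[i]:
            return kc + i
    return kc + mc + 1
```

This is the mechanism by which redundancy backs off under load: when the backlog Q is high, extra parity tasks would only add to the queue, so Nc collapses toward Kc.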
[0042] For achieving good delay-throughput performance, the choice of Fc() functions is crucial. Cloud performance monitor 330, referred to as CPM 330, provides information to scheduler 320 for determining and adjusting Fc() for each category according to delay statistics it collects from request handler 400. In one embodiment, worker threads 450 and 460 create a log for successfully completed tasks with information on object size, task type (read or write), sending time, cloud location, and round trip time delay (i.e., from the time the job is scheduled until the time a successful response is received). Figure 11 shows how CPM 330 logs this information in a table that is stored in a database. CPM 330 processes these logs to provide statistics for delay performance of different task types and object sizes, which are used for determining Fc() functions. For example, the processing can be computing the mean and standard deviation of the delay for each task type and object size. This is in fact what was done in the example of Figure 13. In one embodiment, the thresholds for the thresholding Fc() functions described above are found in the following way. The per-task round trip time delay for each category C is modeled by a random variable in the form of Ac+Xc, where Ac is a nonnegative constant and Xc is an exponentially distributed random variable with mean 1/μc. Suppose only requests of category C arrive and the arrivals follow a Poisson process of rate λ. Assume that Nc=Kc+i is fixed, and there are L parallel worker threads (450 and 460) in the system. Also assume that request handler 400 fetches a request from request queue 300 if and only if task queue 440 becomes empty and at least one of worker threads 450 and 460 is idle. Denote the expected time a request spends in request queue 300 as Di,queue(λ), and the time between when a request is fetched by request handler 400 and when it is completed and responded to the application as Di,service. These two quantities can be approximated by
[Equation E01 — the approximations of Di,queue(λ) and Di,service, rendered as an image in the original]
where Ti = L/(NcAc + Kc/μc). The total delay is Di,total(λ) = Di,queue(λ) + Di,service.
[0043] Then λi is solved for so that the equation
[Equation — Di,total(λ) = Di+1,total(λ), rendered as an image in the original]
holds for i = 0, ..., Mc.
This is equivalent to solving the following quadratic equation of λ:
[Equation E02 — the quadratic equation in λ equivalent to Di,total(λ) = Di+1,total(λ), rendered as an image in the original]
[0044] Since this is a quadratic equation of a single unknown, there is a closed-form solution for the roots. The smaller root is taken as λi. In one embodiment, λi is the threshold of the arrival rate λ above which Nc=Kc+i produces a smaller total delay than Nc=Kc+i+1 and below which Nc=Kc+i+1 has a smaller total delay.
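Since the exact quadratic depends on the delay approximation (reproduced only as images in the original), a model-agnostic way to obtain each crossover rate λi is numerical: bisect on the difference of the two total-delay curves. Here `d_total(i, lam)` is a hypothetical stand-in for Di,total(λ), and a single sign change of the difference on [lo, hi] is assumed.

```python
def crossover_rate(d_total, i, lo, hi, iters=60):
    """Find the arrival rate where Nc = Kc + i and Nc = Kc + i + 1
    yield equal total delay, by bisection on the delay difference.
    Assumes d_total(i, .) - d_total(i + 1, .) changes sign once on
    [lo, hi]."""
    def g(lam):
        return d_total(i, lam) - d_total(i + 1, lam)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if (g(lo) > 0) == (g(mid) > 0):
            lo = mid        # root lies in the upper half
        else:
            hi = mid        # root lies in the lower half
    return (lo + hi) / 2.0
```

With the closed-form route of the text, this function would simply be replaced by taking the smaller root of the quadratic E02; the bisection is useful when the delay model is changed and re-deriving the quadratic is inconvenient.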
[0045] Figure 12 is a flow diagram of one embodiment for computing the set of thresholds for Fc() according to the above description. The process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
[0046] Referring to Figure 12, CPM 330 first reads the values of Kc and Mc from the specification of category C, as well as estimates of Ac and μc computed from logs gathered from the worker threads. This first step corresponds to block 1200 in Figure 12.
[0047] For i = 0, …, Mc, CPM 330 solves the quadratic equation E02 for λi, with the delay quantities in E02 computed according to equation E01. For every λi obtained, the threshold is set as Tc,i = λi. These method operations correspond to processing blocks 1210, 1220, 1230, 1240, 1250 and 1260 in Figure 12.
[0048] Once all Mc+1 thresholds are computed, the set of thresholds {Tc,0, …, Tc,Mc} for category C is updated (processing block 1270).
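Since equations E01 and E02 survive only as images, the Figure 12 threshold loop can be sketched with a stand-in delay model and a numeric root search in place of the closed-form quadratic. Everything below is an illustrative assumption, not the patent's exact E01/E02: the service term models the K-th of N = K + i exponential completions, the queueing term is an M/M/1-style approximation with the effective rate Ti = L/(Nc·Ac + Kc/μc) defined above, and each threshold λi is found by bisection at the point where redundancy levels i and i + 1 yield equal total delay.

```python
import math

def _harmonic_range(a, b):
    # H(a) - H(b) = sum of 1/j for j = b+1 .. a
    return sum(1.0 / j for j in range(b + 1, a + 1))

def d_total(lam, i, K, L, A, mu):
    """Illustrative total delay at arrival rate lam for redundancy level i (N = K + i)."""
    n = K + i
    t_i = L / (n * A + K / mu)                      # effective system service rate Ti
    if lam >= t_i:
        return math.inf                             # unstable: queue grows without bound
    service = A + _harmonic_range(n, n - K) / mu    # K-th of n exponential completions
    queue = lam / (t_i * (t_i - lam))               # M/M/1-style waiting-time term
    return service + queue

def threshold_lambda(i, K, L, A, mu, iters=100):
    """Arrival rate at which levels i and i+1 give equal delay (threshold Tc,i)."""
    f = lambda lam: d_total(lam, i, K, L, A, mu) - d_total(lam, i + 1, K, L, A, mu)
    lo = 1e-9
    hi = L / ((K + i + 1) * A + K / mu) - 1e-9      # just below the i+1 stability limit
    if not (f(lo) > 0 > f(hi)):                     # i+1 must win at low load,
        return None                                 # i must win near saturation
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def compute_thresholds(K, M, L, A, mu):
    """Blocks 1210-1270 of Figure 12: one threshold per i = 0..M."""
    return [threshold_lambda(i, K, L, A, mu) for i in range(M + 1)]
```

Under this model the bracketing check makes the crossing explicit: extra redundancy shortens service at light load but saturates the workers sooner, so each pair of levels has a single crossover rate.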
[0049] In one embodiment, in order to determine the Fc() functions, CPM 330 requires knowledge of the delay performance statistics of the cloud storage systems. In one embodiment, delay performance statistics are collected in offline measurements a priori. If this is the case, the Fc() functions are determined a priori and used statically throughout the execution of the system. In another embodiment, delay statistics of the cloud storage systems are collected online and are updated every time a worker thread finishes serving a task and produces a new log entry. In this case, CPM 330 needs to recompute the Fc() functions periodically in order to track the variation in performance of the cloud storage system. Given that the performance of the cloud storage system may change unpredictably over time, the online approach is preferred.
[0050] Figure 13 is a flow diagram of one embodiment of a process for performing online estimation of Ac and μc for the thresholding Fc() described earlier, using exponential moving averages. The process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
[0051] Referring to Figure 13, at the start of the system, processing logic initializes two variables Ec and Vc to some nonnegative values. Processing logic also reads the constants αc and βc from the system configuration; both αc and βc are in the range [0, 1]. This initial operation corresponds to processing block 1300 in Figure 13.
[0052] Every time a new log entry is received from a worker thread for an object that matches the block size Bc and the type of operation (read or write) of category C, processing logic reads the round-trip time delay d of that entry. This operation corresponds to block 1310 in Figure 13.
[0053] Processing logic updates Ec and Vc using the equations Ec = (1 − αc)Ec + αc·d and Vc = (1 − βc)Vc + βc·d². This operation corresponds to block 1320 in Figure 13.
[0054] Processing logic updates μc = (Vc − Ec²)^(−1/2) and Ac = Ec − 1/μc. This operation corresponds to block 1330 in Figure 13.
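The three update steps (blocks 1310 through 1330) fit in a few lines. A sketch, with the class name and return convention as assumptions: for a delay modeled as Ac plus an exponential with mean 1/μc, the standard deviation of the delay equals 1/μc, which is why μc = (Vc − Ec²)^(−1/2) and Ac = Ec − 1/μc recover the two parameters from the running first and second moments.

```python
class DelayEstimator:
    """Online (Ac, mu_c) estimation for one category via exponential moving averages."""

    def __init__(self, alpha, beta, e0=0.0, v0=0.0):
        # alpha, beta play the roles of the configured constants in [0, 1]
        assert 0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0
        self.alpha, self.beta = alpha, beta
        self.e, self.v = e0, v0          # running means of d and d^2 (block 1300)

    def update(self, d):
        """Fold in one log entry's round-trip delay d; return (Ac, mu_c) or None."""
        self.e = (1 - self.alpha) * self.e + self.alpha * d       # block 1320
        self.v = (1 - self.beta) * self.v + self.beta * d * d
        var = self.v - self.e ** 2
        if var <= 0:
            return None                  # moments not yet consistent with the model
        mu = var ** -0.5                 # std of the exponential part is 1/mu (block 1330)
        return self.e - 1.0 / mu, mu
```

A deployment would feed update() from the worker-thread log stream and ignore None results until the estimate warms up.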
[0055] In one embodiment, request queue 300 comprises a first-in first-out (FIFO) queue in which the read or write requests are buffered. In another embodiment, it can be implemented as a priority queue, in which requests with a lower delay requirement are given strict priority and placed at the head of the request queue. The head-of-line request is removed from request queue 300 and transferred to request handler 400 through interface 401 when a "fetch request" message is received from request handler 400. It is up to request handler 400 to decide when to fetch a request from request queue 300. In one embodiment, the preference is to fetch when task queue 440 becomes empty and at least one worker thread (450 and 460) is idle. When fetched, request queue 300 transfers the head-of-line request to request handler 400 using interface 401 and removes it from the queue.

[0056] After fetching a request from request queue 300, request handler 400 looks up a tuple of the form {i,N,K} received from scheduler 320. If the request is to read an object, this information specifies that the requested object has been divided into K source blocks and that at least N-K parity blocks have been generated. Request handler 400 then creates N read tasks, each for reading one of the source or parity blocks corresponding to the requested object. These tasks are then inserted into task queue 440. If the request is to write an object, request handler 400 divides the object into K source blocks and generates N-K parity blocks. Request handler 400 then creates N write tasks, each for writing one of the source or parity blocks to the cloud storage system. These tasks are then inserted into task queue 440. As soon as any K of these tasks have completed, the original request is considered completed and request handler 400 sends a success response (e.g., an ACK) to the application using interface 302.
If the request is a read, the response contains the requested object obtained from FEC decoder 430. Details of how request handler 400 serves read and write requests are given above.
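The write half of the paragraph above (divide into K source blocks, generate parity, one task and one key per block) can be sketched with the simplest erasure code, a single XOR parity block, i.e., N = K + 1. The key-naming scheme and function names are illustrative assumptions; a production system would use a general (N, K) code such as Reed-Solomon.

```python
def split_blocks(data: bytes, k: int):
    """Divide an object into k equal-size source blocks, zero-padding the tail."""
    size = -(-len(data) // k)                      # ceil(len / k)
    data = data.ljust(k * size, b"\0")
    return [data[j * size:(j + 1) * size] for j in range(k)]

def xor_parity(blocks):
    """One parity block: byte-wise XOR of all source blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for j, byte in enumerate(blk):
            out[j] ^= byte
    return bytes(out)

def make_write_tasks(obj_key: str, data: bytes, k: int):
    """One write task per block, each under its own distinct, ordered key."""
    blocks = split_blocks(data, k)
    blocks.append(xor_parity(blocks))              # N = k + 1 blocks in total
    return [(f"{obj_key}.block{j}", blk) for j, blk in enumerate(blocks)]
```

Losing any single block is survivable under this layout: the XOR of the surviving k blocks reproduces the missing one, which is what lets a request complete after any k of the N transfers.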
[0057] In one embodiment, task queue 440 comprises a first-in first-out (FIFO) queue, where the read or write tasks that belong to the same application request are put in one batch, with no interleaving with tasks that belong to other FEC batches. Individual worker threads serve one task at a time, and when any thread becomes idle, it takes the task waiting at the head of the task queue. Figure 8 depicts these parallel threads, which execute read/write tasks and obtain new tasks from task queue 440 when they are not servicing a current task. When there is congestion, i.e., more tasks are waiting in the task queue than there are idle threads, the delay performance worsens. For that reason, in another embodiment, requests with a lower delay requirement (e.g., those using lower-rate FEC codes) are given strict priority and placed at the head of task queue 440. In another embodiment, some threads can be pooled together to serve only the high-priority jobs, or can be used in a preemptive mode (i.e., a low-priority job is stopped or cancelled to serve a high-priority job).
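A sketch of the batched priority variant described above, with hypothetical class and field names: the heap key (delay requirement, batch number, position in batch) gives strict priority to lower delay requirements while keeping each request's tasks contiguous rather than interleaved with other FEC batches.

```python
import heapq
import itertools

class TaskQueue:
    """Priority task queue: a lower delay-requirement value means higher priority."""

    def __init__(self):
        self._heap = []
        self._batch_ids = itertools.count()   # monotonically increasing batch number

    def put_batch(self, delay_req, tasks):
        """Enqueue all tasks of one application request as a single batch."""
        batch = next(self._batch_ids)
        for seq, task in enumerate(tasks):
            # (priority, batch, seq) orders batches atomically: same-priority
            # batches never interleave, and tasks keep their in-batch order.
            heapq.heappush(self._heap, (delay_req, batch, seq, task))

    def get(self):
        """Hand the head-of-line task to an idle worker thread."""
        return heapq.heappop(self._heap)[3]
```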
[0058] Figure 14 is a flow diagram of one embodiment of a process for a storage controller, such as a storage client. The process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
[0059] Referring to Figure 14, the process begins by computing a set of thresholds for each category into which a request can be classified (processing block 1401). In one
embodiment, the thresholds are based on a model of delay statistics of different types of portions of data objects. [0060] Thereafter, processing logic receives a data object and a request (processing block
1402).
[0061] Processing logic classifies the request into a category (processing block 1403). In one embodiment, classifying the request into the category is based on whether the request is for a write operation or a read operation, file size of the object, and size of the K portions.
[0062] Processing logic also divides the data object into K portions, where K is an integer
(processing block 1404) and assigns a distinct key to each of the K portions (processing block
1405).
[0063] Processing logic then selects an FEC coding rate based on backlog associated with at least one queue (processing block 1406). In one embodiment, the at least one queue comprises a first queue into which requests to the key-value based storage are received. In one embodiment, selecting the FEC coding rate is based on the object's size. In one embodiment, selecting an FEC coding rate is based on Quality of Service (QOS) requirements associated with the request.
[0064] In one embodiment, selecting the FEC coding rate comprises selecting N based on the category. In one embodiment, selecting the FEC rate comprises comparing a backlog statistic to one or more thresholds in the set of thresholds to determine N.
[0065] In one embodiment, N has been computed as a function of a backlog statistic. In one embodiment, the backlog statistic is based on a number of download and upload jobs waiting to be started for at least one category. In one embodiment, the backlog statistic is based on a total number of download and upload jobs waiting to be started for all categories into which requests can be classified.
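One plausible realization of the threshold comparison in the last two paragraphs, stated as an assumption rather than the patent's exact Fc(): with thresholds Tc,i decreasing in i (extra redundancy only pays off at lower load), choose the smallest i whose threshold the current backlog statistic exceeds, and fall back to maximum redundancy when the system is lightly loaded.

```python
def select_redundancy(stat, thresholds):
    """Map a backlog/arrival-rate statistic to a redundancy level i (N = K + i).

    thresholds[i] is the rate above which N = K + i yields lower delay than
    N = K + i + 1; the list is assumed sorted in decreasing order.
    """
    for i, t in enumerate(thresholds):
        if stat > t:
            return i                 # heavy load: less redundancy wins
    return len(thresholds)           # light load: maximum redundancy wins
```

For example, with thresholds [4.0, 2.0, 1.0], a statistic of 3.0 falls between Tc,0 and Tc,1 and selects i = 1.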
[0066] After selecting the FEC rate, processing logic applies FEC coding based on the
FEC rate set to the K portions to create N FEC coded data blocks, where N is an integer greater than or equal to K (processing block 1407).
[0067] After applying the FEC coding, processing logic assigns a distinct key to each of
N blocks of data resulting from applying the erasure coding to the K portions (processing block 1408) and orders the keys assigned to the K portions and the keys assigned to the N blocks (processing block 1409).
[0068] After applying the erasure coding, processing logic sends the N blocks of data using separate transfers to the storage system (processing block 1410). In one embodiment, sending the N blocks of data over distinct connections to the storage system comprises sending at least two of the N blocks in parallel over two of the distinct connections.
[0069] In one embodiment, sending the N blocks of data using N separate transfers to the storage system comprises sending all N blocks in parallel on separate connections to the key- value store, including cancelling any of the N separate transfers that haven't been completed successfully after K of the N separate transfers have completed successfully.
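The parallel-send-then-cancel behavior of the last two paragraphs can be sketched with a thread pool. `store.put` is a hypothetical key-value client call; cancelling an in-flight transfer is best-effort here (a future that has already started cannot be revoked), and any completion is counted as a success for brevity.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def put_object(store, tasks, k):
    """Send all N blocks in parallel over separate connections; return once any
    k transfers complete, cancelling whatever is still outstanding."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        pending = {pool.submit(store.put, key, blk) for key, blk in tasks}
        completed = 0
        while completed < k:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            completed += len(done)
        for f in pending:            # the N - k stragglers
            f.cancel()
    return True
```

Note the context manager still joins the pool on exit; a latency-critical path would instead shut the pool down without waiting for the cancelled stragglers.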
[0070] Subsequently, when the object is requested, processing logic generates a plurality of individual requests, where each request for requesting one of the N blocks of data from storage (processing block 1411), applies erasure decoding as each of N blocks are received (processing block 1412), cancels N-K requests that remain outstanding after receiving K out of N blocks (processing block 1413), and returns the object to a requester (processing block 1414).
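The read path (processing blocks 1411 through 1413) mirrors the write path, again under the illustrative single-XOR-parity layout (N = K + 1, keys `<obj_key>.blockj`) and a hypothetical `store.get(key) -> bytes` client; the "decoding" here is the trivial XOR recovery of at most one missing source block, standing in for general erasure decoding.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
from functools import reduce

def get_object(store, obj_key, k, size):
    """Issue N = k + 1 parallel reads; decode as soon as any k blocks arrive
    and cancel the outstanding requests."""
    n = k + 1
    xor = lambda a, b: bytes(x ^ y for x, y in zip(a, b))
    with ThreadPoolExecutor(max_workers=n) as pool:
        futs = {pool.submit(store.get, f"{obj_key}.block{j}"): j for j in range(n)}
        got, pending = {}, set(futs)
        while len(got) < k:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for f in done:
                got[futs[f]] = f.result()
        for f in pending:
            f.cancel()                       # the remaining n - k reads
    if len(got) < n and any(j not in got for j in range(k)):
        # One source block still missing: the XOR of the k received blocks
        # (which include the parity block) reproduces it.
        recovered = reduce(xor, got.values())
        missing = next(j for j in range(k) if j not in got)
        got[missing] = recovered
    return b"".join(got[j] for j in range(k))[:size]
```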
An Example of a System
[0071] Figure 15 depicts a block diagram of a storage gateway that may be used to access a backend storage system, such as a cloud-based storage system. Such access to the backend storage system may be over a network (e.g., wide-area network, local area network, internet, etc.). As a storage gateway, the system can interface clients running user applications to backend storage systems. Such client may be coupled directly to the storage gateway or may
communicate with the storage gateway over a network (e.g., wide-area network, local area network, internet, etc.). Note that the system depicted in Figure 15 may also be a client device that performs the operations described above or interacts with a storage gateway to read or write data objects.
[0072] In one embodiment, the storage gateway of Figure 15 executes and performs the operations associated with the application shown in Figure 4.
[0073] Referring to Figure 15, storage gateway 1510 includes a bus 1512 to interconnect subsystems of storage gateway 1510, such as a processor 1514, a system memory 1517 (e.g., RAM, ROM, etc.), an input/output controller 1518, an external device, such as a display screen 1524 via display adapter 1526, serial ports 1528 and 1530, a keyboard 1532 (interfaced with a keyboard controller 1533), a storage interface 1534, a floppy disk drive 1537 operative to receive a floppy disk 1538, a host bus adapter (HBA) interface card 1535A operative to connect with a Fibre Channel network 1590, a host bus adapter (HBA) interface card 1535B operative to connect to a SCSI bus 1539, and an optical disk drive 1540. Also included are a mouse 1546 (or other point-and-click device, coupled to bus 1512 via serial port 1528), a modem 1547 (coupled to bus 1512 via serial port 1530), and a network interface 1548 (coupled directly to bus 1512).
[0074] Bus 1512 allows data communication between central processor 1514 and system memory 1517. System memory 1517 (e.g., RAM) may be generally the main memory into which the operating system and application programs are loaded. The ROM or flash memory can contain, among other code, the Basic Input-Output system (BIOS) which controls basic hardware operation such as the interaction with peripheral components. Applications resident with computer system 1510 are generally stored on and accessed via a computer readable medium, such as a hard disk drive (e.g., fixed disk 1544), an optical drive (e.g., optical drive 1540), a floppy disk unit 1537, or other storage medium.
[0075] Storage interface 1534, as with the other storage interfaces of computer system
1510, can connect to a standard computer readable medium for storage and/or retrieval of information, such as a fixed disk drive 1544. Fixed disk drive 1544 may be a part of computer system 1510 or may be separate and accessed through other interface systems.
[0076] Modem 1547 may provide a direct connection to a backend storage system or a client via a telephone link or to the Internet via an internet service provider (ISP). Network interface 1548 may provide a direct connection to a backend storage system and/or a client.
Network interface 1548 may provide a direct connection to a backend storage system and/or a client via a direct network link to the Internet via a POP (point of presence). Network interface 1548 may provide such connection using wireless techniques, including digital cellular telephone connection, a packet connection, digital satellite data connection or the like.
[0077] Many other devices or subsystems (not shown) may be connected in a similar manner (e.g., document scanners, digital cameras and so on). Conversely, all of the devices shown in Figure 15 need not be present to practice the techniques described herein. The devices and subsystems can be interconnected in different ways from that shown in Figure 15. The operation of a computer system such as that shown in Figure 15 is readily known in the art and is not discussed in detail in this application.
[0078] Code to implement the storage gateway operations described herein can be stored in computer-readable storage media such as one or more of system memory 1517, fixed disk 1544, optical disk 1542, or floppy disk 1538. The operating system provided on computer system 1510 may be MS-DOS®, MS-WINDOWS®, OS/2®, UNIX®, Linux®, or another known operating system.
[0079] Whereas many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description, it is to be understood that any particular embodiment shown and described by way of illustration is in no way intended to be considered limiting. Therefore, references to details of various embodiments are not intended to limit the scope of the claims which in themselves recite only those features regarded as essential to the invention.

Claims

We claim:
1. A method for use in a key-value based storage system, the method comprising: receiving a data object and a request;
dividing the data object into K portions, where K is an integer;
selecting an FEC coding rate based on backlog associated with at least one queue;
applying FEC coding based on the FEC rate set to the K portions to create N FEC coded data blocks, where N is an integer greater than or equal to K; and
sending the N FEC coded data blocks to the storage system.
2. The method defined in Claim 1 wherein the at least one queue comprises a first queue into which requests to the key-value based storage are received.
3. The method defined in Claim 1 wherein selecting the FEC coding rate is based on the object's size.
4. The method defined in Claim 1 wherein selecting an FEC coding rate is based on Quality of Service (QOS) requirements associated with the request.
5. The method defined in Claim 1 further comprising classifying the request into a category, and wherein selecting the FEC coding rate comprises selecting N based on the category.
6. The method defined in Claim 5 wherein classifying the request into the category is based on whether the request is for a write operation or a read operation, file size of the object, and size of the K portions.
7. The method defined in Claim 5 wherein N has been computed as a function of at least one backlog statistic.
8. The method defined in Claim 7 wherein the at least one backlog statistic is based on a number of download and upload jobs waiting to be started for at least one category.
9. The method defined in Claim 8 wherein the at least one backlog statistic is based on a total number of download and upload jobs waiting to be started for all categories into which requests can be classified.
10. The method defined in Claim 6 further comprising computing a set of thresholds for each category into which a request can be classified and comparing a backlog statistic to one or more thresholds in the set of thresholds to determine N.
11. The method defined in Claim 10 wherein the thresholds are based on a model of delay statistics of different types of portions of data objects.
12. The method defined in Claim 1 further comprising:
assigning a distinct key to each of the K portions;
assigning a distinct key to each of N blocks of data resulting from applying the erasure coding to the K portions;
ordering the keys assigned to the K portions and the keys assigned to the N blocks; and wherein sending the N blocks of data using N separate transfers to the storage system comprises sending all N blocks in parallel on separate connections to the key-value store, including cancelling any of the N separate transfers that haven't been completed successfully after
K of the N separate transfers have completed successfully.
13. The method defined in Claim 1 further comprising:
generating a plurality of individual requests, each request for requesting one of the N blocks of data from storage;
applying erasure decoding as each of N blocks are received;
cancelling N-K requests that remain outstanding after receiving K out of N blocks; and returning the object to a requester.
14. An apparatus for use in a key-value based storage system, the apparatus comprising:
a communication interface for coupling to a network, the communication interface operable to receive a data object and a request from the network; a memory coupled to the communication interface to store the data object and the request; and
a processor coupled to the memory and the communication interface, the processor operable to
divide the data object into K portions, where K is an integer;
select an FEC coding rate based on backlog associated with at least one queue in the memory;
apply FEC coding based on the FEC rate set to the K portions to create N FEC coded data blocks, where N is an integer greater than or equal to K; and
send the N FEC coded data blocks to the storage system.
15. The apparatus defined in Claim 14 wherein the at least one queue comprises a first queue into which requests to the key-value based storage are stored upon receipt.
16. The apparatus defined in Claim 14 wherein the processor selects the FEC coding rate based on the object's size.
17. The apparatus defined in Claim 14 wherein the processor selects the FEC coding rate based on Quality of Service (QOS) requirements associated with the request.
18. The apparatus defined in Claim 14 wherein the processor comprises a classifier to classify the request into a category and selects N based on the category.
19. The apparatus defined in Claim 18 wherein the classifier classifies the request into the category based on whether the request is for a write operation or a read operation, file size of the object, and size of the K portions.
20. The apparatus defined in Claim 18 wherein N has been computed as a function of at least one backlog statistic.
21. The apparatus defined in Claim 20 wherein the at least one backlog statistic is based on a number of download and upload jobs waiting to be started for at least one category.
22. The apparatus defined in Claim 21 wherein the at least one backlog statistic is based on a total number of download and upload jobs waiting to be started for all categories into which requests can be classified.
23. The apparatus defined in Claim 19 wherein the processor computes a set of thresholds for each category into which a request can be classified and compares a backlog statistic to one or more thresholds in the set of thresholds to determine N.
24. The apparatus defined in Claim 23 wherein the thresholds are based on a model of delay statistics of different types of portions of data objects.
25. The apparatus defined in Claim 14 wherein the processor:
assigns a distinct key to each of the K portions;
assigns a distinct key to each of N blocks of data resulting from applying the erasure coding to the K portions; and
orders the keys assigned to the K portions and the keys assigned to the N blocks; wherein the processor causes the communication interface to send the N blocks of data using N separate transfers to the storage system by sending all N blocks in parallel on separate connections to the key-value store, and cancels any of the N separate transfers that haven't been completed successfully after K of the N separate transfers have completed successfully.
26. The apparatus defined in Claim 14 wherein the processor:
generates a plurality of individual requests, each request for requesting one of the N blocks of data from storage;
applies erasure decoding as each of N blocks are received;
cancels N-K requests that remain outstanding after receiving K out of N blocks; and returns the object to a requester.
27. An article of manufacture having one or more non-transitory computer readable storage media storing instructions which, when executed by a system, causes the system to perform a method comprising:
receiving a data object and a request;
dividing the data object into K portions, where K is an integer; selecting an FEC coding rate based on backlog associated with at least one queue;
applying FEC coding based on the FEC rate set to the K portions to create N FEC coded data blocks, where N is an integer greater than or equal to K; and
sending the N FEC coded data blocks to a key-value based storage system.
28. The article of manufacture defined in Claim 27 wherein the at least one queue comprises a first queue into which requests to the key-value based storage are received.
PCT/US2013/030926 2012-12-04 2013-03-13 A method and apparatus for code length adaptation for access to key-value based cloud storage systems WO2014088624A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/649,530 US20150309874A1 (en) 2012-12-04 2013-03-13 A method and apparatus for code length adaptation for access to key-value based cloud storage systems

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261733339P 2012-12-04 2012-12-04
US61/733,339 2012-12-04

Publications (1)

Publication Number Publication Date
WO2014088624A1 true WO2014088624A1 (en) 2014-06-12

Family

ID=50883844

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2013/030926 WO2014088624A1 (en) 2012-12-04 2013-03-13 A method and apparatus for code length adaptation for access to key-value based cloud storage systems

Country Status (2)

Country Link
US (1) US20150309874A1 (en)
WO (1) WO2014088624A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102693A (en) * 2014-06-19 2014-10-15 广州华多网络科技有限公司 Object processing method and device

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102154187B1 (en) * 2014-08-07 2020-09-09 삼성전자 주식회사 Memory device, memory system and operating method of memory system
JP6128083B2 (en) * 2014-09-08 2017-05-17 富士通株式会社 Wireless communication apparatus and calculation method
US9552223B2 (en) * 2014-09-30 2017-01-24 International Business Machines Corporation Post-return asynchronous code execution
US10628221B1 (en) * 2015-09-30 2020-04-21 EMC IP Holding Company LLC Method and system for deadline inheritance for resource synchronization
US9733834B1 (en) * 2016-01-28 2017-08-15 Weka.IO Ltd. Congestion mitigation in a distributed storage system
US10650621B1 (en) 2016-09-13 2020-05-12 Iocurrents, Inc. Interfacing with a vehicular controller area network
US10783136B1 (en) * 2017-02-28 2020-09-22 Virtuozzo International Gmbh Management of garbage data in distributed systems
US10691694B2 (en) * 2017-06-13 2020-06-23 Oracle International Corporation Method and system for defining an object-agnostic offlinable synchronization model
US11693906B2 (en) 2017-06-13 2023-07-04 Oracle International Comporation Method and system for using access patterns to suggest or sort objects
US10846283B2 (en) 2017-06-13 2020-11-24 Oracle International Corporation Method and system for defining an adaptive polymorphic data model
CN107592275A (en) * 2017-11-09 2018-01-16 深圳门之间科技有限公司 One kind is lined up control method and its system
US10929186B2 (en) * 2018-12-12 2021-02-23 Servicenow, Inc. Control token and hierarchical dynamic control
US11048553B1 (en) * 2020-09-14 2021-06-29 Gunther Schadow Processing of messages and documents carrying business transactions

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100273503A1 (en) * 1997-02-11 2010-10-28 Qualcomm Incorporated Method and apparatus for forward link rate scheduling
US20100323708A1 (en) * 2009-06-22 2010-12-23 Kozat Ulas C Method and apparatus for sending information via silent symbol coding over under-utilized channels in wireless systems
US20120254688A1 (en) * 2011-04-01 2012-10-04 Cleversafe, Inc. Appending data to existing data stored in a dispersed storage network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7508785B2 (en) * 2002-11-06 2009-03-24 Broadcom Corporation Downstream time domain based adaptive modulation for DOCSIS based applications
US9201825B1 (en) * 2011-11-02 2015-12-01 Marvell International Ltd. Data storage methods and apparatus


Also Published As

Publication number Publication date
US20150309874A1 (en) 2015-10-29

Similar Documents

Publication Publication Date Title
US20150309874A1 (en) A method and apparatus for code length adaptation for access to key-value based cloud storage systems
US9426517B2 (en) Method and apparatus for low delay access to key-value based storage systems using FEC techniques
US10318467B2 (en) Preventing input/output (I/O) traffic overloading of an interconnect channel in a distributed data storage system
US11558270B2 (en) Monitoring a stale data queue for deletion events
CN111629028B (en) Data transmission scheduling system for distributed multi-cloud storage
US10140236B2 (en) Receiving buffer credits by a plurality of channels of one or more host computational devices for transmitting data to a control unit
US8315992B1 (en) Affinity based allocation for storage implementations employing deduplicated data stores
US10289591B2 (en) Adjustment of buffer credits and other parameters in a startup phase of communications between a plurality of channels and a control unit
US10419528B2 (en) Dynamically instantiating and terminating data queues
US9998532B2 (en) Computer-based, balanced provisioning and optimization of data transfer resources for products and services
US10599529B2 (en) Instantiating data queues for management of remote data stores
KR20150132859A (en) Automatic tuning of virtual data center resource utilization policies
US20100115078A1 (en) Distributed storage system
US10404603B2 (en) System and method of providing increased data optimization based on traffic priority on connection
US10482084B2 (en) Optimized merge-sorting of data retrieved from parallel storage units
US10642585B1 (en) Enhancing API service schemes
US20170147518A1 (en) Scanning memory for de-duplication using rdma
US10929424B1 (en) Cloud replication based on adaptive quality of service
US10135750B1 (en) Satisfaction-ratio based server congestion control mechanism
EP3989053B1 (en) Creation and use of an efficiency set to estimate an amount of data stored in a data set of a storage system having one or more characteristics
US11627097B2 (en) Centralized quality of service management

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13859620

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 14649530

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13859620

Country of ref document: EP

Kind code of ref document: A1