WO2020199192A1 - Split-key estimation method for table partition in distributed data storage systems - Google Patents

Split-key estimation method for table partition in distributed data storage systems

Info

Publication number
WO2020199192A1
Authority
WO
WIPO (PCT)
Prior art keywords
block
key
blocks
tree structure
data storage
Prior art date
Application number
PCT/CN2019/081492
Other languages
French (fr)
Inventor
Chen Fu
Chunhui SHEN
Original Assignee
Alibaba Group Holding Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Priority to PCT/CN2019/081492
Publication of WO2020199192A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278: Data partitioning, e.g. horizontal or vertical partitioning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2282: Tablespace storage structures; Management thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed data storage system incorporating a light-weight method for split key estimation is provided. The distributed data storage system comprises partitions represented by a rooted tree structure comprising a root-level index block (1100), a set of intermediate-level index blocks (2100, 2200, 2300, 2400) organized into one or more index levels, and a set of data-level blocks (3100, 3200, 3300, 3400, 3500, 3600) having row-key pairings that correspond to indexes provided in the set of intermediate-level index blocks (2100, 2200, 2300, 2400) and the root-level index block (1100). The method comprises obtaining a start key and an end key based on a request to split a partition, determining a start block based on a search of the rooted tree structure using the start key, determining an end block based on a search of the rooted tree structure using the end key, determining a middle block of the rooted tree structure based on the start block and the end block, determining a split key from the middle block, and splitting the partition based on the determined split key. A non-transitory computer-readable medium is also provided that stores a set of instructions executable by one or more processors of an apparatus to perform the light-weight method for split key estimation.

Description

[Title established by the ISA under Rule 37.2] SPLIT-KEY ESTIMATION METHOD FOR TABLE PARTITION IN DISTRIBUTED DATA STORAGE SYSTEMS
BACKGROUND
In distributed data storage systems, it is often not feasible to store all the data in a single table. Instead, such systems divide a data table into distinct parts called “partitions. ” When the data volume of a table grows, there is a need to split partitions in order to keep partition sizes manageable. Since the distribution of data is often uneven in a distributed data storage system, finding at least one split key that divides a key range evenly can be both challenging and time-consuming. There exists a need for a light-weight estimation method to determine one or more split keys.
SUMMARY
Embodiments of the present disclosure provide a method for split key estimation in a distributed data storage system. The distributed data storage system comprises partitions represented by a rooted tree structure comprising a root-level index block, a set of intermediate-level index blocks organized into one or more index levels, and a set of data-level blocks having row-key pairings that correspond to indexes provided in the set of intermediate-level index blocks. The method comprises obtaining a start key and an end key based on a request to split a partition, determining a start block based on a search of the rooted tree structure using the start key, determining an end block based on a search of the rooted tree structure using the end key, determining a middle block of the rooted tree structure based on the start block and the end block, determining a split key from the middle block, and splitting the partition based on the determined split key.
Embodiments of the present disclosure further provide a distributed data storage system comprising a plurality of partitions and a processor. The plurality of partitions are configured to be organized in the form of a rooted tree structure comprising a root-level index block, intermediate-level index blocks organized into one or more index levels, and a set of data-level blocks having row-key pairings that correspond to indexes provided in the intermediate-level index blocks. The processor is configured to obtain a start key and an end key based on a request to split a partition, to determine a start block and an end block based on a search of the rooted tree structure using the start key and the end key respectively, to determine a middle block based on the start block and the end block, to determine a split key from the middle block, and to split the partition based on the determined split key.
Embodiments of the present disclosure further provide a non-transitory computer readable medium that stores a set of instructions that are executable by one or more processors of an apparatus to perform a method. The method comprises obtaining a start key and an end key, determining a start block and an end block based on a search of a rooted tree structure using the obtained start key and end key, determining a middle block based on the start block and the end block, determining a split key from the middle block, and splitting a partition based on the determined split key.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments and various aspects of the present disclosure are illustrated in the following detailed description and the accompanying figures. Various features shown in the figures are not drawn to scale.
FIG. 1A is a schematic diagram illustrating an exemplary distributed data storage system comprising servers having partitions, according to some embodiments of the present disclosure.
FIG. 1B is a schematic diagram illustrating an exemplary server of a distributed data storage system, according to some embodiments of the present disclosure.
FIG. 2 is a schematic diagram of an exemplary data organization inside a partition using a rooted tree in a distributed data storage system, according to some embodiments of the present disclosure.
FIG. 3 is a flow diagram of an exemplary method for light-weight split key estimation in a distributed data storage system, according to some embodiments of the present disclosure.
FIG. 4 is a flow diagram of an exemplary method for light-weight split key estimation incorporating an estimation threshold in a distributed data storage system, according to some embodiments of the present disclosure.
FIG. 5 is a flow diagram of an exemplary method for light-weight split key estimation incorporating a start offset and an end offset in a distributed data storage system, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the disclosure. Instead, they  are merely examples of apparatuses and methods consistent with aspects related to the disclosure as recited in the appended claims.
In distributed data storage systems, a data table is divided into many partitions. A partition often includes rows whose keys are within a range. For example, in a table where the keys are integers, a partition may contain rows with keys in [4096, 8191]. Several partitions are assigned to a server, from which read and write requests are served. Partitions may be re-assigned to other servers for load balancing or recovery. Keeping partition sizes under a certain threshold is important for partition re-assignment to work efficiently.
When the size of a partition grows, there is a need to split large partitions into smaller partitions in order to keep partition sizes manageable. Partitions having large sizes slow down the re-assignment process, which leads to slow recovery, difficulty in load balancing, and potential risks of running out of disk space for the underlying server.
When a partition is split into multiple smaller partitions, the original big partition is referred to as a “parent partition, ” and the resulting smaller partitions are referred to as “child partitions. ” Some systems divide one parent partition into two child partitions with a single “split key” that divides the parent’s key range into two smaller ranges. For example, a split key 1000 divides a range [0, 1999] into two smaller ranges [0, 999] and [1000, 1999] . Other systems use n split keys to split a parent partition into n+1 child partitions.
When a parent partition is split, it is desirable that the child partitions are approximately the same size. Some systems use the simple arithmetic center of the parent partition’s range as the split key. For example, for a parent partition with a range of [0, 1999], some conventional systems choose 1000 as the split key. Unfortunately, because the distribution of the keys is unknown, the arithmetic center of the parent partition’s range often cannot produce an even split into child partitions of approximately the same size. Referring to the previous example where 1000 is used as the split key to split the parent partition with the range [0, 1999], if most of the keys are greater than 1000, the child partitions end up with highly uneven sizes. Other conventional systems try to scan all the data in the parent partition to find split keys for even splitting. This approach, however, ignores the fact that when a system decides to split a parent partition, the system has already determined that the parent partition is experiencing heavy traffic and needs to be split immediately to alleviate that traffic. Scanning all the data in the parent partition is therefore highly inefficient and creates large overhead in the system.
Another problem with conventional systems is that a single mid-key is only good for splitting a partition into two. When the data in the parent partition is growing at a fast pace, the system desires to split the parent partition into more than two child partitions.
Embodiments of the present disclosure mitigate these problems using a new light-weight method and system to estimate split keys in order to produce even splits. FIG. 1A is a schematic diagram illustrating an exemplary distributed data storage system comprising servers having partitions, according to some embodiments of the present disclosure. According to FIG. 1A, exemplary distributed data storage system 100 comprises a plurality of servers 1-K, with each server having 10 different partitions. For example, server 1 comprises partitions A1-A10, server 2 comprises partitions B1-B10, and server K comprises partitions K1-K10. Distributed data storage system 100 allows each partition to be re-assigned to a different server. It is appreciated that one or more of these servers can incorporate fast partition splitting.
FIG. 1B is a schematic diagram illustrating an exemplary server 110 of a distributed data storage system, according to some embodiments of the present disclosure.  According to FIG. 1B, server 110 comprises a bus 112 or other communication mechanism for communicating information, and one or more processors 116 communicatively coupled with bus 112 for processing information. Processors 116 can be, for example, one or more microprocessors.
Server 110 further comprises storage devices 114, which may include random access memory (RAM) , read only memory (ROM) , and data storage systems comprised of partitions. Storage devices 114 can be communicatively coupled with processors 116 via bus 112. Storage devices 114 may include a main memory, which can be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processors 116. Such instructions, after being stored in non-transitory storage media accessible to processors 116, render server 110 into a special-purpose machine that is customized to perform operations specified in the instructions.
Server 110 can transmit data to or communicate with another server 130 through a network 122. Network 122 can be a local network, an internet service provider, the internet, or any combination thereof. Communication interface 118 of server 110 is connected to network 122. In addition, server 110 can be coupled via bus 112 to peripheral devices 140, which comprise displays (e.g., cathode ray tube (CRT), liquid crystal display (LCD), touch screen, etc.) and input devices (e.g., keyboard, mouse, soft keypad, etc.).
Server 110 can be implemented using customized hard-wired logic, one or more ASICs or FPGAs, firmware, or program logic that in combination with the server causes server 110 to be a special-purpose machine.
The term “non-transitory media” as used herein refers to any non-transitory media storing data or instructions that cause a machine to operate in a specific fashion. Such non-transitory media can comprise non-volatile media and/or volatile media. Non-transitory media include, for example, optical or magnetic disks, dynamic memory, a floppy disk, a flexible disk, a hard disk, a solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, flash memory, a register, a cache, any other memory chip or cartridge, and networked versions of the same.
Various forms of media can be involved in carrying one or more sequences of one or more instructions to processors 116 for execution. For example, the instructions can initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to server 110 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 112. Bus 112 carries the data to the main memory within storage devices 114, from which processors 116 retrieve and execute the instructions.
Inside each partition, data is organized in a rooted tree structure comprising multiple runs sorted by row keys. FIG. 2 is a schematic diagram of an exemplary data organization using a rooted tree in a distributed data storage system, according to some embodiments of the present disclosure. According to FIG. 2, the rooted tree comprises three levels: a root level, an index level, and a data level, although any number of levels may be used.
The root level and the index level contain only index blocks, which are colored in white. The root level comprises index block 1100, which comprises index entries with entry keys 11, 150, 276 and 720. The pointer in an index entry can either point to a data block in the data level or to another index block in a lower index level of the rooted tree structure. The entry key in an index entry is the first key of the block the entry points to. For example, the index entry with entry key 11 in index block 1100 points to index block 2100, whose first entry key is 11.
The index level comprises index blocks 2100, 2200, 2300 and 2400. Each index block comprises one or more index entries, and each index entry has an entry key and a pointer. For example, index block 2100 comprises index entries with keys 11, 20, and 46. There can be multiple index levels when files are large. As at the root level, the pointer in an index entry can either point to a data block in the data level or to another index block in a lower index level, and the entry key is the first key of the block the entry points to. For example, the index entry with entry key 276 in index block 2300 points to data block 3400, whose first entry key is 276. The data level contains only data blocks, which are colored in grey. The data level comprises data blocks 3100, 3200, 3300, 3400, 3500 and 3600. Each data block comprises row-key pairings, which are rows of data labeled by keys.
In some embodiments, other than index block 1100 at the root level, each index block has the same size except the index block with the largest entry key at each level. In other words, if N is an integer and there are N index blocks on the same level, index blocks 1 through N-1 have the same number of index entries. For example, index blocks 2100, 2200, and 2300 have the same or substantially the same size, and index block 2400 (which has the largest entry keys at the index level) may have a smaller size compared to other index blocks. In some embodiments, if the distributed data storage system has multiple index levels, all index blocks across all levels except those with the largest entry key at each level have the same size. In some embodiments, the number of index entries in each index block ranges from hundreds to thousands. In some embodiments, the amount of data in each data block can range from 64 KB to 128 KB.
Similarly, all data blocks have the same size except the data block with the largest entry key. In other words, if X is an integer and there are X data blocks, data blocks 1 through X-1 have the same number of row-key pairings. For example, data blocks 3100, 3200, 3300, 3400 and 3500 have the same size, and data block 3600 may have a smaller size compared to other data blocks. Each data entry within a data block points to a row of data having a row key that is the same as the entry key of the data entry. For example, the data entry with entry key 11 in data block 3100 points to a row of data whose row key is 11.
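For concreteness, the block layout described above can be sketched in a few lines of code. The following Python fragment is a minimal illustration of the rooted tree of FIG. 2 and not the patent's actual implementation; the class names, field names, and the use of integer keys are assumptions made for this example.
```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class DataBlock:
    # Row-key pairings: each entry key labels one row of data.
    keys: List[int]
    rows: List[bytes] = field(default_factory=list)

@dataclass
class IndexBlock:
    # Each index entry pairs an entry key with a pointer; keys[i] is the
    # first key of the child block that children[i] points to.
    keys: List[int]
    children: List[Union["IndexBlock", DataBlock]] = field(default_factory=list)

# Root-level block 1100 of FIG. 2: entry keys 11, 150, 276 and 720
# point to index blocks 2100, 2200, 2300 and 2400 one level below.
block_2100 = IndexBlock(keys=[11, 20, 46])          # first entry key is 11
root_1100 = IndexBlock(keys=[11, 150, 276, 720],
                       children=[block_2100])       # remaining children omitted
```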
Embodiments of the present disclosure provide a light-weight method for split-key estimation. FIG. 3 is a flow diagram of an exemplary method for light-weight split key estimation in a distributed data storage system, according to some embodiments of the present disclosure. According to FIG. 3, method 10000 comprises steps 10010, 10020, 10030, 10040, 10050 and 10060. It is appreciated that method 10000 can be performed by a distributed data storage system (e.g., distributed data storage system 100 of FIG. 1A) or by one or more servers (e.g., exemplary server 110 of FIG. 1B).
In step 10010, the distributed data storage system obtains a start key and an end key. The start key and the end key form a key range in which method 10000 determines an estimated mid key as the split key for partition splitting. For example, as shown in FIG. 2, the distributed data storage system obtains a start key 11 and an end key 405.
In step 10020, the distributed data storage system searches the root level of the rooted tree structure for the index entries associated with the start key and the end key. For example, as shown in FIG. 2, the distributed data storage system searches root-level index block 1100 and finds the index entry with entry key 11, which is associated with start key 11. Since block 1100 does not contain an index entry with entry key 405, the distributed data storage system finds the closest entry key that is smaller than the end key 405, which is 276.
In step 10030, the distributed data storage system determines a start block according to the start key. The start block has a key range in which the start key is contained. The key range of a block runs from the first entry key in the block to just before the first entry key of the next block at the same level. For example, as shown in FIG. 2, if the index block after index block 2100 has a first entry key of 51, index block 2100 has a key range of 11 to 50. The distributed data storage system determines that index block 2100 is the start block because index block 2100 covers a key range of 11 to 50, which contains the start key 11.
In step 10040, the distributed data storage system determines an end block according to the end key. The end block has a key range in which the end key is contained. For example, as shown in FIG. 2, if the index block after index block 2300 has a first index entry 541, index block 2300 has a key range of 276 to 540. The distributed data storage system determines that index block 2300 is the end block because index block 2300 has a key range of 276 to 540, which covers the end key 405.
In step 10050, the distributed data storage system determines a middle block according to the start block and the end block. In some embodiments, the middle block is the block located at the center between the start block and the end block. In a scenario where there is an even number of blocks between the start block and the end block, the middle block is the block to the immediate left of the center (although, depending on the configuration, the block to the immediate right of the center can also be selected as the middle block). For example, as shown in FIG. 2, there are five index blocks between start block 2100 and end block 2300. The index block located in the center is the third block, which is index block 2200. The distributed data storage system determines index block 2200 as the middle block.
In step 10060, the distributed data storage system determines a split key from the middle block. In some embodiments, the split key is the entry key of the index entry located at the center of the middle block. For example, as shown in FIG. 2, index block 2200 is the middle block and it has 13 index entries. The split key is the entry key of the index entry at the 7th location, which is 171.
In some embodiments of the present disclosure, all index blocks except the index block at the root level have the same number of index entries. When method 10000 determines the block located at the center between the start block and the end block, method 10000 takes advantage of the fact that all index blocks are of the same or substantially the same size. Therefore, the block located at the center is generally a good estimate of where the split key should be located to produce an even split, resulting in partitions having similar amounts of data, and the entry key located at the center of the middle block is generally a good estimate of the split key itself. If the distributed data storage system is not satisfied with the accuracy of the current estimate, it may obtain a more accurate result by searching one level below. In addition to being accurate, the method estimates a split key without having to traverse all the row keys in a partition, thereby improving the efficiency of partition splitting.
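To make steps 10010 through 10060 concrete, a rough sketch of the estimation is given below, building on the block classes sketched earlier. It illustrates the general idea rather than any particular embodiment; the helper name find_block, the use of binary search, and the left-leaning tie-breaking are assumptions of this example.
```python
import bisect

def find_block(level_blocks, key):
    """Return the position of the block at this level whose key range
    contains `key`; a block's range runs from its first entry key up to
    the first entry key of the next block at the same level."""
    first_keys = [blk.keys[0] for blk in level_blocks]
    # The last block whose first entry key is <= key (steps 10020-10040).
    return max(bisect.bisect_right(first_keys, key) - 1, 0)

def estimate_split_key(level_blocks, start_key, end_key):
    start_pos = find_block(level_blocks, start_key)     # step 10030
    end_pos = find_block(level_blocks, end_key)         # step 10040
    # Step 10050: the block at (or immediately left of) the center.
    middle = level_blocks[(start_pos + end_pos) // 2]
    # Step 10060: the entry key at the center of the middle block,
    # e.g. the 7th of 13 entries.
    return middle.keys[len(middle.keys) // 2]
```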
In some embodiments of the present disclosure, the distributed data storage system can achieve an M-way split where M is a power of 2. For example, if the distributed data storage system obtains a start key 11 and an end key 450 and determines that 171 is the split key using method 10000, the distributed data storage system may initiate additional estimation  processes for two more split keys where the search ranges are [11, 171] and [171, 450] . In the first estimation process, the distributed data storage system obtains a start key 11 and an end key 171. In the second estimation process, the distributed data storage system obtains a start key 171 and an end key 450. Having determined the split keys for both searches using method 10000, the distributed data storage system can perform a 4-way split based on the three split keys, including split key 171.
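Under the same assumptions, the recursive M-way splitting just described might be sketched as follows; the function names are again hypothetical.
```python
def estimate_split_keys(level_blocks, start_key, end_key, m):
    """Estimate m - 1 split keys for an m-way split (m a power of 2)
    by re-running the single-key estimation on each half of the range."""
    if m < 2:
        return []
    mid = estimate_split_key(level_blocks, start_key, end_key)
    return (estimate_split_keys(level_blocks, start_key, mid, m // 2)
            + [mid]
            + estimate_split_keys(level_blocks, mid, end_key, m // 2))

# A 4-way split of [11, 450] first estimates split key 171, then one
# more split key in [11, 171] and one in [171, 450].
```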
Some embodiments of the present disclosure incorporate an estimation threshold to determine whether the accuracy offered by the current level being searched is satisfactory. FIG. 4 is a flow diagram of an exemplary method for light-weight split key estimation incorporating an estimation threshold in a distributed data storage system, according to some embodiments of the present disclosure. On the basis of FIG. 3, the method for split key estimation in FIG. 4 further comprises additional steps 10015, 10051, 10052 and 10053. It is appreciated that method 10000 can be performed by a distributed data storage system (e.g., distributed data storage system 100 of FIG. 1A) or by one or more servers (e.g., exemplary server 110 of FIG. 1B).
Step 10015 is performed after step 10010, in which the start and end keys of a rooted tree structure are obtained. In step 10015, the distributed data storage system obtains an estimation threshold. The estimation threshold is the minimum number of blocks that must lie between the start block and the end block before the system accepts the estimate at the current level and stops the estimation process. The estimation threshold may be a preset value, or it may be determined by the system or a user. For example, in the distributed data storage system shown in FIG. 2, the distributed data storage system may obtain an estimation threshold of 10.
Step 10051 is performed after step 10050, in which a middle block is determined. In step 10051, the distributed data storage system determines whether the estimation process has reached a data level. If the distributed data storage system determines that the current level being searched is the data level, the distributed data storage system executes step 10060 and determines a split key in the middle block. If the distributed data storage system determines that the current level being searched is not the data level, the distributed data storage system executes step 10052. For example, as shown in FIG. 2, when the distributed data storage system reaches the index level and determines that the middle block is 2200, the distributed data storage system determines that the current level being searched is not a data level. As a result, the distributed data storage system moves on to step 10052.
In step 10052, the distributed data storage system determines whether the number of blocks between the start block and the end block is larger than the estimation threshold. If it is larger, the distributed data storage system is satisfied with the level of accuracy in split key estimation provided by the current level. The distributed data storage system then executes step 10060 and determines a split key in the middle block. If the number of blocks is less than or equal to the estimation threshold, the distributed data storage system executes step 10053. For example, as shown in FIG. 2, there are five index blocks between the start block 2100 and the end block 2300. If the obtained estimation threshold is 10, the number of blocks in between is smaller than the estimation threshold, and the distributed data storage system moves on to step 10053. If the obtained estimation threshold is 3, the number of blocks in between is larger than the estimation threshold, and the distributed data storage system moves on to step 10060.
In step 10053, the distributed data storage system searches the next lower level in the rooted tree structure. The distributed data storage system then moves on to step 10030 and determines a start block, an end block and a middle block in the new level. For example, as shown in FIG. 2, when the distributed data storage system determines that the number of blocks  between the start block 2100 and the end block 2300 is smaller than the estimation threshold, the distributed data storage system searches the level below, which is the data level. The distributed data storage system then moves on to step 10030 and determines a start block, an end block and a middle block in the data level. In the example shown in FIG. 2, the distributed data storage system determines that data block 3100 is the new start block for start key 11, and data block 3500 is the new end block for end key 405. If there are 50 data blocks between the start block 3100 and the end block 3500, the middle block is the data block located in the center between the start block and the end block, which is the 25th data block.
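A sketch of this threshold-controlled descent, reusing the helpers above, might look as follows; treating each level as homogeneous and counting the blocks between the start block and the end block inclusively (as the five-block FIG. 2 example suggests) are assumptions of this illustration.
```python
def estimate_with_threshold(root, start_key, end_key, threshold):
    level = root.children  # blocks one level below the root-level block
    while True:
        start_pos = find_block(level, start_key)
        end_pos = find_block(level, end_key)
        middle = level[(start_pos + end_pos) // 2]        # step 10050
        blocks_between = end_pos - start_pos + 1          # inclusive count
        # Step 10051: stop at the data level; step 10052: stop once the
        # block count exceeds the estimation threshold.
        if isinstance(middle, DataBlock) or blocks_between > threshold:
            return middle.keys[len(middle.keys) // 2]     # step 10060
        # Step 10053: gather the blocks one level down and repeat.
        level = [child for blk in level[start_pos:end_pos + 1]
                 for child in blk.children]
```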
The use of the estimation threshold allows the distributed data storage system to make decisions on the tradeoff between accuracy and efficiency in the estimation process. If accuracy is more critical to the system, the estimation threshold can be set to a large value, and the distributed data storage system would search multiple levels below the root level. If efficiency is more critical to the system, the estimation threshold can be set to a smaller value, and the distributed data storage system would search only a few levels. The use of the estimation threshold thus provides the distributed data storage system with more flexibility in the estimation process.
Some embodiments of the present disclosure incorporate a start offset and an end offset to achieve a more accurate estimation. FIG. 5 is a flow diagram of an exemplary method for light-weight split key estimation incorporating a start offset and an end offset in a distributed data storage system, according to some embodiments of the present disclosure. On the basis of FIG. 3 and FIG. 4, method 10000 for split key estimation in FIG. 5 further comprises steps 10031, 10041 and 10060. It is appreciated that method 10000 can be performed by a distributed data storage system (e.g., distributed data storage system 100 of FIG. 1A) or by one or more servers (e.g., exemplary server 110 of FIG. 1B).
Step 10031 is performed after step 10030, in which the start block is determined based on the start key. In step 10031, the distributed data storage system determines a start offset according to the start block and the start key. In some embodiments, the start offset is the location of the index entry whose entry key is closest to the start key in the start block. For example, as shown in FIG. 2, the start key is 11, and the index entry with the entry key closest to the start key in the start block 2100 is 11, which is the first index entry in the start block. In this example, the location of entry key 11 in the start block 2100 is 1 (i.e., the first index entry), and hence the start offset is 1.
Step 10041 is performed after step 10040, in which the end block is determined based on the end key. In step 10041, the distributed data storage system determines an end offset according to the end block and the end key. In some embodiments, the end offset is the location of the index entry whose entry key is closest to the end key in the end block. For example, as shown in FIG. 2, the end key is 405, and the index entry with the entry key closest to the end key in the end block 2300 is 405, which is the second index entry in the end block. In this example, the location of entry key 405 in the end block 2300 is 2 (i.e., the second index entry), and hence the end offset is 2. In another example, if the end key is 521 and the end block is 2300, the index entry with the entry key closest to the end key in the end block 2300 is 520. If an index block holds 13 indices, index entry 520 is the 13th index entry in the end block 2300, and the end offset is therefore 13.
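Steps 10031 and 10041 are symmetric, so a single helper can illustrate both. The following Python sketch is illustrative only; it assumes a block is available as a sorted list of numeric entry keys, and the name key_offset is invented for this example.

```python
import bisect

def key_offset(block_keys, key):
    """One-based position of the index entry whose entry key is closest to
    `key` (steps 10031 and 10041). `block_keys` is the sorted list of the
    numeric entry keys of one block, an assumption made for this sketch."""
    pos = bisect.bisect_left(block_keys, key)
    if pos == len(block_keys):          # key lies beyond the last entry,
        return len(block_keys)          # e.g., end key 521 against entry 520
    if pos > 0 and key - block_keys[pos - 1] <= block_keys[pos] - key:
        pos -= 1                        # the previous entry is at least as close
    return pos + 1                      # offsets are one-based
```

Under these assumptions, a start block whose first entry key is 11 yields key_offset(block, 11) == 1, and an end block whose second entry key is 405 yields key_offset(block, 405) == 2, matching the offsets in the examples above.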
In step 10060, the distributed data storage system determines a split key in the middle block according to the start offset and the end offset. In some embodiments, the split key is the index in the middle block whose location is the average of the start offset and the end offset. For example, as shown in FIG. 2, the start offset is 1 and the end offset is 2. The average of the start offset and the end offset is 1.5, which is rounded to the integer 2. Using the integer 2, the split key can be determined as the second index within the middle block, which in this case is index 157 of middle block 2200. Therefore, the split key is determined to be 157. In another example, the start offset is 1 and the end offset is 13. The average of the start offset and the end offset is 7. The 7th index entry in the middle block 2200 is 171. Therefore, the split key is determined to be 171.
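Step 10060 then reduces to integer arithmetic on the two offsets. The sketch below is illustrative only; the name split_key_from_offsets is invented, and half values are rounded up so that the average of offsets 1 and 2 becomes 2, as in the example above.

```python
def split_key_from_offsets(middle_block_keys, start_offset, end_offset):
    """Entry of the middle block located at the rounded average of the
    start and end offsets (step 10060); offsets are one-based."""
    avg = (start_offset + end_offset + 1) // 2            # rounds 1.5 up to 2
    avg = min(max(avg, 1), len(middle_block_keys))        # stay inside the block
    return middle_block_keys[avg - 1]

# With offsets 1 and 2 this selects the second entry of the middle block
# (key 157 in the example); with offsets 1 and 13 it selects the 7th entry
# (key 171 in the example).
```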
By implementing the start offset and the end offset, the distributed data storage system is able to pinpoint the location of the split key in the middle block more precisely, thereby improving the overall accuracy of the method for split key estimation.
The various example embodiments described herein are described in the general context of method steps or processes, which may be implemented in one aspect by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable and nonremovable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVDs), etc. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
In the foregoing specification, embodiments have been described with reference to numerous specific details that can vary from implementation to implementation. Certain adaptations and modifications of the described embodiments can be made. Other embodiments can be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims. It is also intended that the sequences of steps shown in the figures are for illustrative purposes only and are not limited to any particular order. As such, those skilled in the art can appreciate that these steps can be performed in a different order while implementing the same method.
In the drawings and specification, there have been disclosed exemplary embodiments. Many variations and modifications, however, can be made to these embodiments. Accordingly, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation, the scope of the embodiments being defined by the following claims.

Claims (26)

  1. A method comprising:
    obtaining a first key and a second key based on a request to split a partition represented by a rooted tree structure;
    determining a start block of the rooted tree structure based on a search of the rooted tree structure using the obtained first key;
    determining an end block of the rooted tree structure based on a search of the rooted tree structure using the obtained second key;
    determining a middle block of the rooted tree structure based on the start block and the end block; and
    determining a split key from the middle block.
  2. The method according to claim 1, wherein the rooted tree structure comprises a root-level index block, a set of intermediate-level index blocks, and a set of data-level blocks having row-key pairings that correspond to indexes provided in the set of intermediate-level index blocks.
  3. The method according to claim 2, wherein the set of intermediate-level index blocks comprises N-number of index blocks and index blocks 1 through N-1 have the same number of indexes, wherein
    N is an integer greater than 0.
  4. The method according to claim 2 or 3, wherein the set of data-level blocks comprises X-number of data blocks and data blocks 1 through X-1 have the same number of row-key pairings, wherein
    X is an integer greater than 0.
  5. The method according to any one of claims 2-4, further comprising:
    obtaining an estimation threshold;
    determining the middle block in response to the estimation threshold being smaller than a number of blocks between the start block and the end block; and
    searching a next lower level of the rooted tree structure in response to the estimation threshold being larger than the number of blocks between the start block and the end block.
  6. The method of claim 5, wherein the next lower level of the rooted tree structure comprises another set of intermediate-level index blocks that are lower in the rooted tree structure than the set of intermediate-level index blocks, wherein the set of data-level blocks has row-key pairings that correspond to indexes provided in the other set of intermediate-level index blocks.
  7. The method according to any one of claims 1-6, further comprising:
    determining a start offset according to the first key and the start block; and
    determining an end offset according to the second key and the end block.
  8. The method according to claim 7, wherein determining the start offset according to the first key and the start block further comprises finding a location of an index that is closest to the first key in the start block.
  9. The method according to claim 7 or 8, wherein determining the end offset according to the second key and the end block further comprises finding a location of an index that is closest to the second key in the end block.
  10. The method according to any one of claims 7-9, wherein determining the split key from the middle block further comprises finding the index in the middle block whose location in the middle block is based on an average of the start offset and the end offset.
  11. The method according to any one of claims 1-10, wherein determining the middle block of the rooted tree structure based on the start block and the end block further comprises determining the middle block as the block located at a center between the start block and the end block.
  12. The method according to any one of claims 1-11, further comprising splitting the partition based on the determined split key.
  13. A distributed data storage system, comprising:
    a plurality of partitions configured to be organized in the form of a rooted tree structure; and
    a processor configured to:
    obtain a first key and a second key based on a request to split a partition;
    determine a start block of the rooted tree structure based on a search of the rooted tree structure using the obtained first key;
    determine an end block of the rooted tree structure based on a search of the rooted tree structure using the obtained second key;
    determine a middle block of the rooted tree structure based on the start block and the end block; and
    determine a split key from the middle block.
  14. The distributed data storage system according to claim 13, wherein the rooted tree structure comprises a root-level index block, a set of intermediate-level index blocks, and a set of data-level blocks having row-key pairings that correspond to indexes provided in the set of intermediate-level index blocks.
  15. The distributed data storage system according to claim 14, wherein the set of intermediate-level index blocks comprises N-number of index blocks and index blocks 1 through N-1 have the same number of indexes, wherein:
    N is an integer greater than 0.
  16. The distributed data storage system according to claim 14 or 15, wherein the set of data-level blocks comprises X-number of data blocks and data blocks 1 through X-1 have the same number of row-key pairings, wherein
    X is an integer greater than 0.
  17. The distributed data storage system according to any one of claims 14-16, wherein the processor is further configured to:
    obtain an estimation threshold;
    determine the middle block in response to the estimation threshold being smaller than a number of blocks between the start block and the end block; and
    search a next lower level of the rooted tree structure in response to the estimation threshold being larger than the number of blocks between the start block and the end block.
  18. The distributed data storage system according to claim 17, wherein the next lower level of the rooted tree structure comprises another set of intermediate-level index blocks that are lower in the rooted tree structure than the set of intermediate-level index blocks, wherein the set of data-level blocks has row-key pairings that correspond to indexes provided in the other set of intermediate-level index blocks.
  19. The distributed data storage system according to claim 17 or 18, wherein the estimation threshold is equal to 0.
  20. The distributed data storage system according to any one of claims 13-19, wherein the processor is further configured to:
    determine a start offset according to the first key and the start block; and
    determine an end offset according to the second key and the end block.
  21. The distributed data storage system according to claim 20, wherein the processor is further configured to determine the start offset by finding a location of an index that is closest to the first key in the start block.
  22. The distributed data storage system according to claim 20 or 21, wherein the processor is further configured to determine the end offset by finding a location of an index that is closest to the second key in the end block.
  23. The distributed data storage system according to any one of claims 20-22, wherein the processor is further configured to determine the split key from the middle block by finding the index in the middle block whose location in the middle block is based on an integer closest to an average of the start offset and the end offset.
  24. The distributed data storage system according to any one of claims 13-23, wherein the processor is further configured to determine the middle block as the block located at a center between the start block and the end block.
  25. The distributed data storage system according to any one of claims 13-24, wherein the processor is configured to split the partition based on the determined split key.
  26. A non-transitory computer-readable medium that stores a set of instructions that are executable by one or more processors of an apparatus to perform a method comprising:
    obtaining a first key and a second key based on a request to split a partition represented by a rooted tree structure;
    determining a start block of the rooted tree structure based on a search of the rooted tree structure using the obtained first key;
    determining an end block of the rooted tree structure based on a search of the rooted tree structure using the obtained second key;
    determining a middle block of the rooted tree structure based on the start block and the end block; and
    determining a split key from the middle block.