WO2021142643A1 - Fast partition splitting solution in distributed data storage systems - Google Patents

Info

Publication number
WO2021142643A1
Authority
WO
WIPO (PCT)
Prior art keywords
partition
child
splitting
file
data
Prior art date
Application number
PCT/CN2020/072149
Inventor
Chen Fu
Chunhui SHEN
Wenlong Yang
Original Assignee
Alibaba Group Holding Limited
Application filed by Alibaba Group Holding Limited
Priority to PCT/CN2020/072149 (WO2021142643A1)
Priority to CN202080083354.0A (CN114761913A)
Publication of WO2021142643A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems

Definitions

  • Embodiments of the present disclosure provide a distributed data storage system incorporating a fast partition splitting method.
  • The distributed data storage system comprises a parent partition, a plurality of intermediate partitions, and a plurality of child partitions.
  • The parent partition is split into the plurality of intermediate partitions, including a first intermediate partition. While data copying from the parent partition to the intermediate partitions is taking place, the first intermediate partition is split into the plurality of child partitions.
  • Each of the intermediate partitions comprises a reference file pointing to a data file in the parent partition.
  • Each of the child partitions comprises a reference file pointing to the first intermediate partition’s reference file.
  • The distributed data storage system of the present disclosure also incorporates splitting reference counters to keep track of when a partition may be deleted from the system.
  • Embodiments of the present disclosure provide a method for fast partition splitting.
  • The method comprises splitting a parent partition into a plurality of intermediate partitions and splitting a first intermediate partition of the plurality of intermediate partitions into a plurality of child partitions while the first intermediate partition copies data from the parent partition’s data file to a data file of the first intermediate partition.
  • Splitting the parent partition into intermediate partitions comprises initiating data copying from the parent partition’s data file to the data file of the first intermediate partition and establishing a pointer between the parent partition’s data file and a reference file of the first intermediate partition.
  • Splitting the first intermediate partition into a plurality of child partitions comprises initiating data copying from at least one of the parent partition’s data file or the first intermediate partition’s data file into a data file of a child partition of the plurality of child partitions, establishing a first pointer between a first child reference file and the data file of the first intermediate partition, and establishing a second pointer between a second child reference file and the reference file of the first intermediate partition.
  • Embodiments of the present disclosure also provide a non-transitory computer readable medium that stores instructions that are executable by one or more processors to perform a method for fast partition splitting, the method comprising: splitting a parent partition into a plurality of intermediate partitions and splitting a first intermediate partition of the plurality of intermediate partitions into a plurality of child partitions while the first intermediate partition copies data from the parent partition’s data file to a data file of the first intermediate partition.
  • Splitting the parent partition into intermediate partitions comprises initiating data copying from the parent partition’s data file to the data file of the first intermediate partition and establishing a pointer between the parent partition’s data file and a reference file of the first intermediate partition.
  • Splitting the first intermediate partition into a plurality of child partitions comprises initiating data copying from at least one of the parent partition’s data file or the first intermediate partition’s data file into a data file of a child partition of the plurality of child partitions, establishing a first pointer between a first child reference file and the data file of the first intermediate partition, and establishing a second pointer between a second child reference file and the reference file of the first intermediate partition.
  • FIG. 1A is a schematic diagram illustrating an exemplary distributed data storage system comprising servers having partitions, according to some embodiments of the present disclosure.
  • FIG. 1B is a schematic diagram illustrating an exemplary server of a distributed data storage system, according to some embodiments of the present disclosure.
  • FIG. 2 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting, according to some embodiments of the present disclosure.
  • FIG. 3 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting and splitting reference counters, according to some embodiments of the present disclosure.
  • FIG. 4 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting on multiple data files in the parent partition, according to some embodiments of the present disclosure.
  • FIG. 5 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting involving a 3-way split, according to some embodiments of the present disclosure.
  • FIG. 6 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting involving multiple levels of intermediate partitions, according to some embodiments of the present disclosure.
  • FIG. 7 is a flow diagram of an exemplary method for fast partition splitting in a distributed data storage system, according to some embodiments of the present disclosure.
  • FIG. 8 is a flow diagram of an exemplary method for fast partition splitting incorporating a splitting reference counter in a distributed data storage system, according to some embodiments of the present disclosure.
  • A data table is usually divided into many partitions.
  • When the size of a partition grows, there is a need to split large partitions into smaller partitions in order to keep partition sizes manageable. Larger partitions slow down the re-assignment process, which leads to slow recovery, difficulty in load balancing, and a risk that the underlying server runs out of disk space.
  • some conventional systems attempt to optimize the partition splitting by allowing read and write requests to the child partitions while the data copying is underway.
  • a reference file is created in the child partition.
  • the reference file comprises a pointer to the parent partition’s data file and a partition key range for the child partition.
  • a read operation performed on the child partition is translated to a read operation on the parent partition’s data file via the reference file, and the partition key range in the reference file allows the read operation to operate within the child partition’s key range.
  • the distributed data storage system no longer needs to wait for the data copying to finish before performing the read operation.
  • a write operation can be directly served by the child partition.
  • FIG. 1A is a schematic diagram illustrating an exemplary distributed data storage system comprising servers having partitions, according to some embodiments of the present disclosure.
  • Exemplary distributed data storage system 100 comprises a plurality of servers 1-N, with each server having 10 different partitions. Server 1 comprises partitions A1-A10, server 2 comprises partitions B1-B10, and server N comprises partitions N1-N10.
  • Distributed data storage system 100 allows each partition to be re-assigned to a different server. It is appreciated that one or more of these servers can incorporate fast partition splitting.
  • FIG. 1B is a schematic diagram illustrating an exemplary server 110 of a distributed data storage system, according to some embodiments of the present disclosure.
  • server 110 comprises a bus 112 or other communication mechanism for communicating information, and one or more processors 116 communicatively coupled with bus 112 for processing information.
  • processors 116 can be, for example, one or more microprocessors.
  • Server 110 further comprises storage devices 114, which may include random access memory (RAM), read only memory (ROM), and data storage systems comprised of partitions.
  • Storage devices 114 can be communicatively coupled with processors 116 via bus 112.
  • Storage devices 114 may include a main memory, which can be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processors 116. Such instructions, after being stored in non-transitory storage media accessible to processors 116, render server 110 into a special-purpose machine that is customized to perform operations specified in the instructions.
  • Server 110 can transmit data to or communicate with another server 130 through a network 122.
  • Network 122 can be a local network, an internet service provider, internet, or any combination thereof.
  • Communication interface 118 of server 110 is connected to network 122.
  • Server 110 can be coupled via bus 112 to peripheral devices 140, which comprise displays (e.g., cathode ray tube (CRT), liquid crystal display (LCD), touch screen, etc.) and input devices (e.g., keyboard, mouse, soft keypad, etc.).
  • Server 110 can be implemented using customized hard-wired logic, one or more ASICs or FPGAs, firmware, or program logic that in combination with the server causes server 110 to be a special-purpose machine.
  • non-transitory media refers to any non-transitory media storing data or instructions that cause a machine to operate in a specific fashion. Such non-transitory media can comprise non-volatile media and/or volatile media.
  • Non-transitory media include, for example, optical or magnetic disks, dynamic memory, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, flash memory, register, cache, any other memory chip or cartridge, and networked versions of the same.
  • Various forms of media can be involved in carrying one or more sequences of one or more instructions to processors 116 for execution.
  • The instructions can initially be carried on a magnetic disk or solid state drive of a remote computer.
  • the remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem.
  • a modem local to server 110 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal.
  • An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 112.
  • Bus 112 carries the data to the main memory within storage devices 114, from which processors 116 retrieve and execute the instructions.
  • FIG. 2 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting, according to some embodiments of the present disclosure.
  • Parent partition O comprises a data file O1.data.
  • Intermediate partition A and intermediate partition B are created.
  • Parent partition O is shut down for serving reading and writing requests, and the distributed data storage system initiates data copying from partition O into intermediate partitions A and B.
  • Data copying is performed using a compaction (e.g., major compaction).
  • In the compaction, all the data files of a target partition (e.g., intermediate partition A and intermediate partition B) are merged into one big data file.
  • The big data file is placed in a temporary directory during the compaction.
  • After the merging finishes and the big data file is validated, the big data file is passed over to the partition in a single atomic operation, and the outdated data files and reference files are deleted.
  • The compaction simplifies error handling by preventing the partition from having access to partially finished data files.
  • The compaction improves reading performance on the partition by reducing the number of data files in the partition.
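The compaction flow described above (merge, stage in a temporary directory, validate, hand over atomically, then delete the outdated files) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `compact`, the file name `merged.data`, and the JSON-object file format are all hypothetical, and data reachable only through reference files is left out for brevity.

```python
import json
import os
import shutil
import tempfile


def compact(partition_dir, data_files, merged_name="merged.data"):
    """Merge the listed data files (oldest first) into one big data file.

    Assumption for illustration: each data file is a JSON object that maps
    keys to values, and later files hold newer writes.
    """
    merged = {}
    for name in data_files:
        with open(os.path.join(partition_dir, name)) as fh:
            merged.update(json.load(fh))  # newer files overwrite older keys

    # Stage the big file in a temporary directory next to the partition, so
    # the partition never sees a partially written file.
    parent_dir = os.path.dirname(os.path.abspath(partition_dir))
    tmp_dir = tempfile.mkdtemp(dir=parent_dir)
    try:
        tmp_path = os.path.join(tmp_dir, merged_name)
        with open(tmp_path, "w") as fh:
            json.dump(merged, fh)
        # Validate the staged file before handing it over.
        with open(tmp_path) as fh:
            if json.load(fh) != merged:
                raise IOError("merged data file failed validation")
        # os.replace is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, os.path.join(partition_dir, merged_name))
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)

    # Only after the handover are the outdated files deleted.
    for name in data_files:
        if name != merged_name:
            os.remove(os.path.join(partition_dir, name))
    return merged
```

Staging the file beside the partition rather than in a system-wide temporary location keeps the final rename on one filesystem, which is what makes the handover a single atomic step in this sketch.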
  • Intermediate partitions A and B can create data files A1.data and B1.data, respectively, to store data newly written into intermediate partitions A and B.
  • Intermediate partitions A and B can also create reference files O1.A.ref and O1.B.ref, respectively. Both reference files O1.A.ref and O1.B.ref point to parent partition O’s data file O1.data.
  • This arrangement of data files and reference files allows the intermediate partitions to serve read and write requests for their respective data. For example, reading requests served on intermediate partition A and intermediate partition B are directed to O1.data via pointers in reference files O1.A.ref and O1.B.ref, respectively, while writing requests served on intermediate partition A and intermediate partition B are performed directly on data files A1.data and B1.data.
  • Child partition C comprises a data file C1.data and two reference files A1.C.ref and O1.A.C.ref.
  • Child partition D comprises a data file D1.data and two reference files A1.D.ref and O1.A.D.ref.
  • Reference files A1.C.ref and A1.D.ref point to intermediate partition A’s data file A1.data. Reference files O1.A.C.ref and O1.A.D.ref point to intermediate partition A’s reference file O1.A.ref.
  • When intermediate partition A is split, intermediate partition A is shut down for serving reading and writing requests, and the distributed data storage system initiates data copying from partition O and intermediate partition A into child partition C and child partition D. In some embodiments, the data copying from partition O and intermediate partition A into child partition C and child partition D is conducted through the compaction.
  • Reading requests on child partition C and child partition D are directed to intermediate partition A’s data file A1.data via reference files A1.C.ref and A1.D.ref, and to parent partition O’s data file O1.data via reference files O1.A.C.ref and O1.A.D.ref.
  • The distributed data storage system accesses reference file O1.A.C.ref in child partition C and finds that the reference file has a key range [2000, 2499] and a pointer pointing to another reference file O1.A.ref. The distributed data storage system then opens reference file O1.A.ref in intermediate partition A and finds that the reference file has a key range [2000, 2999] and a pointer pointing to data file O1.data.
  • The distributed data storage system intersects all key ranges to get a final range [2000, 2499], and translates a reading request on reference file O1.A.C.ref to a read request on data file O1.data.
  • Writing requests on child partition C and child partition D are served directly on data files C1.data and D1.data.
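The read path through chained reference files can be illustrated with a short sketch. It follows the pointers from O1.A.C.ref through O1.A.ref to O1.data and intersects the key ranges along the way, using the ranges quoted above. The class and function names (`DataFile`, `ReferenceFile`, `resolve`, `read`) and the sample rows are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

KeyRange = Tuple[int, int]  # inclusive bounds, e.g. (2000, 2499)


@dataclass
class DataFile:
    name: str
    rows: Dict[int, str]


@dataclass
class ReferenceFile:
    name: str
    key_range: KeyRange
    target: Union["ReferenceFile", DataFile]  # pointer to a ref or a data file


def resolve(ref):
    """Follow pointers to the data file, intersecting key ranges on the way."""
    low, high = ref.key_range
    node = ref.target
    while isinstance(node, ReferenceFile):
        low = max(low, node.key_range[0])
        high = min(high, node.key_range[1])
        node = node.target
    return node, (low, high)


def read(ref, key):
    """Translate a read on a reference file into a read on the data file."""
    data_file, (low, high) = resolve(ref)
    if not low <= key <= high:
        return None  # key lies outside this partition's final range
    return data_file.rows.get(key)


# Hypothetical contents for the FIG. 2 files.
o1 = DataFile("O1.data", {2100: "x", 2700: "y"})
o1_a = ReferenceFile("O1.A.ref", (2000, 2999), o1)
o1_a_c = ReferenceFile("O1.A.C.ref", (2000, 2499), o1_a)
```

With these objects, `resolve(o1_a_c)` yields the data file O1.data together with the final range (2000, 2499), so a read for key 2100 is served from O1.data while key 2700 is rejected as outside child partition C's range.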
  • One advantage of the distributed data storage system according to some embodiments of the present disclosure is timely splitting.
  • In a conventional system, intermediate partition A cannot be split until data copying from partition O to intermediate partition A finishes. Since parent partitions are large in size, data copying takes a long time, which causes significant delays in serving reading and writing requests.
  • The distributed data storage system in the present disclosure allows intermediate partition A to be split right away, before data copying finishes, and there is no delay in serving reading and writing requests.
  • Another advantage of the distributed data storage system is reduced data copying.
  • In a conventional system, data copying for child partition C and child partition D cannot initiate until data copying from partition O to intermediate partition A finishes.
  • In the present disclosure, instead of waiting for data copying from partition O to intermediate partition A to finish, data copying for child partition C and child partition D can initiate right after splitting using the compaction, and data copying from partition O to intermediate partition A is abandoned.
  • The distributed data storage system thereby avoids repeated data copying and conserves computing resources for other tasks.
  • the distributed data storage system implements splitting reference counters to keep track of when a partition may be deleted from the system.
  • Each partition creates a splitting reference counter when it is split. Every time a reference file is created from a child partition pointing to a data file in its parent partition, the splitting reference counter of the parent partition is incremented. Every time a child partition’s reference file pointing to its parent partition’s data file is deleted from the system, the splitting reference counter of the parent partition is decremented. When the parent partition’s splitting reference counter reaches 0, the distributed data storage system deletes the parent partition.
  • FIG. 3 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting and splitting reference counters, according to some embodiments of the present disclosure.
  • Parent partition O is split into intermediate partition A and intermediate partition B. Since two reference files O1.A.ref and O1.B.ref were created to point to parent partition O’s data file O1.data, the splitting reference counter of parent partition O is incremented to 2.
  • Intermediate partition A is split further into child partition C and child partition D. Since two reference files A1.C.ref and A1.D.ref were created to point to intermediate partition A’s data file A1.data, the splitting reference counter of intermediate partition A is incremented to 2.
  • When data copying for child partition C finishes, the distributed data storage system deletes the reference files A1.C.ref and O1.A.C.ref. The splitting reference counter of partition A is then decremented from 2 to 1.
  • When data copying for child partition D finishes, the distributed data storage system deletes the reference files A1.D.ref and O1.A.D.ref. The splitting reference counter of partition A is then decremented from 1 to 0, and intermediate partition A is deleted. When intermediate partition A is deleted, reference file O1.A.ref is also deleted, which causes parent partition O’s splitting reference counter to decrement from 2 to 1. If intermediate partition B also deletes reference file O1.B.ref, then parent partition O’s splitting reference counter is decremented from 1 to 0 and parent partition O is thereby deleted.
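The counter bookkeeping in the FIG. 3 walkthrough can be traced with a small model. This is only a sketch under simplifying assumptions: the class `Partition` and its methods are hypothetical names, and each child is assumed to hold exactly one counted reference file to its parent, which matches FIG. 3 but not necessarily the multi-file cases.

```python
class Partition:
    """A partition with a splitting reference counter (illustrative only)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent   # partition this one was split from, if any
        self.split_refs = 0    # splitting reference counter
        self.deleted = False

    def split(self, *child_names):
        """Split into children; each child adds one counted reference file."""
        children = [Partition(n, parent=self) for n in child_names]
        self.split_refs += len(children)
        return children

    def finish_copy(self):
        """Data copying finished: the reference file to the parent is deleted."""
        self.parent._release()

    def _release(self):
        self.split_refs -= 1
        if self.split_refs == 0:
            self.deleted = True
            if self.parent is not None:
                # Deleting this partition also deletes its own reference
                # file, which releases one count on its parent.
                self.parent._release()


# FIG. 3 sequence.
o = Partition("O")
a, b = o.split("A", "B")   # O: 0 -> 2
c, d = a.split("C", "D")   # A: 0 -> 2
c.finish_copy()            # A: 2 -> 1
d.finish_copy()            # A: 1 -> 0, A deleted, O: 2 -> 1
b.finish_copy()            # O: 1 -> 0, O deleted
```

The cascade in `_release` mirrors the text: an intermediate partition is never asked to finish its own copy once it has been split, and it releases its parent only when its last child lets go.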
  • A parent partition contains multiple data files.
  • Each of the parent partition’s data files may be assigned to a single partition or split into different partitions.
  • FIG. 4 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting on multiple data files in the parent partition, according to some embodiments of the present disclosure.
  • Parent partition O of FIG. 4 contains two data files O1.data and O2.data.
  • Data file O1.data is copied into intermediate partition A, and data file O2.data is copied to intermediate partition B.
  • Reference file O1.A.ref from intermediate partition A is created to point to data file O1.data, and reference file O2.B.ref from intermediate partition B is created to point to data file O2.data. Since there are two reference files O1.A.ref and O2.B.ref created to point to data files in parent partition O, the splitting reference counter of parent partition O is incremented to 2.
  • Before intermediate partition A finishes data copying, intermediate partition A is split further into child partition C and child partition D.
  • When the child partitions’ reference files pointing to intermediate partition A’s data file are deleted, intermediate partition A’s splitting reference counter is decremented to 0.
  • Intermediate partition A and its reference file O1.A.ref are then deleted, which causes parent partition O’s splitting reference counter to decrement from 2 to 1.
  • Reference file O2.B.ref is deleted, causing parent partition O’s splitting reference counter to decrement from 1 to 0.
  • The distributed data storage system then deletes parent partition O.
  • FIG. 5 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting involving a 3-way split, according to some embodiments of the present disclosure.
  • Parent partition O of FIG. 5 contains two data files O1.data and O2.data.
  • Parent partition O is split into three partitions: intermediate partition A, intermediate partition B, and intermediate partition E.
  • Reference file O1.A.ref from intermediate partition A is created to point to data file O1.data, reference file O2.B.ref from intermediate partition B is created to point to data file O2.data, and reference file O2.E.ref from intermediate partition E is created to point to data file O2.data. Since there are three reference files O1.A.ref, O2.B.ref, and O2.E.ref created to point to data files in parent partition O, the splitting reference counter of parent partition O is incremented to 3.
  • FIG. 6 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting involving multiple levels of intermediate partitions, according to some embodiments of the present disclosure.
  • Parent partition O of FIG. 6 contains data file O1.data. Parent partition O is split into two partitions: intermediate partition A and intermediate partition B. Reference file O1.A.ref from intermediate partition A is created to point to data file O1.data, and reference file O1.B.ref from intermediate partition B is created to point to data file O1.data. Since there are two reference files O1.A.ref and O1.B.ref created to point to data files in parent partition O, the splitting reference counter of parent partition O is incremented to 2. Before intermediate partition A finishes data copying, intermediate partition A is split further into child partition C and child partition D.
  • Before child partition C finishes data copying, child partition C is split further into child partition F and child partition G. Child partition C becomes an intermediate partition since it is split before the data copying finishes.
  • Child partition F has one data file F1.data and three reference files C1.F.ref, A1.C.F.ref, and O1.A.C.F.ref, and child partition G has one data file G1.data and three reference files C1.G.ref, A1.C.G.ref, and O1.A.C.G.ref.
  • Reference files C1.F.ref and C1.G.ref point to data file C1.data.
  • Reference files A1.C.F.ref and A1.C.G.ref point to A1.C.ref, which points to data file A1.data.
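The file names in FIG. 2 and FIG. 6 appear to follow a regular pattern: a child gets its own data file, plus one reference file for every data file or reference file held by the partition it was split from, named by inserting the child's letter before the extension. The sketch below reproduces the names quoted in the text under that inferred convention. The helper `child_files` is hypothetical and the naming rule is an assumption drawn from the examples, not a rule stated in the disclosure.

```python
def child_files(parent_files, child):
    """Return {file_name: pointer_target} for a child split from a partition.

    parent_files lists the files of the partition being split, such as
    ["A1.data", "O1.A.ref"]; child is the child's letter, such as "C".
    A target of None marks the child's own data file, which serves new writes.
    """
    files = {f"{child}1.data": None}
    for name in parent_files:
        stem, _, ext = name.rpartition(".")  # "O1.A" + "ref", "A1" + "data"
        if ext in ("data", "ref"):
            # New reference file points at the parent's data or reference file.
            files[f"{stem}.{child}.ref"] = name
    return files


a_files = child_files(["O1.data"], "A")     # O split into A
c_files = child_files(list(a_files), "C")   # A split into C
f_files = child_files(list(c_files), "F")   # C split into F
```

Here `f_files` contains F1.data, C1.F.ref (pointing to C1.data), A1.C.F.ref (pointing to A1.C.ref), and O1.A.C.F.ref (pointing to O1.A.C.ref), which matches the three reference files listed for child partition F.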
  • Embodiments of the present disclosure further provides a method for fast partition splitting.
  • FIG. 7 illustrates a flow diagram of an exemplary method 1000 for fast partition splitting in a distributed data storage system, according to some embodiments of the present disclosure. It is appreciated that method 1000 can be performed by a distributed data storage system (e.g., distributed data storage system 100 of FIG. 1A) or by one or more servers (e.g., exemplary server 110 of FIG. 1B).
  • A parent partition having a data file is split into at least two intermediate partitions, each having a corresponding reference file. For example, parent partition O is split into intermediate partitions A and B.
  • In step 1020, data copying is initiated from the parent partition to intermediate partitions, including a first intermediate partition. For example, as shown in FIG. 2, data copying is initiated from parent partition O to intermediate partition A and intermediate partition B.
  • A pointer is established between the data file of the parent partition and a reference file in the first intermediate partition. For example, as shown in FIG. 2, reference file O1.A.ref is created to point to parent partition O’s data file O1.data.
  • The first intermediate partition is split into a plurality of child partitions while data copying for the first intermediate partition is occurring. Each of the child partitions has a corresponding reference file. For example, intermediate partition A is split into child partitions C and D. Child partition C has a reference file O1.A.C.ref and child partition D has a reference file O1.A.D.ref.
  • In step 1050, data copying is initiated from the parent partition and the first intermediate partition to the child partitions. For example, child partition C copies data from data files A1.data and O1.data into C1.data, and child partition D copies data from data files A1.data and O1.data into D1.data.
  • In step 1060, pointers are established between the data file of the parent partition and each of the reference files in the child partitions via the reference file in the first intermediate partition. For example, reference files O1.A.C.ref and O1.A.D.ref are created to point to reference file O1.A.ref, which points to data file O1.data.
  • method 1000 further comprises additional steps involving splitting reference counters.
  • FIG. 8 illustrates a flow diagram of an exemplary method for fast partition splitting incorporating splitting reference counters in a distributed data storage system, according to some embodiments of the present disclosure.
  • Method 1000 in FIG. 8 further comprises step 1035, step 1065, step 1075, step 1085 and step 1095. It is appreciated that method 1000 can be performed by a distributed data storage system (e.g., distributed data storage system 100 of FIG. 1A) or by one or more servers (e.g., exemplary server 110 of FIG. 1B).
  • Step 1035 is performed after step 1030.
  • The parent partition’s splitting reference counter is incremented by the number of new reference files created to point to the parent partition’s data file. For example, as shown in FIG. 3, since two new reference files O1.A.ref and O1.B.ref are created, parent partition O’s splitting reference counter is incremented from 0 to 2.
  • Step 1065 is performed after step 1060.
  • The first intermediate partition’s splitting reference counter is incremented by the number of new reference files created to point to the first intermediate partition’s data file. For example, as shown in FIG. 3, since two new reference files A1.C.ref and A1.D.ref are created to point to intermediate partition A’s data file A1.data, intermediate partition A’s splitting reference counter is incremented from 0 to 2.
  • The first intermediate partition’s splitting reference counter is decremented by 1 each time a child partition finishes data copying and the child’s reference file pointing to the first intermediate partition’s data file is deleted. For example, as shown in FIG. 3, when child partition C finishes data copying, reference file A1.C.ref is deleted, and intermediate partition A’s splitting reference counter is decremented from 2 to 1.
  • The parent partition’s splitting reference counter is decremented by 1 each time an intermediate partition finishes data copying and the intermediate partition’s reference file pointing to the parent partition’s data file is deleted. For example, as shown in FIG. 3, when intermediate partition A’s splitting reference counter reaches 0, intermediate partition A is deleted, which causes parent partition O’s splitting reference counter to decrement from 2 to 1.
  • In step 1095, when a parent partition’s splitting reference counter or an intermediate partition’s splitting reference counter reaches 0, the parent partition or the intermediate partition is deleted. For example, as shown in FIG. 3, when intermediate partition B finishes data copying, reference file O1.B.ref is deleted, which causes parent partition O’s splitting reference counter to decrement from 1 to 0. Parent partition O is then deleted.
  • A computer-readable medium may include removable and non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVDs), etc.
  • program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A distributed data storage system incorporating a fast partition splitting method is provided. The distributed data storage system comprises a parent partition, a plurality of intermediate partitions, and a plurality of child partitions. The parent partition is split into the plurality of intermediate partitions, including a first intermediate partition. While data copying from the parent partition to the first intermediate partition is taking place, the first intermediate partition is split into the plurality of child partitions. Each of the intermediate partitions comprises a reference file pointing to a data file in the parent partition. Each of the child partitions comprises a reference file pointing to the first intermediate partition's reference file, and another reference file pointing to a data file in the first intermediate partition. The distributed data storage system also incorporates splitting reference counters to keep track of when a partition may be deleted from the system. In addition, a method for fast partition splitting, and a non-transitory computer readable medium storing instructions that are executable by one or more processors to perform a method for fast partition splitting are provided. The method can effectively reduce the access overhead that partition splitting causes in conventional systems.

Description

A FAST PARTITION SPLITTING SOLUTION IN DISTRIBUTED DATA STORAGE SYSTEMS
BACKGROUND
In distributed data storage systems, it is often not feasible to store all data in a single table. Instead, such systems often divide a data table into distinct parts called partitions. When the data volume of a table grows, there is a need to split partitions to keep partition sizes manageable. Since partitions often remain inaccessible while the splitting is taking place, partition splitting can cause significant delays in the distributed data storage system. Accordingly, conventional distributed data storage systems need improvements.
SUMMARY
Embodiments of the present disclosure provide a distributed data storage system incorporating a fast partition splitting method. The distributed data storage system comprises a parent partition, a plurality of intermediate partitions, and a plurality of child partitions. The parent partition is split into the plurality of intermediate partitions, including a first intermediate partition. While data copying from the parent partition to the intermediate partitions is taking place, the first intermediate partition is split into the plurality of child partitions. Each of the intermediate partitions comprises a reference file pointing to a data file in the parent partition. Each of the child partitions comprises a reference file pointing to the first intermediate partition’s reference file. The distributed data storage system of the present disclosure also incorporates splitting reference counters to keep track of when a partition may be deleted from the system.
Embodiments of the present disclosure provide a method for fast partition splitting. The method comprises splitting a parent partition into a plurality of intermediate partitions and splitting a first intermediate partition of the plurality of intermediate partitions into a plurality of child partitions while the first intermediate partition copies from the parent partition’s data file to a data file of the first intermediate partition. Splitting the parent partition into intermediate partitions comprises initiating data copying from the parent partition’s data file to the data file of the first intermediate partition of the intermediate partitions and establishing a pointer between the parent partition’s data file and a reference file of the first intermediate partition. Splitting the first intermediate partition into a plurality of child partitions comprises initiating data copying from at least one of the parent partition’s data file or the intermediate partition’s data file into a data file of a child partition of the plurality of child partitions, establishing a first pointer between a first child reference file and the data file of the first intermediate partition, and establishing a second pointer between a second child reference file and the reference file of the first intermediate partition.
Embodiments of the present disclosure also provide a non-transitory computer readable medium that stores instructions that are executable by one or more processors to perform a method for fast partition splitting, the method comprising: splitting a parent partition into a plurality of intermediate partitions and splitting a first intermediate partition of the plurality of intermediate partitions into a plurality of child partitions while the first intermediate partition copies from the parent partition’s data file to a data file of the first intermediate partition. Splitting the parent partition into intermediate partitions comprises initiating data copying from the parent partition’s data file to the data file of the first intermediate partition of the intermediate partitions and establishing a pointer between the parent partition’s data file and a reference file of the first intermediate partition. Splitting the first intermediate partition into a plurality of child partitions comprises initiating data copying from at least one of the parent partition’s data file or the intermediate partition’s data file into a data file of a child partition of the plurality of child partitions, establishing a first pointer between a first child reference file and the data file of the first intermediate partition, and establishing a second pointer between a second child reference file and the reference file of the first intermediate partition.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments and various aspects of the present disclosure are illustrated in the following detailed description and the accompanying figures. Various features shown in the figures are not drawn to scale.
FIG. 1A is a schematic diagram illustrating an exemplary distributed data storage system comprising servers having partitions, according to some embodiments of the present disclosure.
FIG. 1B is a schematic diagram illustrating an exemplary server of a distributed data storage system, according to some embodiments of the present disclosure.
FIG. 2 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting, according to some embodiments of the present disclosure.
FIG. 3 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting and splitting reference counters, according to some embodiments of the present disclosure.
FIG. 4 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting on multiple data files in the parent partition, according to some embodiments of the present disclosure.
FIG. 5 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting involving a 3-way split, according to some embodiments of the present disclosure.
FIG. 6 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting involving multiple levels of intermediate partitions, according to some embodiments of the present disclosure.
FIG. 7 is a flow diagram of an exemplary method for fast partition splitting in a distributed data storage system, according to some embodiments of the present disclosure.
FIG. 8 is a flow diagram of an exemplary method for fast partition splitting incorporating a splitting reference counter in a distributed data storage system, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the disclosure. Instead, they are merely examples of apparatuses and methods consistent with aspects related to the disclosure as recited in the appended claims.
In distributed data storage systems, a data table is usually divided into many partitions. When the size of a partition grows, there is a need to split large partitions into smaller partitions in order to keep partition sizes manageable. Partitions having larger sizes slow down the re-assignment process, which leads to slow recovery, difficulty in load balancing, and potential risks of running out of disk space for the underlying server.
Conventional distributed data storage systems have various shortcomings in partition splitting. When a parent partition is split into two child partitions in a conventional distributed data storage system, access to the parent partition must be shut down first before data in the parent partition is copied into the child partitions. While the data copying is underway, access to the two child partitions is not granted until the data copying is finished. Since the data copying can take a long time to process, data in the parent partition remains unavailable, resulting in a significant decrease of access efficiency in the distributed data storage system.
To mitigate this problem, some conventional systems attempt to optimize the partition splitting by allowing read and write requests to the child partitions while the data copying is underway. For each data file in the parent partition, a reference file is created in the child partition. The reference file comprises a pointer to the parent partition’s data file and a partition key range for the child partition. As a result, a read operation performed on the child partition is translated to a read operation on the parent partition’s data file via the reference file, and the partition key range in the reference file allows the read operation to operate within the child partition’s key range. The distributed data storage system no longer needs to wait for the data copying to finish before performing the read operation. A write operation can be directly served by the child partition.
Although this solution improves access efficiency, it suffers from some significant downsides. While the child partition is copying data from the parent partition, the child partition cannot be split further until the data copying finishes. When the distributed data storage system decides to split a partition, that partition most likely is already under heavy traffic, and there is often a need to split a child partition further immediately after the parent partition has been split. Since data copying often takes a long time to complete, the distributed data storage system would experience significant overhead in managing access to the child partitions that are waiting to be split further.
Embodiments of the present disclosure provide a system incorporating fast partition splitting to mitigate the issues with conventional systems. The fast partition splitting allows instant splitting for child partitions. FIG. 1A is a schematic diagram illustrating an exemplary distributed data storage system comprising servers having partitions, according to some embodiments of the present disclosure. According to FIG. 1A, exemplary distributed data storage system 100 comprises a plurality of servers 1-N, with each server having 10 different partitions. For example, server 1 comprises partitions A1-A10, server 2 comprises partitions B1-B10, and server N comprises partitions N1-N10. Distributed data storage system 100 allows each partition to be re-assigned to a different server. It is appreciated that one or more of these servers can incorporate fast partition splitting.
FIG. 1B is a schematic diagram illustrating an exemplary server 110 of a distributed data storage system, according to some embodiments of the present disclosure. According to FIG. 1B, server 110 comprises a bus 112 or other communication mechanism for communicating information, and one or more processors 116 communicatively coupled with bus 112 for processing information. Processors 116 can be, for example, one or more microprocessors.
Server 110 further comprises storage devices 114, which may include random access memory (RAM), read only memory (ROM), and data storage systems comprised of partitions. Storage devices 114 can be communicatively coupled with processors 116 via bus 112. Storage devices 114 may include a main memory, which can be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processors 116. Such instructions, after being stored in non-transitory storage media accessible to processors 116, render server 110 into a special-purpose machine that is customized to perform operations specified in the instructions.
Server 110 can transmit data to or communicate with another server 130 through a network 122. Network 122 can be a local network, an internet service provider, internet, or any combination thereof. Communication interface 118 of server 110 is connected to network 122. In addition, server 110 can be coupled via bus 112 to peripheral devices 140, which comprise displays (e.g., cathode ray tube (CRT), liquid crystal display (LCD), touch screen, etc.) and input devices (e.g., keyboard, mouse, soft keypad, etc.).
Server 110 can be implemented using customized hard-wired logic, one or more ASICs or FPGAs, firmware, or program logic that in combination with the server causes server 110 to be a special-purpose machine.
The term “non-transitory media” as used herein refers to any non-transitory media storing data or instructions that cause a machine to operate in a specific fashion. Such non-transitory media can comprise non-volatile media and/or volatile media. Non-transitory media include, for example, optical or magnetic disks, dynamic memory, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, flash memory, register, cache, any other memory chip or cartridge, and networked versions of the same.
Various forms of media can be involved in carrying one or more sequences of one or more instructions to processors 116 for execution. For example, the instructions can initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to server 110 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 112. Bus 112 carries the data to the main memory within storage devices 114, from which processors 116 retrieve and execute the instructions.
FIG. 2 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting, according to some embodiments of the present disclosure. In this example, parent partition O comprises a data file O1.data. When parent partition O is split, as shown in FIG. 2, intermediate partition A and intermediate partition B are created.
Moreover, during the split, parent partition O is shut down for serving reading and writing requests, and the distributed data storage system initiates data copying from partition O into intermediate partitions A and B. In some embodiments, data copying is performed using a compaction (e.g., major compaction). In the compaction, all the data files of a target partition (e.g., intermediate partition A and intermediate partition B), including the reference files of the partition, are merged into a single big data file. The big data file is placed in a temporary directory during the compaction. When the merging finishes and the big data file is validated, the big data file is passed over to the partition in a single atomic operation, and the outdated data files and reference files are deleted. The compaction simplifies error handling by preventing the partition from having access to partially finished data files. In addition, the compaction improves reading performance on the partition by reducing the number of data files in the partition.
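The compaction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `compact`, the tab-separated row format, the output name `compacted.data`, and the rule that later sources win on duplicate keys are all hypothetical. Validation of the merged file and deletion of the outdated data files and reference files are omitted.

```python
import os
import tempfile


def compact(partition_dir, sources, key_range):
    """Merge rows from `sources` that fall inside `key_range` into one data
    file, then hand it to the partition in a single rename.

    Each source is a path to a file with one "key<TAB>value" row per line
    (a hypothetical format chosen for this sketch).
    """
    low, high = key_range
    merged = {}
    for path in sources:
        with open(path, encoding="utf-8") as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                if low <= int(key) <= high:
                    merged[int(key)] = value  # assumption: later sources win

    # Build the big data file in a temporary directory so the partition
    # never sees a partially written file.
    tmp_dir = tempfile.mkdtemp(dir=partition_dir, prefix=".compact-")
    tmp_path = os.path.join(tmp_dir, "merged.data")
    with open(tmp_path, "w", encoding="utf-8") as f:
        for key in sorted(merged):
            f.write(f"{key}\t{merged[key]}\n")
        f.flush()
        os.fsync(f.fileno())

    # Publish the finished file in one step. os.replace performs a rename,
    # which is atomic on POSIX when both paths are on the same filesystem.
    final_path = os.path.join(partition_dir, "compacted.data")
    os.replace(tmp_path, final_path)
    os.rmdir(tmp_dir)
    return final_path
```

Because the merged file only appears under its final name after it is complete, a failure during merging leaves at most a stray temporary directory, which is the error-handling simplification the paragraph above refers to.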
Intermediate partitions A and B can create data files A1.data and B1.data, respectively, to store data newly written into intermediate partitions A and B. Intermediate partitions A and B can also create reference files O1.A.ref and O1.B.ref, respectively. Both reference files O1.A.ref and O1.B.ref point to the parent partition’s data file O1.data.
This arrangement of data files and reference files on the intermediate partitions makes it possible to serve read and write requests for the respective data. For example, reading requests served on intermediate partition A and intermediate partition B are directed to O1.data via pointers in reference files O1.A.ref and O1.B.ref, respectively, while writing requests served on intermediate partition A and intermediate partition B are performed directly on data files A1.data and B1.data.
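The read and write routing on an intermediate partition can be sketched as follows. This is a minimal Python illustration rather than the disclosed implementation: the class names, the in-memory dictionaries standing in for data files, and the choice to check the partition's own data file before following the reference file are assumptions made for the example.

```python
class DataFile:
    """In-memory stand-in for a data file such as O1.data (hypothetical)."""

    def __init__(self, name, rows=None):
        self.name = name
        self.rows = dict(rows or {})


class IntermediatePartition:
    """Serves requests while data copying from the parent is still running."""

    def __init__(self, name, key_range, parent_data_file):
        self.name = name
        self.key_range = key_range  # inclusive (low, high)
        self.data_file = DataFile(f"{name}1.data")  # e.g. A1.data
        # The reference file is modelled as a name plus a pointer to the
        # parent's data file.
        stem = parent_data_file.name.rsplit(".", 1)[0]
        self.ref_name = f"{stem}.{name}.ref"  # e.g. O1.A.ref
        self.ref_target = parent_data_file

    def _owns(self, key):
        low, high = self.key_range
        return low <= key <= high

    def write(self, key, value):
        # Writes never touch the parent; they land in the partition's own file.
        if not self._owns(key):
            raise KeyError(f"{key} is outside {self.name}'s key range")
        self.data_file.rows[key] = value

    def read(self, key):
        if not self._owns(key):
            return None
        # Assumption: newly written rows shadow rows still held by the parent.
        if key in self.data_file.rows:
            return self.data_file.rows[key]
        return self.ref_target.rows.get(key)


o1 = DataFile("O1.data", {2100: "old", 3100: "other"})
a = IntermediatePartition("A", (2000, 2999), o1)
a.write(2100, "new")
```

After the write, `a.read(2100)` is served from A1.data while O1.data is left unchanged, and a key outside A's range such as 3100 is not visible through partition A.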
Before the data copying from partition O to intermediate partition A is finished, intermediate partition A is split further into child partition C and child partition D. Child partition C comprises a data file C1.data and two reference files A1.C.ref and O1.A.C.ref, while child partition D comprises a data file D1.data and two reference files A1.D.ref and O1.A.D.ref. Reference files A1.C.ref and A1.D.ref point to intermediate partition A’s data file A1.data. Reference files O1.A.C.ref and O1.A.D.ref point to intermediate partition A’s reference file O1.A.ref, which points to parent partition O’s data file O1.data. When intermediate partition A is split, intermediate partition A is shut down for serving reading and writing requests, and the distributed data storage system initiates data copying from partition O and intermediate partition A into child partition C and child partition D. In some embodiments, the data copying from partition O and intermediate partition A into child partition C and child partition D is conducted through the compaction.
Reading requests on child partition C and child partition D are directed to intermediate partition A’s data file A1.data via reference files A1.C.ref and A1.D.ref and to parent partition O’s data file O1.data via reference files O1.A.C.ref and O1.A.D.ref. For example, to perform a reading request on child partition C, the distributed data storage system accesses reference file O1.A.C.ref in child partition C and finds that the reference file has a key range [2000, 2499] and a pointer pointing to another reference file O1.A.ref. The distributed data storage system then opens reference file O1.A.ref in intermediate partition A and finds that the reference file has a key range [2000, 2999] and a pointer pointing to data file O1.data. The distributed data storage system intersects all key ranges to get a final range [2000, 2499], and translates a reading request on reference file O1.A.C.ref to a read request on data file O1.data filtered by the final range [2000, 2499]. Writing requests on child partition C and child partition D are served directly on data files C1.data and D1.data.
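The chain resolution in this example can be sketched as follows. The sketch assumes inclusive integer key ranges and models files as in-memory Python objects; the names `DataFile`, `ReferenceFile`, `intersect`, `resolve`, and `read` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

KeyRange = Tuple[int, int]  # inclusive (low, high)


@dataclass
class DataFile:
    name: str
    rows: dict  # key -> value


@dataclass
class ReferenceFile:
    name: str
    key_range: KeyRange
    target: Union["ReferenceFile", DataFile]


def intersect(a: KeyRange, b: KeyRange) -> Optional[KeyRange]:
    """Return the overlap of two inclusive ranges, or None if they are disjoint."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None


def resolve(ref: ReferenceFile):
    """Follow pointers until a data file is reached, intersecting key ranges."""
    key_range: Optional[KeyRange] = ref.key_range
    node = ref.target
    while isinstance(node, ReferenceFile):
        if key_range is not None:
            key_range = intersect(key_range, node.key_range)
        node = node.target
    return node, key_range


def read(ref: ReferenceFile, key: int):
    """Translate a read on a reference file into a filtered read on the data file."""
    data_file, key_range = resolve(ref)
    if key_range is None or not (key_range[0] <= key <= key_range[1]):
        return None
    return data_file.rows.get(key)


# The chain from the example: O1.A.C.ref -> O1.A.ref -> O1.data
o1 = DataFile("O1.data", {1500: "x", 2100: "y", 2700: "z"})
o1_a = ReferenceFile("O1.A.ref", (2000, 2999), o1)
o1_a_c = ReferenceFile("O1.A.C.ref", (2000, 2499), o1_a)
```

Here `resolve(o1_a_c)` ends at O1.data with the final range (2000, 2499), so key 2100 is readable through child partition C while key 2700, which belongs to the other half of A's range, is filtered out.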
One advantage of the distributed data storage system according to some embodiments of the present disclosure is timely splitting. In conventional systems, intermediate partition A cannot be split until data copying from partition O to intermediate partition A finishes. Since parent partitions are large in size, data copying takes a long time, which causes significant delays in serving reading and writing requests. The distributed data storage system in the present disclosure allows intermediate partition A to be split right away before data copying finishes, and there is no delay in serving reading and writing requests.
Another advantage of the distributed data storage system according to some embodiments of the present disclosure is reduced data copying. In conventional systems, data copying for child partition C and child partition D cannot initiate until data copying from partition O to intermediate partition A finishes. In the present disclosure, data copying for child partition C and child partition D can initiate right after splitting. In some embodiments, instead of waiting for data copying from partition O to intermediate partition A to finish, data copying for child partition C and child partition D can initiate right after splitting using the compaction, and data copying from partition O to intermediate partition A is abandoned. As a result, the distributed data storage system avoids repeated data copying and conserves valuable computing resources for other important tasks.
In some embodiments of the present disclosure, the distributed data storage system implements splitting reference counters to keep track of when a partition may be deleted from the system. Each partition creates a splitting reference counter when it is split. Every time a reference file is created from a child partition pointing to a data file in its parent partition, the splitting reference counter of the parent partition is incremented. Every time a child partition’s reference file pointing to its parent partition’s data file is deleted from the system, the splitting reference counter of the parent partition is decremented. When the parent partition’s splitting reference counter reaches 0, the distributed data storage system deletes the parent partition.
FIG. 3 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting and splitting reference counters, according to some embodiments of the present disclosure. According to FIG. 3, parent partition O is split into intermediate partition A and intermediate partition B. Since two reference files O1.A.ref and O1.B.ref were created to point to parent partition O’s data file O1.data, the splitting reference counter of parent partition O is incremented to 2. Before data copying is finished from parent partition O to intermediate partition A, intermediate partition A is split further into child partition C and child partition D. Since two reference files A1.C.ref and A1.D.ref were created to point to intermediate partition A’s data file A1.data, the splitting reference counter of intermediate partition A is incremented to 2.
When data copying for child partition C finishes, the distributed data storage system deletes the reference files A1.C.ref and O1.A.C.ref. The splitting reference counter of partition A is then decremented from 2 to 1. When data copying for child partition D finishes, the distributed data storage system deletes the reference files A1.D.ref and O1.A.D.ref. The splitting reference counter of partition A is then decremented from 1 to 0, and intermediate partition A is deleted. When intermediate partition A is deleted, reference file O1.A.ref is also deleted, which causes parent partition O’s splitting reference counter to decrement from 2 to 1. If intermediate partition B also deletes reference file O1.B.ref, then parent partition O’s splitting reference counter is decremented from 1 to 0 and parent partition O is thereby deleted.
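The counter bookkeeping of FIG. 3 can be traced with a short Python sketch. It is a simplification under stated assumptions: each partition holds a single data file, so each new partition contributes exactly one counted reference file to its parent, and the class and method names (`Partition`, `finish_copy`, `release_parent_ref`) are hypothetical.

```python
class Partition:
    """Tracks only the splitting reference counter, not the data itself."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.split_ref_count = 0
        self.deleted = False
        self.holds_parent_ref = False
        if parent is not None:
            # Creating a reference file that points to the parent's data
            # file increments the parent's counter.
            parent.split_ref_count += 1
            self.holds_parent_ref = True

    def release_parent_ref(self):
        """Delete the reference file pointing to the parent's data file."""
        if not self.holds_parent_ref:
            return
        self.holds_parent_ref = False
        self.parent.split_ref_count -= 1
        if self.parent.split_ref_count == 0:
            self.parent.delete()

    def finish_copy(self):
        # Once data copying finishes, the reference files are no longer needed.
        self.release_parent_ref()

    def delete(self):
        # Deleting a partition also deletes its own reference file, which may
        # cascade to the partition it was split from.
        self.deleted = True
        self.release_parent_ref()


o = Partition("O")
a = Partition("A", parent=o)  # O1.A.ref -> O's counter becomes 1
b = Partition("B", parent=o)  # O1.B.ref -> O's counter becomes 2
c = Partition("C", parent=a)  # A1.C.ref -> A's counter becomes 1
d = Partition("D", parent=a)  # A1.D.ref -> A's counter becomes 2

c.finish_copy()  # A: 2 -> 1
d.finish_copy()  # A: 1 -> 0, A is deleted, O: 2 -> 1
b.finish_copy()  # O: 1 -> 0, O is deleted
```

Note that a partition is deleted only when a decrement brings its counter to 0, so partitions that were never split (such as C and D here) are not affected by starting at 0.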
In some embodiments of the present disclosure, a parent partition contains multiple data files. When the parent partition splits into two other partitions, each of the parent partition’s data files may be assigned to a single partition or split into different partitions. FIG. 4 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting on multiple data files in the parent partition, according to some embodiments of the present disclosure. Parent partition O of FIG. 4 contains two data files O1.data and O2.data. When parent partition O is split into intermediate partition A and intermediate partition B, data file O1.data is copied into intermediate partition A, and data file O2.data is copied to intermediate partition B.
Reference file O1.A.ref from intermediate partition A is created to point to data file O1.data, and reference file O2.B.ref from intermediate partition B is created to point to data file O2.data. Since there are two reference files O1.A.ref and O2.B.ref created to point to data files in parent partition O, the splitting reference counter of parent partition O is incremented to 2.
Before intermediate partition A finishes data copying, intermediate partition A is split further into child partition C and child partition D. When child partition C and child partition D finish data copying, intermediate partition A’s splitting reference counter is decremented to 0. As a result, intermediate partition A is deleted, causing parent partition O’s splitting reference counter to decrement from 2 to 1. When intermediate partition B finishes data copying, reference file O2.B.ref is deleted, causing parent partition O’s splitting reference counter to decrement from 1 to 0. The distributed data storage system then deletes parent partition O.
In some embodiments of the present disclosure, the number of new partitions created as a result of splitting is more than 2. FIG. 5 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting involving a 3-way split, according to some embodiments of the present disclosure. Parent partition O of FIG. 5 contains two data files O1.data and O2.data. Parent partition O is split into three partitions: intermediate partition A, intermediate partition B, and intermediate partition E. Reference file O1.A.ref from intermediate partition A is created to point to data file O1.data, reference file O2.B.ref from intermediate partition B is created to point to data file O2.data, and reference file O2.E.ref from intermediate partition E is created to point to data file O2.data. Since there are three reference files O1.A.ref, O2.B.ref and O2.E.ref created to point to data files in parent partition O, the splitting reference counter of parent partition O is incremented to 3. Before intermediate partition A finishes data copying, intermediate partition A is split further into child partition C and child partition D. When child partition C and child partition D finish data copying, intermediate partition A’s splitting reference counter is decremented to 0. As a result, intermediate partition A is deleted, causing parent partition O’s splitting reference counter to decrement from 3 to 2. When intermediate partition B and intermediate partition E finish data copying, reference files O2.B.ref and O2.E.ref are deleted, causing parent partition O’s splitting reference counter to decrement from 2 to 0. The distributed data storage system then deletes parent partition O.
In some embodiments of the present disclosure, there can be multiple levels of intermediate partitions. The name intermediate partition is used to describe partitions that are split before data copying finishes. If a child partition is split before it finishes data copying, the child partition becomes an intermediate partition. FIG. 6 is a schematic diagram illustrating an exemplary distributed data storage system using fast partition splitting involving multiple levels of intermediate partitions, according to some embodiments of the present disclosure. Parent partition O of FIG. 6 contains data file O1.data. Parent partition O is split into two partitions: intermediate partition A and intermediate partition B. Reference file O1.A.ref from intermediate partition A is created to point to data file O1.data, and reference file O1.B.ref from intermediate partition B is created to point to data file O1.data. Since there are two reference files O1.A.ref and O1.B.ref created to point to data files in parent partition O, the splitting reference counter of parent partition O is incremented to 2. Before intermediate partition A finishes data copying, intermediate partition A is split further into child partition C and child partition D.
Before child partition C finishes data copying, child partition C is split further into child partition F and child partition G. Child partition C becomes an intermediate partition since it is split before the data copying finishes. Child partition F has one data file F1.data and three reference files C1.F.ref, A1.C.F.ref and O1.A.C.F.ref, and child partition G has one data file G1.data and three reference files C1.G.ref, A1.C.G.ref and O1.A.C.G.ref. Reference files C1.F.ref and C1.G.ref point to data file C1.data. Reference files A1.C.F.ref and A1.C.G.ref point to A1.C.ref, which points to data file A1.data. Reference files O1.A.C.F.ref and O1.A.C.G.ref point to O1.A.C.ref, which points to O1.A.ref, which in turn points to data file O1.data. When child partition F and child partition G finish data copying, intermediate partition C’s splitting reference counter is decremented to 0. As a result, intermediate partition C is deleted, causing intermediate partition A’s splitting reference counter to decrement from 2 to 1. When child partition D finishes data copying, intermediate partition A’s splitting reference counter is decremented to 0. As a result, intermediate partition A is deleted, causing parent partition O’s splitting reference counter to decrement from 2 to 1. When intermediate partition B finishes data copying, reference file O1.B.ref is deleted, causing parent partition O’s splitting reference counter to decrement from 1 to 0. The distributed data storage system then deletes parent partition O.
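The way reference files accumulate with each level of splitting can be sketched as follows. The naming rule used here (append the new partition's name before `.ref`) is inferred from the file names in FIG. 6 and is an assumption, as is the helper name `child_reference_files`; the sketch only derives names and pointers and does not model key ranges or data.

```python
def child_reference_files(parent_files, child_name):
    """Return {new reference file name: parent file it points to}.

    A new partition gets one reference file per data file and one per
    reference file of the partition it was split from.
    """
    refs = {}
    for name in parent_files:
        stem, ext = name.rsplit(".", 1)
        if ext not in ("data", "ref"):
            raise ValueError(f"unexpected file name: {name}")
        refs[f"{stem}.{child_name}.ref"] = name
    return refs


# Level 1: O is split, producing intermediate partition A.
o_files = ["O1.data"]
a_refs = child_reference_files(o_files, "A")
a_files = ["A1.data", *a_refs]

# Level 2: A is split before its data copying finishes, producing C.
c_refs = child_reference_files(a_files, "C")
c_files = ["C1.data", *c_refs]

# Level 3: C is split before its data copying finishes, producing F.
f_refs = child_reference_files(c_files, "F")
```

Under this rule `f_refs` maps C1.F.ref to C1.data, A1.C.F.ref to A1.C.ref, and O1.A.C.F.ref to O1.A.C.ref, matching the three reference files listed for child partition F. Each additional level adds one reference file and one hop to the longest chain.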
Embodiments of the present disclosure further provide a method for fast partition splitting. FIG. 7 illustrates a flow diagram of an exemplary method 1000 for fast partition splitting in a distributed data storage system, according to some embodiments of the present disclosure. It is appreciated that method 1000 can be performed by a distributed data storage system (e.g., distributed data storage system 100 of FIG. 1A) or by one or more servers (e.g., exemplary server 110 of FIG. 1B).
In step 1010, a parent partition having a data file is split into at least two intermediate partitions, each having a corresponding reference file. For example, as shown in FIG. 2, parent partition O is split into intermediate partitions A and B.
In step 1020, data copying is initiated from the parent partition to intermediate partitions, including a first intermediate partition. For example, as shown in FIG. 2, data copying is initiated from parent partition O to intermediate partition A and intermediate partition B.
In step 1030, a pointer is established between the data file of the parent partition and a reference file in the first intermediate partition. For example, as shown in FIG. 2, reference file O1.A.ref is created to point to parent partition O’s data file O1.data.
In step 1040, the first intermediate partition is split into a plurality of child partitions while data copying for the first intermediate partition is occurring. Each of the child partitions has a corresponding reference file. For example, as shown in FIG. 2, intermediate partition A is split into child partitions C and D. Child partition C has a reference file O1.A.C.ref, and child partition D has a reference file O1.A.D.ref.
In step 1050, data copying is initiated from the parent partition and the first intermediate partition to the child partitions. For example, as shown in FIG. 2, child partition C copies data from data files A1.data and O1.data into C1.data, and child partition D copies data from data files A1.data and O1.data into D1.data.
In step 1060, pointers are established between the data file of the parent partition and each of the reference files in the child partitions via the reference file in the first intermediate partition. For example, as shown in FIG. 2, reference files O1.A.C.ref and O1.A.D.ref are created to point to reference file O1.A.ref, which points to data file O1.data.
In some embodiments, method 1000 further comprises additional steps involving splitting reference counters. FIG. 8 illustrates a flow diagram of an exemplary method for fast partition splitting incorporating splitting reference counters in a distributed data storage system, according to some embodiments of the present disclosure. On the basis of FIG. 7, method 1000 in FIG. 8 further comprises step 1035, step 1065, step 1075, step 1085 and step 1095. It is appreciated that method 1000 can be performed by a distributed data storage system (e.g., distributed data storage system 100 of FIG. 1A) or by one or more servers (e.g., exemplary server 110 of FIG. 1B).
Step 1035 is performed after step 1030. In step 1035, the parent partition’s splitting reference counter is incremented by the number of new reference files created to point to the parent partition’s data file. For example, as shown in FIG. 3, since two new reference files O1.A.ref and O1.B.ref are created, parent partition O’s splitting reference counter is incremented from 0 to 2.
Step 1065 is performed after step 1060. In step 1065, the first intermediate partition’s splitting reference counter is incremented by the number of new reference files created to point to the first intermediate partition’s data file. For example, as shown in FIG. 3, since two new reference files A1.C.ref and A1.D.ref are created to point to intermediate partition A’s data file A1.data, intermediate partition A’s splitting reference counter is incremented from 0 to 2.
In step 1075, the first intermediate partition’s splitting reference counter is decremented by 1 each time a child partition finishes data copying and the child’s reference file pointing to the first intermediate partition’s data file is deleted. For example, as shown in FIG. 3, when child partition C finishes data copying, reference file A1.C.ref is deleted, and intermediate partition A’s splitting reference counter is decremented from 2 to 1.
In step 1085, the parent partition’s splitting reference counter is decremented by 1 each time an intermediate partition finishes data copying and the intermediate partition’s reference file pointing to the parent partition’s data file is deleted. For example, as shown in FIG. 3, when intermediate partition A’s splitting reference counter reaches 0, intermediate partition A is deleted, which causes parent partition O’s splitting reference counter to decrement from 2 to 1.
In step 1095, when a parent partition’s splitting reference counter or an intermediate partition’s splitting reference counter reaches 0, that parent partition or intermediate partition is deleted. For example, as shown in FIG. 3, when intermediate partition B finishes data copying, reference file O1.B.ref is deleted, which causes parent partition O’s splitting reference counter to decrement from 1 to 0. Parent partition O is then deleted.
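The counter bookkeeping of steps 1035 through 1095 can be traced with a small sketch that replays the FIG. 3 walk-through. The `Partition` class and its method names are hypothetical. As a simplifying assumption, a partition that has itself been split releases its reference on its own parent when it is deleted (as in the example for intermediate partition A), while a partition that has not been split releases it when it finishes data copying (as in the example for intermediate partition B).

```python
from typing import List, Optional


class Partition:
    def __init__(self, name: str, parent: Optional["Partition"] = None) -> None:
        self.name = name
        self.parent = parent
        self.splitting_refs = 0  # splitting reference counter
        self.deleted = False

    def split(self, names: List[str]) -> List["Partition"]:
        """Steps 1035/1065: one new reference file per new partition."""
        children = [Partition(n, parent=self) for n in names]
        self.splitting_refs += len(children)
        return children

    def finish_copying(self) -> None:
        """Steps 1075/1085: copying is done, so the reference file is deleted."""
        self._release_parent()

    def _release_parent(self) -> None:
        parent, self.parent = self.parent, None
        if parent is None:
            return
        parent.splitting_refs -= 1
        if parent.splitting_refs == 0:
            # Step 1095: nothing references this partition's data any more.
            parent.deleted = True
            parent._release_parent()


part_o = Partition("O")
part_a, part_b = part_o.split(["A", "B"])  # O: 0 -> 2
part_c, part_d = part_a.split(["C", "D"])  # A: 0 -> 2

part_c.finish_copying()  # A: 2 -> 1
a_refs_after_c = part_a.splitting_refs
part_d.finish_copying()  # A: 1 -> 0, A deleted, O: 2 -> 1
o_refs_after_a = part_o.splitting_refs
part_b.finish_copying()  # O: 1 -> 0, O deleted
```

The sketch does not model an intermediate partition finishing its own data copying while its child partitions are still copying; how the reference files are handled in that case is not covered by this passage.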
The various example embodiments described herein are described in the general context of method steps or processes, which may be implemented in one aspect by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable and non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVDs), etc. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
In the foregoing specification, embodiments have been described with reference to numerous specific details that can vary from implementation to implementation. Certain adaptations and modifications of the described embodiments can be made. Other embodiments can be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims. It is also intended that the sequence of steps shown in the figures is only for illustrative purposes and is not intended to be limited to any particular sequence of steps. As such, those skilled in the art can appreciate that these steps can be performed in a different order while implementing the same method.
In the drawings and specification, there have been disclosed exemplary embodiments. Many variations and modifications, however, can be made to these embodiments. Accordingly, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation, the scope of the embodiments being defined by the following claims.

Claims (30)

  1. A distributed data storage system, comprising:
    a parent partition comprising a parent data file;
    an intermediate partition split from the parent partition, the intermediate partition comprising an intermediate data file and an intermediate reference file pointing to the parent data file, wherein the intermediate partition is configured to initiate data copying while the intermediate partition remains accessible for reading and writing; and
    a first child partition split from the intermediate partition during intermediate partition’s data copying, the first child partition comprising a first child data file and a first and a second child reference files, wherein the first child reference file is configured to point to the intermediate data file, the second child reference file is configured to point to the intermediate reference file, and the first child partition is configured to initiate data copying while the first child partition remains accessible for reading and writing.
  2. The distributed data storage system according to claim 1, further comprising:
    a second child partition split from the intermediate partition during intermediate partition’s data copying, the second child partition comprising a second child data file and a third and a fourth child reference files, wherein the third child reference file is configured to point to the intermediate data file, the fourth child reference file is configured to point to the intermediate reference file, and the second child partition is configured to initiate data copying while the second child partition remains accessible for reading and writing.
  3. The distributed data storage system according to claim 1 or 2, further comprising:
    a third child partition split from the first child partition during first child partition’s data copying, the third child partition comprising a third child data file and a fifth, a sixth and a seventh child reference files, wherein the fifth child reference file is configured to point to the first child’s data file, the sixth child reference file is configured to point to the first reference file, the seventh child reference file is configured to point to the second reference file, and the third child partition is configured to initiate data copying while the third child partition remains accessible for reading and writing.
  4. The distributed data storage system according to any one of claims 1-3, wherein the data copying is performed using a compaction.
  5. The distributed data storage system according to any one of claims 1-4, wherein the intermediate partition further comprises a splitting reference counter configured to:
    increment by a number of new reference files of child partitions that point to the intermediate partition’s data file; and
    decrement by 1 in response to a child partition finishing data copying.
  6. The distributed data storage system according to claim 5, wherein the intermediate partition is further configured to be deleted in response to the splitting reference counter of the intermediate partition reaching 0.
  7. The distributed data storage system according to any one of claims 1-6, wherein an intermediate partition’s intermediate reference file is configured to be deleted in response to the intermediate partition finishing data copying.
  8. The distributed data storage system according to any one of claims 1-7, wherein a child partition’s child reference files are configured to be deleted in response to the child partition finishing data copying.
  9. The distributed data storage system according to any one of claims 1-8, wherein the parent partition further comprises a splitting reference counter configured to:
    increment by a number of new reference files that point to the parent data file; and
    decrement by 1 in response to an intermediate partition finishing data copying.
  10. The distributed data storage system according to claim 9, wherein the parent partition is configured to be deleted in response to the splitting reference counter of the parent partition reaching 0.
  11. A method to split partitions in a distributed data storage system, comprising:
    splitting a parent partition into intermediate partitions, the splitting comprising:
    initiating a data copying from a parent partition to a first intermediate partition of the intermediate partitions, and
    establishing a pointer between a data file of the parent partition and a reference file of the first intermediate partition; and
    splitting the first intermediate partition into a plurality of child partitions while the data copying from the parent partition to the first intermediate partition is occurring, the splitting of the first intermediate partition comprising:
    initiating a data copying from at least one of the parent partition or the intermediate partition to a first child partition of the plurality of child partitions,
    establishing a first pointer between a first child reference file in the first child partition and a data file of the first intermediate partition, and
    establishing a second pointer between a second child reference file in the first child partition and the reference file of the first intermediate partition.
  12. The method according to claim 11, wherein splitting the first intermediate partition into a plurality of child partitions further comprising:
    initiating a data copying from at least one of the parent partition or the intermediate partition to a second child partition of the plurality of child partitions,
    establishing a third pointer between a third child reference file in the second child partition and the data file of the first intermediate partition, and
    establishing a fourth pointer between a fourth child reference file in the second child partition and the reference file of the first intermediate partition.
  13. The method according to claim 11 or 12, further comprising:
    splitting the first child partition while the data copying for the first child partition is occurring, the splitting of the first child partition comprising:
    initiating a data copying from at least one of the parent partition, the first intermediate partition or the first child partition to a third child partition,
    establishing a fifth pointer between a fifth child reference file in the third child partition and a data file of the first child partition,
    establishing a sixth pointer between the sixth child reference file in the third child partition and the first child reference file in the first child partition, and
    establishing a seventh pointer between the seventh child reference file in the third child partition and the second child reference file in the first child partition.
  14. The method according to any one of claims 11-13, wherein the data copying is performed using a compaction.
  15. The method according to any one of claims 11-14, wherein splitting the first intermediate partition into a plurality of child partitions further comprising establishing a splitting reference counter for the first intermediate partition, wherein the splitting reference counter is incremented by a number of new reference files of child partitions that point to the first intermediate partition’s data file and is decremented by 1 in response to a child partition finishing data copying.
  16. The method according to claim 15, wherein splitting the first intermediate partition into a plurality of child partitions further comprising:
    deleting the first intermediate partition in response to the splitting reference counter of the first intermediate partition reaching 0.
  17. The method according to any one of claims 11-16, wherein splitting the parent partition into intermediate partitions further comprising:
    deleting the reference file of the first intermediate partition in response to the first intermediate partition finishing data copying.
  18. The method according to any one of claims 11-17, wherein splitting the first intermediate partition into a plurality of child partitions further comprising:
    deleting the first child reference file and the second child reference file in response to the first child partition finishing data copying.
  19. The method according to any one of claims 11-18, wherein splitting the parent partition into intermediate partitions further comprising establishing a splitting reference counter for the parent partition, wherein the splitting reference counter is incremented by a number of new reference files that point to the data file of the parent partition and is decremented by 1 in response to an intermediate partition finishing data copying.
  20. The method according to claim 19, wherein splitting the parent partition into intermediate partitions further comprising deleting the parent partition in response to the splitting reference counter of the parent partition reaching 0.
  21. A non-transitory computer readable medium that stores a set of instructions that are executable by one or more processors of an apparatus to perform a method to split partitions in a distributed data storage system, the method comprising:
    splitting a parent partition into intermediate partitions, the splitting comprising:
    initiating a data copying from a parent partition to a first intermediate partition of the intermediate partitions, and
    establishing a pointer between a data file of the parent partition and a reference file of the first intermediate partition; and
    splitting the first intermediate partition into a plurality of child partitions while the data copying from the parent partition to the first intermediate partition is occurring, the splitting of the first intermediate partition comprising:
    initiating a data copying from at least one of the parent partition or the intermediate partition to a first child partition of the plurality of child partitions,
    establishing a first pointer between a first child reference file in the first child partition and a data file of the first intermediate partition, and
    establishing a second pointer between a second child reference file in the first child partition and the reference file of the first intermediate partition.
  22. The medium according to claim 21, wherein splitting the first intermediate partition into a plurality of child partitions further comprising:
    initiating a data copying from at least one of the parent partition or the intermediate partition to a second child partition of the plurality of child partitions,
    establishing a third pointer between a third child reference file in the second child partition and the data file of the first intermediate partition, and
    establishing a fourth pointer between a fourth child reference file in the second child partition and the reference file of the first intermediate partition.
  23. The medium according to claim 21 or 22, wherein the set of instructions that are executable by one or more processors of the apparatus to further perform:
    splitting the first child partition while the data copying for the first child partition is occurring, the splitting of the first child partition comprising:
    initiating a data copying from at least one of the parent partition, the first intermediate partition or the first child partition to a third child partition,
    establishing a fifth pointer between a fifth child reference file in the third child partition and a data file of the first child partition,
    establishing a sixth pointer between the sixth child reference file in the third child partition and the first child reference file in the first child partition, and
    establishing a seventh pointer between the seventh child reference file in the third child partition and the second child reference file in the first child partition.
  24. The medium according to any one of claims 21-23, wherein the data copying is performed using a compaction.
  25. The medium according to any one of claims 21-24, wherein splitting the first intermediate partition into a plurality of child partitions further comprising establishing a splitting reference counter for the first intermediate partition, wherein the splitting reference counter is incremented by a number of new reference files of child partitions that point to the first intermediate partition’s data file and is decremented by 1 in response to a child partition finishing data copying.
  26. The medium according to claim 25, wherein splitting the first intermediate partition into a plurality of child partitions further comprising:
    deleting the first intermediate partition in response to the splitting reference counter of the first intermediate partition reaching 0.
  27. The medium according to any one of claims 21-26, wherein splitting the parent partition into intermediate partitions further comprising:
    deleting the reference file of the first intermediate partition in response to the first intermediate partition finishing data copying.
  28. The medium according to any one of claims 21-27, wherein splitting the first intermediate partition into a plurality of child partitions further comprising:
    deleting the first child reference file and the second child reference file in response to the first child partition finishing data copying.
  29. The medium according to any one of claims 21-28, wherein splitting the parent partition into intermediate partitions further comprising establishing a splitting reference counter for the parent partition, wherein the splitting reference counter is incremented by a number of new reference files that point to the data file of the parent partition and is decremented by 1 in response to an intermediate partition finishing data copying.
  30. The medium according to claim 29, wherein splitting the parent partition into intermediate partitions further comprising deleting the parent partition in response to the splitting reference counter of the parent partition reaching 0.
PCT/CN2020/072149 2020-01-15 2020-01-15 Fast partition splitting solution in distributed data storage systems WO2021142643A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/072149 WO2021142643A1 (en) 2020-01-15 2020-01-15 Fast partition splitting solution in distributed data storage systems
CN202080083354.0A CN114761913A (en) 2020-01-15 2020-01-15 Fast partition splitting solution in distributed data storage system

Publications (1)

Publication Number Publication Date
WO2021142643A1 true WO2021142643A1 (en) 2021-07-22

Family

ID=76863402

Country Status (2)

Country Link
CN (1) CN114761913A (en)
WO (1) WO2021142643A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590553A (en) * 2021-08-03 2021-11-02 京东科技控股股份有限公司 Account checking method and device, electronic equipment and computer readable medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024019A (en) * 2010-11-04 2011-04-20 曙光信息产业(北京)有限公司 Suffix tree based catalog organizing method in distributed file system
WO2011078966A1 (en) * 2009-12-22 2011-06-30 Apple Inc. Methods and apparatuses to allocate file storage via tree representations of a bitmap
CN104133867A (en) * 2014-07-18 2014-11-05 中国科学院计算技术研究所 DOT in-fragment secondary index method and DOT in-fragment secondary index system
CN104731921A (en) * 2015-03-26 2015-06-24 江苏物联网研究发展中心 Method for storing and processing small log type files in Hadoop distributed file system
US20180349095A1 (en) * 2017-06-06 2018-12-06 ScaleFlux, Inc. Log-structured merge tree based data storage architecture

Also Published As

Publication number Publication date
CN114761913A (en) 2022-07-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20913587

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20913587

Country of ref document: EP

Kind code of ref document: A1