WO2016118176A1 - Database management - Google Patents

Publication number
WO2016118176A1
Authority
WO
WIPO (PCT)
Prior art keywords
database
shard
data
updated
file
Application number
PCT/US2015/022064
Inventor
Ramesh Kannan KARUPPUSAMY
Annmary Justine KOOMTHANAM
Jothivelavan SIVASHANMUGAM
Rajkumar Kannan
Kiran Kumar MALLE GOWDA
Original Assignee
Hewlett Packard Enterprise Development Lp
Application filed by Hewlett Packard Enterprise Development Lp

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278 Data partitioning, e.g. horizontal or vertical partitioning

Definitions

  • Instructions 312 may be executed by processor 302 to identify updated database files in the shard, and merge data stored in the updated database files into a single database file in the shard, if the parameter related to the shard exceeds a pre-defined threshold.
  • FIG. 2 is shown as executing serially; however, it is to be understood and appreciated that the present and other examples are not limited by the illustrated order.
  • Embodiments within the scope of the present solution may also include program products comprising non-transitory computer-readable media for carrying or having computer-executable instructions or data structures stored thereon.
  • Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer.
  • such computer-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM, magnetic disk storage or other storage devices, or any other medium which can be used to carry or store desired program code in the form of computer-executable instructions and which can be accessed by a general purpose or special purpose computer.
  • the computer readable instructions can also be accessed from memory and executed by a processor.

Abstract

Some examples relate to database management. In an example, data stored in a database may be sharded into a plurality of shards, wherein the database is coupled to a file system. Upon receiving a data update for the database, the data update may be applied only to a database file that stores data affected by the update, to generate an updated database file. A parameter related to a shard that includes the updated database file may be determined. If the parameter related to the shard exceeds a pre-defined threshold, updated database files may be identified in the shard, and data stored in the updated database files may be merged into a single database file in the shard.

Description

DATABASE MANAGEMENT
Background
[001] Databases have become an integral part of modern day computing. Whether it is a small start-up or a large enterprise, organizations may need to deal with a vast amount of data these days, which could range from a few terabytes to multiple petabytes of data. Databases provide a useful way of organizing data. Such data is usually accessed via a database management system that allows entry, storage and retrieval of data from a database.
Brief Description of the Drawings
[002] For a better understanding of the solution, embodiments will now be described, purely by way of example, with reference to the accompanying drawings, in which:
[003] FIG. 1 is a block diagram of a computing environment for managing a database, according to an example;
[004] FIG. 2 is a flowchart of an example method for managing a database; and
[005] FIG. 3 is a block diagram of an example computer system for managing a database.
Detailed Description
[006] Data management is vital to success of an organization. Whether it is a private company, a government undertaking, an educational institution, a hospital, or a new start-up, managing data (for example, customer data, vendor data, patient data, etc.) in an appropriate manner is crucial to existence and growth of an enterprise. Computer databases play a useful role in this regard. A computer database allows an organized collection of data, which may be analyzed, for instance, with the help of a database management system, to derive useful information for a user.
[007] Among various factors, an increase in adoption of technology by various businesses (for example, online ecommerce portals) has led to an explosion of data that may entail management of large databases by database administrators. Managing a large database may be a challenging task. It may be further demanding if the database is coupled or integrated with another computer program (for example, a file system), in other words, if the database is an "embedded database". It may also be challenging if a large database is required to be managed in a distributed environment, in other words, if the database is a "distributed database". A distributed database is a database in which portions of the database are stored on multiple computers within a network. Such computers may be located in the same physical location or may be dispersed over a wider geographical area. A distributed database system thus consists of loosely-coupled sites that share no physical components.
[008] One of the challenges of managing a large database that couples with a file system is that as the number of database rows increases due to an increase in the number of objects on the file system or due to a large number of updates to an existing file object, a large number of database rewrites may be periodically required to keep the database table fresh for addressing queries from clients. In other words, the database file may need to be rewritten every time a new set of updates is to be inserted into the table in order to reflect the latest state of the file system. This may impose a huge burden on the entire system in terms of I/O, memory footprint and CPU resources, since every update may involve a rewrite of the database table, which may lead to a large number of redundant I/O rewrites on the file system.
[009] To address this issue, the present disclosure describes various examples for managing a database. In an example, data stored in a database may be sharded into a plurality of shards, wherein the database is coupled to a file system. Upon receiving a data update for the database, the data update may be applied only to a database file that stores data affected by the update, to generate an updated database file. A parameter related to a shard that includes the updated database file may be determined or tracked. If the parameter related to the shard exceeds a pre-defined threshold, all updated database files may be identified in the shard, and data stored in the updated database files may be merged into a single database file in the shard.
[0010] As used herein, the term "sharding" refers to a form of database partitioning that is used to separate a large database into smaller pieces called database shards or "shards". Data records in shards may typically be spread over multiple devices, for example, computer servers.
[0011] FIG. 1 is a block diagram of a computing environment for managing a database, according to an example. Computing environment 100 may include a computing device 102, a file system 104, and a database 106. The aforementioned components of the computing environment, i.e., 102, 104, and 106, may be in communication with each other, for example, via a computer network 108. Such a computer network 108 may be a wireless or wired network. Computer network 108 may include, for example, a Local Area Network (LAN), a Wide Area Network (WAN), a Metropolitan Area Network (MAN), a Storage Area Network (SAN), a Campus Area Network (CAN), or the like. Further, computer network 108 may be a public network (for example, the Internet) or a private network (for example, an intranet).
[0012] Computing device 102 generally represents any type of computing system capable of reading machine-executable instructions. Examples of computing device 102 may include, without limitation, a server, a desktop computer, a notebook computer, a tablet computer, a thin client, a mobile device, a personal digital assistant (PDA), a phablet, and the like.
[0013] File system 104 may be used for entry, storage and retrieval of data from the database. The file system 104 may include one or more file system objects. Some non-limiting examples of a file system object may include a file, a directory, an access control list (ACL), and the like. File system 104 may be a local file system or a scale-out file system such as a shared file system or a network file system. Examples of a shared file system may include a Network Attached Storage (NAS) file system or a cluster file system. Examples of a network file system may include a distributed file system or a distributed parallel file system. File system 104 may communicate with computing device 102 and database 106, for example, via a suitable protocol. Some non-limiting examples of such a protocol may include Network File System (NFS) protocol, Common Internet File System (CIFS) protocol, Hyper Text Transfer Protocol (HTTP), File Transfer Protocol (FTP), and the like. In an example, the file system 104 may be an extent-based file system.
[0014] Database 106 may be a repository that stores an organized collection of data. In an example, the database may store data in extents. An "extent" may be defined as a set of contiguous blocks allocated in a database. In an example, database 106 may be a distributed database that provides high query rates and high-throughput updates using a batching process. Database 106 may use a pipelined architecture that provides access to update batches at various points through processing. In an instance, database 106 may be based on a batched update model, which decouples update processing from read-only queries (i.e. query processing task). In this model, the updates may be batched and processed in the background, and do not interfere with the foreground query workload. Database 106 may allow different stages of the updates in the pipeline to be queried independently. Queries that could use slightly out-of-date data may use only the final output of the pipeline, which may correspond to the completely ingested and indexed data. Queries that require even fresher results may access data at any stage in the pipeline.
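By way of a non-limiting illustration, the batched update model described above might be sketched as follows. The class name `BatchPipeline`, its methods, and the `fresh` flag are hypothetical names chosen for this sketch and do not appear in the disclosure; the sketch models only two pipeline stages (pending batches and fully ingested data), whereas an actual pipeline may have more.

```python
class BatchPipeline:
    """Toy two-stage pipeline: pending update batches and ingested data.

    All names here are illustrative assumptions, not part of the disclosure.
    """

    def __init__(self):
        self.pending = []   # batches received but not yet ingested
        self.ingested = {}  # final pipeline output: ingested and indexed data

    def submit(self, batch):
        """Queue a batch of updates (a dict of key -> value)."""
        self.pending.append(batch)

    def ingest_next(self):
        """Background step: fold the oldest pending batch into the final output."""
        if self.pending:
            self.ingested.update(self.pending.pop(0))

    def query(self, key, fresh=False):
        """Read a key; with fresh=True, also look at batches still in the pipeline."""
        if fresh:
            # Newest pending batch wins over older ones.
            for batch in reversed(self.pending):
                if key in batch:
                    return batch[key]
        return self.ingested.get(key)
```

In this sketch a query that tolerates slightly out-of-date data calls `query(key)` and sees only ingested data, while a query that needs fresher results passes `fresh=True` and may also see updates that are still in the pipeline.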
[0015] In an example, the database 106 may be a metadata database that stores metadata related to unstructured data. Examples of unstructured data may include documents, audio, video, images, files, body of an e-mail message, Web page, or word-processor document. In an example, the database 106 may be integrated into the file system 104.
[0016] In the example of FIG. 1, computing device 102 may include a file system object module 110, a data update module 112, a determination module 114, and a merge module 116. The term "module" may refer to a software component (machine readable instructions), a hardware component or a combination thereof. A module may include, by way of example, components, such as software components, processes, tasks, co-routines, functions, attributes, procedures, drivers, firmware, data, databases, data structures, Application Specific Integrated Circuits (ASIC) and other computing devices. A module may reside on a volatile or non-volatile storage medium and may be configured to interact with a processor of computing device 102.
[0017] File system object module 110 may track the number of file system objects in a file system (for example, 104) coupled to a database (for example, 106). In an example, if the file system object module 110 determines that the number of file system objects in the file system 104 coupled to the database 106 exceeds a pre-defined threshold, the file system object module 110 may shard the data stored in the database into a plurality of shards. In another example, the database may be sharded based on some other criterion. Sharding of the database partitions the data stored therein into smaller databases called database shards or "shards". Upon partition, each of the plurality of shards may store a portion or subset of the data stored in the database 106. In an example, further to partition, database shards may be stored over multiple devices, for example, servers. Such devices may be co-located or spread over a wider geographical region. Further, devices hosting such database shards may be in communication with each other and database 106, for example, via a network. Such a network may be a wired or wireless network, which may be similar to the network described above.
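As a minimal sketch of the threshold check described above, and not a definitive implementation, the following function shards a set of rows only once the tracked object count exceeds a threshold. The function name `shard_if_needed`, the integer row keys, and the use of a simple modulo to pick a shard are assumptions made for illustration.

```python
from collections import defaultdict


def shard_if_needed(rows, object_count, threshold, num_shards):
    """Partition rows into shards once object_count exceeds threshold.

    rows: dict mapping an integer object identifier to its stored data.
    Returns None when no sharding is needed, else a dict of
    shard index -> {identifier: data}. The modulo placement is an
    illustrative assumption only.
    """
    if object_count <= threshold:
        return None
    shards = defaultdict(dict)
    for object_id, data in rows.items():
        shards[object_id % num_shards][object_id] = data
    return dict(shards)
```

Each resulting shard holds a subset of the original rows, and every row lands in exactly one shard.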
[0018] In an example, the database 106 may be sharded based on a function of the Persistent Object Identifier (POID) of a file system object in the file system 104. A hashing function may be used that takes into account the POID of a file system object to generate a hashing index that maps to one of a fixed number of shards. As mentioned above, each of such shards may contain a subset of file system objects in the file system 104.
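For illustration only, a POID-based hashing index of the kind described above might look like the following sketch. The disclosure does not name a particular hashing function; SHA-256, the constant `NUM_SHARDS`, and the function name `shard_index` are assumptions chosen for this example.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical fixed shard count


def shard_index(poid, num_shards=NUM_SHARDS):
    """Map a POID to a shard index in the range [0, num_shards).

    SHA-256 is an assumed choice; any stable hash would serve the sketch.
    """
    digest = hashlib.sha256(str(poid).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the mapping depends only on the POID and the fixed shard count, the same file system object always resolves to the same shard, which is what allows a later update to be routed without searching every shard.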
[0019] Data update module 112 may, upon receipt of a data update for the database, apply the data update only to a database file that stores data affected by the update, to generate an updated database file. In other words, further to sharding of the database into a plurality of shards, if a data update is received for the database, data update module 112 may determine which database file in a shard, among the plurality of shards, stores data that needs to be updated. Upon such determination, data update module 112 may apply the data update to such database file only. In an example, the data update may be applied to an extent(s) in an identified database file. In like manner, data update module 112 may apply data updates to appropriate database files in a shard, thereby leading to a scenario where there may be a plurality of updated database files in a shard.
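A minimal sketch of the selective update described above follows, assuming each database file can be modeled as an in-memory mapping of keys to values. The `DatabaseFile` class, its `updated` flag, and the `apply_update` function are hypothetical constructs for illustration, not structures defined in the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseFile:
    """Hypothetical in-memory stand-in for one database file in a shard."""
    name: str
    rows: dict = field(default_factory=dict)
    updated: bool = False


def apply_update(shard_files, key, value):
    """Apply an update only to the file that already stores the key.

    Returns the file that was modified, or None if no file in the
    shard stores the key. Other files are left untouched.
    """
    for db_file in shard_files:
        if key in db_file.rows:
            db_file.rows[key] = value
            db_file.updated = True  # remember that this file now differs
            return db_file
    return None
```

Repeated calls may leave several files in the same shard flagged as updated, which corresponds to the plurality of updated database files mentioned above.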
[0020] Determination module 114 may determine a parameter(s) related to the plurality of shards that are generated upon sharding of the database 106. In an instance, determination module 114 may track a parameter(s) related to a shard that includes an updated database file(s). In an example, the parameter may include the amount of data fragmentation that may occur in a shard if a data update is applied to such shard. Since a database shard may undergo multiple data updates over a period of time (for instance, a number of rows may get updated), data fragmentation may occur in the shard. In an example, if multiple data updates are applied to extents in a shard over a course of time, extent fragmentation may occur in the shard. Determination module 114 may determine or track such data fragmentation in the database shards that are generated upon sharding of the database. Thus, in an instance, determination module 114 may act as a "fragmentation counter" that tracks the amount of data fragmentation in a generated database shard.
[0021] In another example, the parameter may include the number of queries handled by a shard that is generated upon sharding of the database 106. A database shard may handle a number of queries from one or more client systems over a course of time. Determination module 114 may determine or track the number of queries handled by a shard or each of the shards that are generated upon sharding of the database 106.
[0022] In another example, the parameter may include the number of updates that are applied to a database shard. A number of updates may be applied to a shard over a course of time. Determination module 114 may determine or track the number of updates applied to a shard or each of the shards that are generated upon sharding of the database 106. The aforementioned are just some non-limiting examples of the parameter related to a database shard which may be determined by determination module 114.
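The three example parameters above (fragmentation, queries handled, and updates applied) could be tracked per shard along the lines of the following sketch. The class `ShardStats`, its field names, and the idea of counting touched extents as a proxy for fragmentation are assumptions made for illustration; the disclosure does not prescribe how fragmentation is measured.

```python
from dataclasses import dataclass


@dataclass
class ShardStats:
    """Hypothetical per-shard counters for the example parameters."""
    fragmented_extents: int = 0  # assumed proxy for data fragmentation
    queries: int = 0
    updates: int = 0

    def record_update(self, extents_touched=1):
        """Count one applied update and the extents it touched."""
        self.updates += 1
        self.fragmented_extents += extents_touched

    def record_query(self):
        """Count one query handled by the shard."""
        self.queries += 1

    def exceeds(self, thresholds):
        """Return True if any tracked parameter is above its threshold.

        thresholds: dict of field name -> limit, which may differ per shard.
        """
        return any(getattr(self, name) > limit
                   for name, limit in thresholds.items())
```

Keeping one `ShardStats` instance and one threshold dictionary per shard would allow different shards to use different limits.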
[0023] In an example, if determination module 114 determines that a parameter related to a database shard that includes one or more updated database files exceeds a pre-defined threshold, merge module 116 may identify all updated database files in the shard, and merge data stored in the updated database files into a single database file in the shard. In other words, merge module 116 may cause defragmentation of the shard. In an example, if a shard's data is stored in extents, merge module 116 may, in such case, cause defragmentation of the extents in the shard. In other words, merge module 116 may merge data stored in updated extents of a shard into a single database file in the shard. In an example, a separate pre-defined threshold may be defined for a parameter for each of the plurality of shards that are generated upon sharding of the database 106.
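As a non-authoritative sketch of the merge step described above, the following function collects the updated files of a shard and folds their rows into one new file, leaving files that were never updated as they are. Modeling a database file as a dictionary with `name`, `rows`, and `updated` keys, and the default name `merged.db`, are assumptions for this example.

```python
def merge_updated_files(shard_files, merged_name="merged.db"):
    """Merge all updated files of a shard into a single file.

    shard_files: list of dicts with keys "name", "rows" (dict) and
    "updated" (bool). Returns a new list in which the updated files
    are replaced by one merged file; untouched files are kept as-is.
    """
    updated = [f for f in shard_files if f["updated"]]
    if not updated:
        return list(shard_files)
    merged_rows = {}
    for f in updated:
        merged_rows.update(f["rows"])
    untouched = [f for f in shard_files if not f["updated"]]
    merged = {"name": merged_name, "rows": merged_rows, "updated": False}
    return untouched + [merged]
```

Only the files flagged as updated are read and rewritten, so the cost of the merge scales with the amount of updated data rather than with the size of the whole shard.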
[0024] FIG. 2 is a flowchart of an example method for managing a database. The method 200, which is described below, may at least partially be executed on a computing device 102 of FIG. 1. However, other computing devices may be used as well. At block 202, data stored in a database (for example, 106) may be sharded into a plurality of shards, wherein the database is coupled to a file system (for example, 104). Upon sharding, each of the plurality of shards may store a portion of the data. At block 204, upon receiving a data update for the database 106, the data update may be applied only to a database file that stores data affected by the update, thereby generating an updated database file. At block 206, a parameter related to a shard that includes the updated database file may be determined. At block 208, if the parameter related to the shard exceeds a pre-defined threshold, all those database files that were updated in the shard are identified, and data stored in the updated database files is merged into a single database file in the shard.
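Blocks 202 through 208 of method 200 can be illustrated end to end with a small in-memory sketch. Everything concrete here is an assumption for illustration: integer row keys, a modulo function standing in for whatever sharding function is used, fixed-size "files" modeled as dictionaries, and the number of updated files in a shard serving as the parameter.

```python
NUM_SHARDS = 2
ROWS_PER_FILE = 2
THRESHOLD = 1  # hypothetical: merge once more than one file in a shard is updated


def shard_database(rows):
    """Block 202: split rows into shards; each shard holds a list of small files."""
    shards = [{"files": [], "updated": set()} for _ in range(NUM_SHARDS)]
    for key in sorted(rows):
        files = shards[key % NUM_SHARDS]["files"]
        if not files or len(files[-1]) >= ROWS_PER_FILE:
            files.append({})  # start a new file when the last one is full
        files[-1][key] = rows[key]
    return shards


def apply_update(shards, key, value):
    """Blocks 204-208 for a single update. Returns True if a merge happened."""
    shard = shards[key % NUM_SHARDS]
    # Block 204: touch only the file that stores the affected row.
    for index, db_file in enumerate(shard["files"]):
        if key in db_file:
            db_file[key] = value
            shard["updated"].add(index)
            break
    else:
        raise KeyError(key)
    # Block 206: the parameter here is the count of updated files in the shard.
    parameter = len(shard["updated"])
    # Block 208: merge the updated files into a single file when over threshold.
    if parameter > THRESHOLD:
        merged = {}
        for index in sorted(shard["updated"]):
            merged.update(shard["files"][index])
        untouched = [f for i, f in enumerate(shard["files"])
                     if i not in shard["updated"]]
        shard["files"] = untouched + [merged]
        shard["updated"] = set()
        return True
    return False


rows = {k: f"v{k}" for k in range(8)}
shards = shard_database(rows)
first = apply_update(shards, 0, "new0")   # one updated file: below threshold
second = apply_update(shards, 6, "new6")  # two updated files: triggers merge
print(first, second, shards[0]["files"])
```

Only the shard that received the updates is merged; the other shard keeps its original files untouched.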
[0025] FIG. 3 is a block diagram of an example system 300 for managing a database. System 300 includes a processor 302 and a machine-readable storage medium 304 communicatively coupled through a system bus. In an example, system 300 may be analogous to computing device 102 of FIG. 1. Processor 302 may be any type of Central Processing Unit (CPU), microprocessor, or processing logic that interprets and executes machine-readable instructions stored in machine-readable storage medium 304. Machine-readable storage medium 304 may be a random access memory (RAM) or another type of dynamic storage device that may store information and machine-readable instructions that may be executed by processor 302. For example, machine-readable storage medium 304 may be Synchronous DRAM (SDRAM), Double Data Rate (DDR), Rambus DRAM (RDRAM), Rambus RAM, etc. or a storage memory media such as a floppy disk, a hard disk, a CD-ROM, a DVD, a pen drive, and the like. In an example, machine-readable storage medium 304 may be a non-transitory machine-readable medium. Machine-readable storage medium 304 may store instructions 306, 308, 310, and 312. In an example, instructions 306 may be executed by processor 302 to shard a database (for example, 106) into a plurality of shards, wherein the database may be coupled to a file system (for example, 104). Upon sharding, each of the plurality of shards may store a subset of data stored in the database 106. Instructions 308 may be executed by processor 302 to apply, upon receipt of a data update, the data update only to a database file that stores data affected by the update, to generate an updated database file. Instructions 310 may be executed by processor 302 to determine a parameter related to a shard that includes the updated database file.
Instructions 312 may be executed by processor 302 to identify updated database files in the shard, and merge data stored in the updated database files into a single database file in the shard, if the parameter related to the shard exceeds a pre-defined threshold.
[0026] For the purpose of simplicity of explanation, the example method of FIG. 2 is shown as executing serially; however, it is to be understood and appreciated that the present and other examples are not limited by the illustrated order. The example systems of FIGS. 1 and 3, and method of FIG. 2 may be implemented in the form of a computer program product including computer-executable instructions, such as program code, which may be run on any suitable computing device in conjunction with a suitable operating system (for example, Microsoft Windows, Linux, UNIX, and the like). Embodiments within the scope of the present solution may also include program products comprising non-transitory computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, such computer-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM, magnetic disk storage or other storage devices, or any other medium which can be used to carry or store desired program code in the form of computer-executable instructions and which can be accessed by a general purpose or special purpose computer. The computer readable instructions can also be accessed from memory and executed by a processor.
[0027] It should be noted that the above-described examples of the present solution are for the purpose of illustration only. Although the solution has been described in conjunction with a specific embodiment thereof, numerous modifications may be possible without materially departing from the teachings and advantages of the subject matter described herein. Other substitutions, modifications and changes may be made without departing from the spirit of the present solution. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.

Claims

1. A method of managing a database, comprising:
sharding data stored in a database into a plurality of shards, wherein the database is coupled to a file system;
upon receiving data update for the database, applying the data update only to a database file that stores data affected by the update, to generate an updated database file;
determining a parameter related to a shard that includes the updated database file; and
if the parameter related to the shard exceeds a pre-defined threshold:
identifying updated database files in the shard; and
merging data stored in the updated database files into a single database file in the shard.
2. The method of claim 1, wherein the parameter includes data fragmentation in the shard consequent to application of the data update to the shard.
3. The method of claim 2, wherein the data fragmentation relates to data stored in an extent.
4. The method of claim 1, wherein the parameter includes number of queries handled by the shard.
5. The method of claim 1, wherein the parameter includes number of updates on the shard.
6. The method of claim 1, wherein the applying comprises applying the data update only to an extent that stores data affected by the update.
7. The method of claim 6, further comprising defragmenting the extent if the parameter related to the shard exceeds the pre-defined threshold.
8. A computer system for managing a database, comprising:
a file system object module to shard data stored in a database into a plurality of shards, wherein the database is coupled to a file system;
a data update module to apply, upon receipt of data update for the database, the data update only to an extent that stores data affected by the update, to generate an updated extent;
a determination module to determine a parameter related to a shard that includes the updated extent; and
a merge module to:
identify updated extents in the shard; and
merge data stored in the updated extents into a single database file in the shard, if the parameter related to the shard exceeds a pre-defined threshold.
9. The system of claim 8, wherein the parameter includes one of data fragmentation in the shard consequent to application of the data update to an extent in the shard, number of queries handled by the shard, and number of data updates on the shard.
10. The system of claim 8, wherein the determination module is a fragmentation counter to track fragmentation of data in the shard consequent to application of the data update to the extent in the shard.
11. The system of claim 8, wherein the merge module is to defragment the shard if the parameter related to the shard exceeds the pre-defined threshold.
12. The system of claim 8, wherein a separate pre-defined threshold is defined for the parameter for each of the plurality of shards.
13. A non-transitory machine-readable storage medium comprising instructions for managing a database, the instructions executable by a processor to:
shard a database into a plurality of shards, wherein the database is coupled to a file system;
apply, upon receipt of data update for the database, the data update only to a database file that stores data affected by the update to generate an updated database file;
determine a parameter related to a shard that includes the updated database file; and
identify updated database files in the shard, and merge data stored in the updated database files into a single database file in the shard, if the parameter related to the shard exceeds a pre-defined threshold.
14. The storage medium of claim 13, wherein the database is a distributed database.
15. The storage medium of claim 13, wherein the instructions to shard comprise instructions to shard the database based on a function of a Persistent Object Identifier (POID) of a file system object in the file system.
PCT/US2015/022064 2015-01-20 2015-03-23 Database management WO2016118176A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN294/CHE/2015 2015-01-20
IN294CH2015 2015-01-20

Publications (1)

Publication Number Publication Date
WO2016118176A1 WO2016118176A1 (en) 2016-07-28

Family

ID=56417545

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/022064 WO2016118176A1 (en) 2015-01-20 2015-03-23 Database management

Country Status (1)

Country Link
WO (1) WO2016118176A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020116573A1 (en) * 2001-01-31 2002-08-22 Stephen Gold Data reading and protection
US20080077762A1 (en) * 2006-09-27 2008-03-27 Network Appliance, Inc. Method and apparatus for defragmentation
US20100070715A1 (en) * 2008-09-18 2010-03-18 Waltermann Rod D Apparatus, system and method for storage cache deduplication
US20140012814A1 (en) * 2012-07-06 2014-01-09 Box, Inc. System and method for performing shard migration to support functions of a cloud-based service
US8909887B1 (en) * 2012-09-25 2014-12-09 Emc Corporation Selective defragmentation based on IO hot spots

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111033471B (en) * 2017-06-30 2023-05-16 伊夫塔奇·舒尔曼 Method, system and medium for controlling only additional file
CN112235332A (en) * 2019-07-15 2021-01-15 北京京东尚科信息技术有限公司 Read-write switching method and device for cluster
CN112235332B (en) * 2019-07-15 2024-01-16 北京京东尚科信息技术有限公司 Method and device for switching reading and writing of clusters

Similar Documents

Publication Publication Date Title
US11797498B2 (en) Systems and methods of database tenant migration
US10416919B1 (en) Integrated hierarchical storage movement
US10853242B2 (en) Deduplication and garbage collection across logical databases
US10860546B2 (en) Translation of source m-node identifier to target m-node identifier
US10268716B2 (en) Enhanced hadoop framework for big-data applications
US8620924B2 (en) Refreshing a full-text search index in a partitioned database
US20180157674A1 (en) Distributed nfs metadata server
US9910906B2 (en) Data synchronization using redundancy detection
US20140108475A1 (en) Migration-destination file server and file system migration method
US20150248443A1 (en) Hierarchical host-based storage
US10417181B2 (en) Using location addressed storage as content addressed storage
US20160380840A1 (en) Data synchronization using redundancy detection
EP3669262B1 (en) Thin provisioning virtual desktop infrastructure virtual machines in cloud environments without thin clone support
US10650013B2 (en) Access operation request management
EP2778921B1 (en) A method and a system for distributed processing of a dataset
Cruz et al. A scalable file based data store for forensic analysis
WO2016118176A1 (en) Database management
WO2015187187A1 (en) Journal events in a file system and a database
US10148662B1 (en) De-duplication of access control lists
US8700583B1 (en) Dynamic tiermaps for large online databases
US20210342301A1 (en) Filesystem managing metadata operations corresponding to a file in another filesystem
US11989159B2 (en) Hybrid snapshot of a global namespace
US11016933B2 (en) Handling weakening of hash functions by using epochs
US11016946B1 (en) Method and apparatus for processing object metadata
KR101638727B1 (en) Cluster system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15879200

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15879200

Country of ref document: EP

Kind code of ref document: A1