WO2016118176A1

WO2016118176A1 - Database management

Info

Publication number: WO2016118176A1
Application number: PCT/US2015/022064
Authority: WO
Inventors: Ramesh Kannan KARUPPUSAMY; Annmary Justine KOOMTHANAM; Jothivelavan SIVASHANMUGAM; Rajkumar Kannan; Kiran Kumar MALLE GOWDA
Original assignee: Hewlett Packard Enterprise Development Lp
Priority date: 2015-01-20
Filing date: 2015-03-23
Publication date: 2016-07-28

Abstract

Some examples relate to database management. In an example, data stored in a database may be sharded into a plurality of shards, wherein the database is coupled to a file system. Upon receiving a data update for the database, the data update may be applied only to a database file that stores data affected by the update, to generate an updated database file. A parameter related to a shard that includes the updated database file may be determined. If the parameter related to the shard exceeds a pre-defined threshold, updated database files may be identified in the shard, and data stored in the updated database files may be merged into a single database file in the shard.

Description

DATABASE MANAGEMENT

Background

[001] Databases have become an integral part of modern day computing. Whether it is a small start-up or a large enterprise, organizations may need to deal with a vast amount of data these days, which could range from a few terabytes to multiple petabytes of data. Databases provide a useful way of organizing data. Such data is usually accessed via a database management system that allows entry, storage and retrieval of data from a database.

Brief Description of the Drawings

[002] For a better understanding of the solution, embodiments will now be described, purely by way of example, with reference to the accompanying drawings, in which:

[003] FIG. 1 is a block diagram of a computing environment for managing a database, according to an example;

[004] FIG. 2 is a flowchart of an example method for managing a database; and

[005] FIG. 3 is a block diagram of an example computer system for managing a database.

Detailed Description

[006] Data management is vital to success of an organization. Whether it is a private company, a government undertaking, an educational institution, a hospital, or a new start-up, managing data (for example, customer data, vendor data, patient data, etc.) in an appropriate manner is crucial to existence and growth of an enterprise. Computer databases play a useful role in this regard. A computer database allows an organized collection of data, which may be analyzed, for instance, with the help of a database management system, to derive useful information for a user.

[007] Among various factors, an increase in adoption of technology by various businesses (for example, online ecommerce portals) has led to an explosion of data that may entail management of large databases by database administrators. Managing a large database may be a challenging task. It may be further demanding if the database is coupled or integrated with another computer program (for example, a file system), in other words, if the database is an "embedded database". It may also be challenging if a large database is required to be managed in a distributed environment, in other words, if the database is a "distributed database". A distributed database is a database in which portions of the database are stored on multiple computers within a network. Such computers may be located in the same physical location or may be dispersed over a wider geographical area. A distributed database system thus consists of loosely-coupled sites that share no physical components.

[008] One of the challenges of managing a large database that couples with a file system is that as the number of database rows increases due to an increase in the number of objects on the file system or due to a large number of updates to an existing file object, a large number of database rewrites may be periodically required to keep the database table fresh for addressing queries from clients. In other words, the database file may need to be rewritten every time a new set of updates is to be inserted into the table in order to reflect the latest state of the file system. This may impose a huge burden on the entire system in terms of I/O, memory footprint and CPU resources, since every update may involve a rewrite of the database table, which may lead to a large number of redundant I/O rewrites on the file system.

[009] To address this issue, the present disclosure describes various examples for managing a database. In an example, data stored in a database may be sharded into a plurality of shards, wherein the database is coupled to a file system. Upon receiving a data update for the database, the data update may be applied only to a database file that stores data affected by the update, to generate an updated database file. A parameter related to a shard that includes the updated database file may be determined or tracked. If the parameter related to the shard exceeds a pre-defined threshold, all updated database files may be identified in the shard, and data stored in the updated database files may be merged into a single database file in the shard.

[0010] As used herein, the term "sharding" refers to a form of database partitioning that is used to separate a large database into smaller pieces called database shards or "shards". Data records in shards may typically be spread over multiple devices, for example, computer servers.

[0011] FIG. 1 is a block diagram of a computing environment for managing a database, according to an example. Computing environment 100 may include a computing device 102, a file system 104, and a database 106. The aforementioned components of the computing environment, i.e. 102, 104, and 106, may be in communication with each other, for example, via a computer network 108. Such a computer network 108 may be a wireless or wired network. Computer network 108 may include, for example, a Local Area Network (LAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Storage Area Network (SAN), a Campus Area Network (CAN), or the like. Further, computer network 108 may be a public network (for example, the Internet) or a private network (for example, an intranet).

[0012] Computing device 102 generally represents any type of computing system capable of reading machine-executable instructions. Examples of computing device 102 may include, without limitation, a server, a desktop computer, a notebook computer, a tablet computer, a thin client, a mobile device, a personal digital assistant (PDA), a phablet, and the like.

[0013] File system 104 may be used for entry, storage and retrieval of data from the database. The file system 104 may include one or more file system objects. Some non-limiting examples of a file system object may include a file, a directory, an access control list (ACL), and the like. File system 104 may be a local file system or a scale-out file system such as a shared file system or a network file system. Examples of a shared file system may include a Network Attached Storage (NAS) file system or a cluster file system. Examples of a network file system may include a distributed file system or a distributed parallel file system. File system 104 may communicate with computing device 102 and database 106, for example, via a suitable protocol. Some non-limiting examples of such protocol may include Network File System (NFS) protocol, Common Internet File System (CIFS) protocol, Hyper Text Transfer Protocol (HTTP), File Transfer Protocol (FTP), and the like. In an example, the file system 104 may be an extent-based file system.

[0014] Database 106 may be a repository that stores an organized collection of data. In an example, the database may store data in extents. An "extent" may be defined as a set of contiguous blocks allocated in a database. In an example, database 106 may be a distributed database that provides high query rates and high-throughput updates using a batching process. Database 106 may use a pipelined architecture that provides access to update batches at various points through processing. In an instance, database 106 may be based on a batched update model, which decouples update processing from read-only queries (i.e. query processing task). In this model, the updates may be batched and processed in the background, and do not interfere with the foreground query workload. Database 106 may allow different stages of the updates in the pipeline to be queried independently. Queries that could use slightly out-of-date data may use only the final output of the pipeline, which may correspond to the completely ingested and indexed data. Queries that require even fresher results may access data at any stage in the pipeline.
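The batched update model of paragraph [0014] can be illustrated with a minimal Python sketch. The class name `BatchedStore`, its methods, and the `fresh` flag are hypothetical names chosen for illustration; the sketch assumes a two-stage pipeline (pending batches and a fully ingested index) and is not a description of how database 106 is actually implemented.

```python
class BatchedStore:
    """Toy two-stage pipeline: pending update batches and an ingested index."""

    def __init__(self):
        self.indexed = {}   # final pipeline output: fully ingested data
        self.pending = []   # batches not yet ingested, oldest first

    def submit_batch(self, batch):
        # Updates are queued as a batch; queries on the index are not disturbed.
        self.pending.append(dict(batch))

    def ingest_one(self):
        # Background step: fold the oldest pending batch into the index.
        if self.pending:
            self.indexed.update(self.pending.pop(0))

    def query(self, key, fresh=False):
        if fresh:
            # Fresher results: look through pending batches, newest first.
            for batch in reversed(self.pending):
                if key in batch:
                    return batch[key]
        # Slightly out-of-date results: use only the final pipeline output.
        return self.indexed.get(key)


store = BatchedStore()
store.submit_batch({"file-1": "size=10"})
stale = store.query("file-1")               # not yet ingested
fresh = store.query("file-1", fresh=True)   # visible in the pending stage
store.ingest_one()
after = store.query("file-1")
```

In this sketch a query that tolerates stale data never touches the pending batches, which mirrors the decoupling of update processing from the foreground query workload.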

[0015] In an example, the database 106 may be a metadata database that stores metadata related to unstructured data. Examples of unstructured data may include documents, audio, video, images, files, body of an e-mail message, Web page, or word-processor document. In an example, the database 106 may be integrated into the file system 104.

[0016] In the example of FIG. 1, computing device 102 may include a file system object module 110, a data update module 112, a determination module 114, and a merge module 116. The term "module" may refer to a software component (machine readable instructions), a hardware component or a combination thereof. A module may include, by way of example, components, such as software components, processes, tasks, co-routines, functions, attributes, procedures, drivers, firmware, data, databases, data structures, Application Specific Integrated Circuits (ASIC) and other computing devices. A module may reside on a volatile or non-volatile storage medium and may be configured to interact with a processor of computing device 102.

[0017] File system object module 110 may track the number of file system objects in a file system (for example, 104) coupled to a database (for example, 106). In an example, if the file system object module 110 determines that the number of file system objects in the file system 104 coupled to the database 106 exceeds a pre-defined threshold, the file system object module 110 may shard the data stored in the database into a plurality of shards. In another example, the database may be sharded based on some other criterion. Sharding of the database partitions the data stored therein into smaller databases called database shards or "shards". Upon partition, each of the plurality of shards may store a portion or subset of the data stored in the database 106. In an example, further to partition, database shards may be stored over multiple devices, for example, servers. Such devices may be co-located or spread over a wider geographical region. Further, devices hosting such database shards may be in communication with each other and database 106, for example, via a network. Such a network may be a wired or wireless network, which may be similar to the network described above.

[0018] In an example, the database 106 may be sharded based on a function of Persistent Object Identifier (POID) of a file system object in the file system 104. A hashing function may be used that takes into account the POID of a file system object to generate a hashing index that maps to a fixed number of plurality of shards. As mentioned above, each of such shards may contain a subset of file system objects in the file system 104.
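The POID-based sharding of paragraph [0018] can be sketched as follows. This is a minimal illustration under stated assumptions: POIDs are taken to be non-negative 64-bit integers, SHA-256 stands in for the unspecified hashing function, and `NUM_SHARDS`, `shard_index` and `shard_objects` are hypothetical names not found in the disclosure.

```python
import hashlib

NUM_SHARDS = 8  # assumed fixed shard count; the disclosure gives no number


def shard_index(poid: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a POID to a hashing index in the range [0, num_shards)."""
    # Hash the POID bytes so that sequentially allocated identifiers
    # spread across shards instead of clustering in one shard.
    digest = hashlib.sha256(poid.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_objects(poids, num_shards: int = NUM_SHARDS):
    """Group file system objects (identified by POID) into shards."""
    shards = {i: [] for i in range(num_shards)}
    for poid in poids:
        shards[shard_index(poid, num_shards)].append(poid)
    return shards


shards = shard_objects(range(1000))
```

Because the index depends only on the POID and the fixed shard count, any component can locate the shard of a file system object without consulting a lookup table.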

[0019] Data update module 112 may, upon receipt of a data update for the database, apply the data update only to a database file that stores data affected by the update, to generate an updated database file. In other words, further to sharding of the database into a plurality of shards, if a data update is received for the database, data update module 112 may determine which database file in a shard, among the plurality of shards, stores data that may need to be updated. Upon such determination, data update module 112 may apply the data update to such database file only. In an example, the data update may be applied to an extent(s) in an identified database file. In like manner, data update module 112 may apply data updates to appropriate database files in a shard, thereby leading to a scenario where there may be a plurality of updated database files in a shard.
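The targeted update of paragraph [0019] can be sketched in Python as below. The `DatabaseFile` and `Shard` classes, the `updated` flag, and the choice to place a previously unseen key in a new file are all assumptions made for illustration; the disclosure does not prescribe these data structures.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseFile:
    name: str
    rows: dict = field(default_factory=dict)  # key -> record
    updated: bool = False                     # set once an update touches the file


@dataclass
class Shard:
    files: list = field(default_factory=list)

    def find_file(self, key):
        """Return the database file that stores the given key, if any."""
        for db_file in self.files:
            if key in db_file.rows:
                return db_file
        return None


def apply_update(shard: Shard, key, record) -> DatabaseFile:
    """Apply an update only to the file that stores the affected data."""
    target = shard.find_file(key)
    if target is None:
        # Assumption: a key not stored anywhere goes into a new file.
        target = DatabaseFile(name=f"file-{len(shard.files)}")
        shard.files.append(target)
    target.rows[key] = record
    target.updated = True
    return target


shard = Shard(files=[DatabaseFile("a", {1: "x"}), DatabaseFile("b", {2: "y"})])
touched = apply_update(shard, 2, "z")
```

Only file "b" is rewritten here; file "a" keeps its contents and its `updated` flag stays unset, which is the property that avoids a full table rewrite.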

[0020] Determination module 114 may determine a parameter(s) related to the plurality of shards that are generated upon sharding of the database 106. In an instance, determination module 114 may track a parameter(s) related to a shard that includes an updated database file(s). In an example, the parameter may include the amount of data fragmentation that may occur in a shard if a data update is applied to such shard. Since a database shard may undergo multiple data updates over a period of time (for instance, a number of rows may get updated), data fragmentation may occur in the shard. In an example, if multiple data updates are applied to extents in a shard during a course of time, extent fragmentation may occur in the shard. Determination module 114 may determine or track such data fragmentation in the database shards that are generated upon sharding of the database. Thus, in an instance, determination module 114 may act as a "fragmentation counter" that tracks the amount of data fragmentation in a generated database shard.

[0021] In another example, the parameter may include the number of queries handled by a shard that is generated upon sharding of the database 106. A database shard may handle a number of queries from one or more client systems over a course of time. Determination module 114 may determine or track the number of queries handled by a shard or each of the shards that are generated upon sharding of the database 106.

[0022] In another example, the parameter may include the number of updates that are applied to a database shard. A number of updates may be applied to a shard over a course of time. Determination module 114 may determine or track the number of updates applied to a shard or each of the shards that are generated upon sharding of the database 106. The aforementioned are just some of the non-limiting examples of the parameter related to a database shard which may be determined by determination module 114.
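The three example parameters of paragraphs [0020] to [0022] can be tracked per shard with simple counters, as in the following sketch. The names `Thresholds` and `ShardTracker`, the default limits, and the decision to trigger when any one counter exceeds its limit are assumptions for illustration; the disclosure only requires that a parameter be compared against a pre-defined threshold.

```python
from dataclasses import dataclass, field


@dataclass
class Thresholds:
    # Hypothetical per-shard limits; the disclosure gives no values.
    fragmented_extents: int = 100
    queries: int = 10_000
    updates: int = 1_000


@dataclass
class ShardTracker:
    limits: Thresholds = field(default_factory=Thresholds)
    fragmented_extents: int = 0
    queries: int = 0
    updates: int = 0

    def record_query(self):
        self.queries += 1

    def record_update(self, extents_fragmented: int = 0):
        # Each update counts once and may fragment zero or more extents.
        self.updates += 1
        self.fragmented_extents += extents_fragmented

    def exceeds_threshold(self) -> bool:
        # Assumption: any single parameter over its limit is sufficient.
        return (self.fragmented_extents > self.limits.fragmented_extents
                or self.queries > self.limits.queries
                or self.updates > self.limits.updates)


tracker = ShardTracker(Thresholds(fragmented_extents=2, queries=5, updates=3))
tracker.record_update(extents_fragmented=1)
before = tracker.exceeds_threshold()
tracker.record_update(extents_fragmented=2)
after = tracker.exceeds_threshold()
```

Giving each tracker its own `Thresholds` instance is one way to allow a separate pre-defined threshold per shard.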

[0023] In an example, if determination module 114 determines that a parameter related to a database shard that includes one or more updated database files exceeds a pre-defined threshold, merge module 116 may identify all updated database files in the shard, and merge data stored in the updated database files into a single database file in the shard. In other words, merge module 116 may cause defragmentation of the shard. In an example, if a shard's data is stored in extents, merge module 116 may, in such case, cause defragmentation of the extents in the shard. In other words, merge module 116 may merge data stored in updated extents of a shard into a single database file in the shard. In an example, a separate pre-defined threshold may be defined for a parameter for each of the plurality of shards that are generated upon sharding of the database 106.
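The merge step of paragraph [0023] can be sketched as follows. Database files are modelled as plain dictionaries with `name`, `rows` and `updated` entries, and later files are assumed to win when two updated files hold the same key; both choices, and the function name `merge_updated_files`, are illustrative assumptions rather than details from the disclosure.

```python
def merge_updated_files(files, merged_name="merged"):
    """Merge all updated database files of a shard into a single file.

    Each file is a dict: {"name": str, "rows": dict, "updated": bool}.
    Files that were never updated are left as they are.
    """
    updated = [f for f in files if f["updated"]]
    if len(updated) < 2:
        # Nothing to consolidate; return the shard's files unchanged.
        return list(files)
    merged_rows = {}
    for f in updated:
        # Assumption: on a key conflict, the later file in the list wins.
        merged_rows.update(f["rows"])
    merged = {"name": merged_name, "rows": merged_rows, "updated": False}
    untouched = [f for f in files if not f["updated"]]
    return untouched + [merged]


shard_files = [
    {"name": "a", "rows": {1: "x"}, "updated": True},
    {"name": "b", "rows": {2: "y"}, "updated": False},
    {"name": "c", "rows": {3: "z"}, "updated": True},
]
result = merge_updated_files(shard_files)
```

After the call the shard holds the untouched file "b" and one merged file containing the rows of "a" and "c", so the number of files that must be read for those rows drops from two to one.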

[0024] FIG. 2 is a flowchart of an example method for managing a database. The method 200, which is described below, may at least partially be executed on a computing device 102 of FIG. 1. However, other computing devices may be used as well. At block 202, data stored in a database (for example, 106) may be sharded into a plurality of shards, wherein the database is coupled to a file system (for example, 104). Upon sharding, each of the plurality of shards may store a portion of the data. At block 204, upon receiving a data update for the database 106, the data update may be applied only to a database file that stores data affected by the update, thereby generating an updated database file. At block 206, a parameter related to a shard that includes the updated database file may be determined. At block 208, if the parameter related to the shard exceeds a pre-defined threshold, all those database files that were updated in the shard are identified, and data stored in the updated database files is merged into a single database file in the shard.
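Blocks 202 to 208 of method 200 can be tied together in one compact Python sketch. It is a toy model under several assumptions: keys are integers, sharding uses a simple modulo instead of a POID hash, the tracked parameter is the number of updates per shard, and the class name `ShardedDatabase` and its arguments are hypothetical.

```python
class ShardedDatabase:
    """Toy model: each shard is a list of database files (dicts of rows)."""

    def __init__(self, rows, num_shards=2, files_per_shard=3, update_threshold=2):
        self.update_threshold = update_threshold
        # Block 202: shard the data; each shard stores a portion of the rows.
        self.shards = [[{} for _ in range(files_per_shard)]
                       for _ in range(num_shards)]
        self.updated = [[] for _ in range(num_shards)]  # updated files per shard
        self.update_count = [0] * num_shards            # tracked parameter
        for key, value in rows.items():
            shard = self.shards[key % num_shards]
            shard[(key // num_shards) % files_per_shard][key] = value

    def update(self, key, value):
        s = key % len(self.shards)
        shard = self.shards[s]
        # Block 204: apply the update only to the file that stores the key
        # (assumption: an unknown key goes to the shard's first file).
        target = next((f for f in shard if key in f), shard[0])
        target[key] = value
        if not any(f is target for f in self.updated[s]):
            self.updated[s].append(target)
        # Block 206: determine the parameter for this shard.
        self.update_count[s] += 1
        # Block 208: merge the updated files once the threshold is exceeded.
        if self.update_count[s] > self.update_threshold:
            self._merge(s)

    def _merge(self, s):
        merged = {}
        for f in self.updated[s]:
            merged.update(f)
        untouched = [f for f in self.shards[s]
                     if not any(f is u for u in self.updated[s])]
        self.shards[s] = untouched + [merged]
        self.updated[s] = []
        self.update_count[s] = 0


db = ShardedDatabase({k: f"v{k}" for k in range(12)})
for key in (0, 2, 6):           # three updates, all landing in shard 0
    db.update(key, f"new{key}")
```

With the default threshold of 2, the third update to shard 0 triggers the merge, leaving that shard with one untouched file and one merged file, while shard 1 is not rewritten at all.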

[0025] FIG. 3 is a block diagram of an example system 300 for managing a database. System 300 includes a processor 302 and a machine-readable storage medium 304 communicatively coupled through a system bus. In an example, system 300 may be analogous to computing device 102 of FIG. 1. Processor 302 may be any type of Central Processing Unit (CPU), microprocessor, or processing logic that interprets and executes machine-readable instructions stored in machine-readable storage medium 304. Machine-readable storage medium 304 may be a random access memory (RAM) or another type of dynamic storage device that may store information and machine-readable instructions that may be executed by processor 302. For example, machine-readable storage medium 304 may be Synchronous DRAM (SDRAM), Double Data Rate (DDR), Rambus DRAM (RDRAM), Rambus RAM, etc. or a storage memory media such as a floppy disk, a hard disk, a CD-ROM, a DVD, a pen drive, and the like. In an example, machine-readable storage medium 304 may be a non-transitory machine-readable medium. Machine-readable storage medium 304 may store instructions 306, 308, 310, and 312. In an example, instructions 306 may be executed by processor 302 to shard a database (for example, 106) into a plurality of shards, wherein the database may be coupled to a file system (for example, 104). Upon sharding, each of the plurality of shards may store a subset of data stored in the database 106. Instructions 308 may be executed by processor 302 to apply, upon receipt of a data update, the data update only to a database file that stores data affected by the update, to generate an updated database file. Instructions 310 may be executed by processor 302 to determine a parameter related to a shard that includes the updated database file.
Instructions 312 may be executed by processor 302 to identify updated database files in the shard, and merge data stored in the updated database files into a single database file in the shard, if the parameter related to the shard exceeds a pre-defined threshold.

[0026] For the purpose of simplicity of explanation, the example method of FIG. 2 is shown as executing serially, however it is to be understood and appreciated that the present and other examples are not limited by the illustrated order. The example systems of FIGS. 1 and 3, and method of FIG. 2 may be implemented in the form of a computer program product including computer-executable instructions, such as program code, which may be run on any suitable computing device in conjunction with a suitable operating system (for example, Microsoft Windows, Linux, UNIX, and the like). Embodiments within the scope of the present solution may also include program products comprising non-transitory computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, such computer-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM, magnetic disk storage or other storage devices, or any other medium which can be used to carry or store desired program code in the form of computer-executable instructions and which can be accessed by a general purpose or special purpose computer. The computer readable instructions can also be accessed from memory and executed by a processor.

[0027] It should be noted that the above-described examples of the present solution are for the purpose of illustration only. Although the solution has been described in conjunction with a specific embodiment thereof, numerous modifications may be possible without materially departing from the teachings and advantages of the subject matter described herein. Other substitutions, modifications and changes may be made without departing from the spirit of the present solution. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.

Claims

1. A method of managing a database, comprising:

sharding data stored in a database into a plurality of shards, wherein the database is coupled to a file system;

upon receiving data update for the database, applying the data update only to a database file that stores data affected by the update, to generate an updated database file;

determining a parameter related to a shard that includes the updated database file; and

if the parameter related to the shard exceeds a pre-defined threshold: identifying updated database files in the shard; and

merging data stored in the updated database files into a single database file in the shard.

2. The method of claim 1, wherein the parameter includes data fragmentation in the shard consequent to application of the data update to the shard.

3. The method of claim 2, wherein the data fragmentation relates to data stored in an extent.

4. The method of claim 1, wherein the parameter includes number of queries handled by the shard.

5. The method of claim 1, wherein the parameter includes number of updates on the shard.

6. The method of claim 1, wherein the applying comprises applying the data update only to an extent that stores data affected by the update.

7. The method of claim 6, further comprising defragmenting the extent if the parameter related to the shard exceeds the pre-defined threshold.

8. A computer system for managing a database, comprising:

a file system object module to shard data stored in a database into a plurality of shards, wherein the database is coupled to a file system;

a data update module to apply, upon receipt of data update for the database, the data update only to an extent that stores data affected by the update, to generate an updated extent;

a determination module to determine a parameter related to a shard that includes the updated extent; and

a merge module to:

identify updated extents in the shard; and

merge data stored in the updated extents into a single database file in the shard, if the parameter related to the shard exceeds a predefined threshold.

9. The system of claim 8, wherein the parameter includes one of data fragmentation in the shard consequent to application of the data update to an extent in the shard, number of queries handled by the shard, and number of data updates on the shard.

10. The system of claim 8, wherein the determination module is a fragmentation counter to track fragmentation of data in the shard consequent to application of the data update to the extent in the shard.

11. The system of claim 8, wherein the merge module is to defragment the shard if the parameter related to the shard exceeds the pre-defined threshold.

12. The system of claim 8, wherein a separate pre-defined threshold is defined for the parameter for each of the plurality of shards.

13. A non-transitory machine-readable storage medium comprising instructions for managing a database, the instructions executable by a processor to:

shard a database into a plurality of shards, wherein the database is coupled to a file system;

apply, upon receipt of data update for the database, the data update only to a database file that stores data affected by the update to generate an updated database file;

determine a parameter related to a shard that includes the updated database file; and

identify updated database files in the shard, and merge data stored in the updated database files into a single database file in the shard, if the parameter related to the shard exceeds a pre-defined threshold.

14. The storage medium of claim 13, wherein the database is a distributed database.

15. The storage medium of claim 13, wherein the instructions to shard comprise instructions to shard the database based on a function of Persistent Object Identifier (POID) of a file system object in the file system.