WO2016175880A1 - Merging incoming data in a database - Google Patents

Merging incoming data in a database

Info

Publication number
WO2016175880A1
WO2016175880A1 PCT/US2015/047662 US2015047662W WO2016175880A1 WO 2016175880 A1 WO2016175880 A1 WO 2016175880A1 US 2015047662 W US2015047662 W US 2015047662W WO 2016175880 A1 WO2016175880 A1 WO 2016175880A1
Authority
WO
WIPO (PCT)
Prior art keywords
extent
data
authority file
file
extents
Prior art date
Application number
PCT/US2015/047662
Other languages
French (fr)
Inventor
Annmary Justin KOOMTHANAM
Joaquim GOMES DA COSTA EULALIO DE SOUZA
Hugo KIEHL
Hamilton De Freitas Coutinho
Michael J. Spitzer
Rajkumar Kannan
Kiran Kumar Malle Gowda
Jothivelavan SIVASHANMUGAM
Ramesh Kannan KARUPPUSAMY
Kimberly Keeton
Charles B. Morrey, Iii
Lucas Holz Boffo
Rafael Anton EICHELBERGER
Original Assignee
Hewlett Packard Enterprise Development Lp
Application filed by Hewlett Packard Enterprise Development Lp

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218: Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227: Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries

Definitions

  • Non-relational databases, such as NoSQL databases, generally employ a pipeline architecture that processes incoming data through multiple stages before storing the incoming data.
  • The multiple stages may include an ingest stage, a sort stage, and a merge stage.
  • FIG. 1 illustrates components of a data merging system, according to an example implementation of the present subject matter.
  • FIG. 2 illustrates a network implementation of the data merging system, according to another example implementation of the present subject matter.
  • FIG. 3 illustrates a schematic diagram depicting the extent based merge, according to another example implementation of the present subject matter.
  • FIG. 4 illustrates a method for merging incoming data in a database, according to an example implementation of the present subject matter.
  • FIG. 5 illustrates a method for merging incoming data in a database, according to another example of the present subject matter.
  • FIG. 6 illustrates a computer readable medium storing instructions for merging incoming data in a database, according to an example implementation of the present subject matter.
  • Non-relational databases can store a fully materialized view of data in files referred to as authority files.
  • a materialized view may be understood as a database object that may include the results of a query.
  • a fully materialized view may be understood as a materialized view of the database with the complete data of the database, where the data may be sorted as per various fields based on the query to which the materialized view pertains.
  • the non-relational databases can include a plurality of authority files, where each authority file includes the complete data of the database, but is sorted as per different queries.
  • an authority file may store the data in the form of extents and can be indexed based on the extents for search and retrieval of data.
  • An extent may include a group of contiguous blocks of data stored as rows.
  • the non-relational databases employ a pipelined architecture in which incoming rows are passed through ingest, sort, and merge stages before the incoming data is stored in the database.
  • In the ingest stage and the sort stage, the incoming data is received and sorted.
  • In the merge stage, the sorted data is merged with existing data of the NoSQL database.
  • the incoming data may be merged in accordance with a data format of the NoSQL database.
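  • The three stages can be pictured with a minimal sketch. It assumes in-memory lists and dictionaries in place of real storage, and the function names `ingest`, `sort_batch`, and `merge` are illustrative only; they do not come from the described system.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str]  # (primary key, value); a simplification of a (key, value) row


def ingest(sources: Iterable[Iterable[Row]]) -> List[Row]:
    """Ingest stage: collect rows from all sources into one unsorted batch."""
    batch: List[Row] = []
    for source in sources:
        batch.extend(source)
    return batch


def sort_batch(batch: List[Row]) -> List[Row]:
    """Sort stage: order the batch by primary key."""
    return sorted(batch, key=lambda row: row[0])


def merge(existing: Dict[int, str], sorted_rows: List[Row]) -> Dict[int, str]:
    """Merge stage: apply the sorted rows to the existing data."""
    merged = dict(existing)
    for key, value in sorted_rows:
        merged[key] = value  # an existing key is updated, a new key is added
    return dict(sorted(merged.items()))


existing = {100: "A", 220: "B"}
sources = [[(220, "X2")], [(140, "X1")]]
result = merge(existing, sort_batch(ingest(sources)))
print(result)  # {100: 'A', 140: 'X1', 220: 'X2'}
```

  • This naive merge rebuilds the whole data set on every call, which corresponds to the full rewrite of the authority file discussed next.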
  • rows of the authority file are first read and then the authority file is rewritten based on the incoming data.
  • the authority file may grow to a large size, such as of the order of a few gigabytes.
  • Rewriting the authority file every time there is an update in the form of incoming data may be resource consuming, as the extents of the authority file that are cold, i.e., not accessed or modified for a long time, are also read and rewritten.
  • When the authority file contains a large number of records, such as of the order of millions of records, rewriting the authority file can be even more time consuming. In certain cases, such a rewrite may take a few hours to complete. Consequently, in situations where the database has multiple authority files, the time spent in rewriting the authority files may increase proportionally.
  • If a query arrives while the authority file is being rewritten, the non-relational database responds to the query based on a previous version of the authority file. This may result in reduced freshness of the results, as the query is answered on the basis of old or stale data. Furthermore, frequent rewrites of the authority files may cause a high amount of data duplication, thereby resulting in a high disk footprint.
  • a data merging system may receive a request for merging the incoming data in an authority file. The incoming data is the data received by the data merging system for being stored in the database.
  • The data merging system may determine where the incoming data is to be stored based on the (key, value) pairs. For example, the data merging system may determine whether an extent of the authority file is to be modified based on the incoming data or whether a new extent is to be created for the incoming data or both actions are to be done, i.e., a part of the incoming data is to be stored in a new extent and another part of the incoming data is to be stored in a modified extent.
  • In an example, to determine whether an extent is to be modified to store the incoming data, the data merging system may compare the incoming data with an index file of the authority file.
  • the index file may include information, such as a range of primary keys stored in each extent and an offset of each extent.
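  • A minimal sketch of such an index lookup follows. It assumes the index is held in memory as a list sorted by key range with non-overlapping ranges; the `IndexEntry` fields and the `find_extent` helper are hypothetical names, and the key ranges and offsets are made-up values.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IndexEntry:
    min_key: int  # smallest primary key stored in the extent
    max_key: int  # largest primary key stored in the extent
    offset: int   # offset of the extent in the authority file


def find_extent(index: List[IndexEntry], key: int) -> Optional[IndexEntry]:
    """Return the entry whose key range covers `key`, or None if none does."""
    max_keys = [entry.max_key for entry in index]
    pos = bisect_left(max_keys, key)  # first extent whose max_key >= key
    if pos < len(index) and index[pos].min_key <= key:
        return index[pos]
    return None


index = [
    IndexEntry(100, 130, 0),
    IndexEntry(150, 250, 4096),
    IndexEntry(300, 400, 8192),
]
print(find_extent(index, 220))  # the entry covering keys 150..250
print(find_extent(index, 140))  # None: 140 lies in the gap between two extents
```

  • A `None` result is the interesting case for a merge: the key belongs to no existing extent, so the row must either join a neighbouring extent or go into a new one.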
  • the data merging system may create a delta authority file to store the incoming data in an updated extent or a new extent or partly in both. Further, upon creation of the delta authority file, a new index file is also created.
  • the new index file includes pointers to the new and updated extents of the delta authority file along with the pointers for unmodified existing extents from the authority file.
  • the delta authority file together with the unmodified existing extents forms the new authority file that can be searched using the new index file.
  • the new authority file can be created and accessed without having to rewrite unmodified extents.
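  • The relationship between the authority file, a delta authority file, and the new index file can be sketched as follows. The sketch assumes in-memory dictionaries stand in for files and an extent identifier stands in for the extent offset; `Pointer`, `lookup`, and the file names are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple

Row = Tuple[int, str]


class Pointer(NamedTuple):
    file_name: str  # file that holds the extent
    extent_id: str  # stands in for the extent offset within that file
    min_key: int
    max_key: int


# Each "file" maps an extent id to its rows; real files would live on disk.
files: Dict[str, Dict[str, List[Row]]] = {
    "authority": {
        "001": [(100, "A"), (130, "B")],
        "002": [(150, "C"), (220, "D")],
    }
}
old_index = [
    Pointer("authority", "001", 100, 130),
    Pointer("authority", "002", 150, 220),
]

# Merge of the update (220, "X2"): only extent 002 is affected, so only that
# extent is rewritten, and it goes into the delta authority file.
files["delta-1"] = {"002": [(150, "C"), (220, "X2")]}
new_index = [
    old_index[0],                         # unmodified extent, pointer reused
    Pointer("delta-1", "002", 150, 220),  # updated extent in the delta file
]


def lookup(index: List[Pointer], key: int) -> str:
    """Resolve a key through an index, whichever file holds the extent."""
    for ptr in index:
        if ptr.min_key <= key <= ptr.max_key:
            return dict(files[ptr.file_name][ptr.extent_id])[key]
    raise KeyError(key)


print(lookup(new_index, 220))  # X2
print(lookup(old_index, 220))  # D
print(lookup(new_index, 100))  # A
```

  • Extent 001 is never read or copied during the merge, and the old index still resolves to the previous data, so both generations remain searchable.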
  • the data merging system may also defragment the extents.
  • Defragmentation refers to consolidation of fragmented contiguous data for better disk utilization and efficiency.
  • the data merging system may identify the extents for defragmentation, based on a plurality of parameters.
  • the plurality of parameters may include a number of valid rows in the extent, a minimum number of rows in the extent, and a maximum number of delta files.
  • Valid rows may be understood as rows that are not marked for deletion. In one example, when the number of valid rows in the extent or the minimum number of rows in the extent becomes less than a pre-defined threshold value, the data merging system may identify those extents for defragmentation.
  • The data merging system of the present subject matter provides for better use of system resources by avoiding a redundant rewrite of an authority file whenever there is a merge request. Further, as the extents having cold data are not read or rewritten, there is no duplication of such extents, and hence the data merging system may facilitate better utilization of disk space. In addition, as the rewrite is performed only for the updated extents or new extents, the merge time of the incoming data is reduced. Therefore, more queries can be answered based on fresh data, and better search results can be provided.
  • The various systems and the methods are further described in conjunction with the following figures. It should be noted that the description and figures merely illustrate the principles of the present subject matter.
  • FIG. 1 illustrates the components of a data merging system 100, according to an example of the present subject matter.
  • the data merging system 100 may be implemented as a computing system, such as a desktop, a laptop, a server, and the like.
  • the data merging system 100 can be implemented in a network environment comprising a variety of network devices, including routers, bridges, servers, computing devices, storage devices, etc. While the data merging system 100 is shown as a separate system, it may be understood that the data merging system 100 can be a part of a computing system implementing a pipeline architecture for management of non-relational databases.
  • the data merging system 100 includes a processor 102.
  • the processor 102 may be implemented as microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any other devices that manipulate signals and data based on computer-readable instructions. Further, functions of the various elements shown in FIG. 1, including functional blocks labeled as “processor(s)”, may be provided through the use of dedicated hardware as well as hardware capable of executing computer-readable instructions.
  • the data merging system 100 may include an updated authority file creation module 104 and a defragmentation module 106, coupled to the processor 102.
  • the updated authority file creation module 104 and the defragmentation module 106 include routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types.
  • The updated authority file creation module 104 and the defragmentation module 106 may also be implemented as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the updated authority file creation module 104 and the defragmentation module 106 can be implemented by hardware, by computer-readable instructions executed by a processing unit, or by a combination thereof.
  • the updated authority file creation module 104 may determine whether an extent is to be modified in an authority file.
  • the incoming data may be understood as data received by the data merging system 100 for being stored in the database, such as a NoSQL database, which stores data in (key, value) format.
  • the incoming data may include update data or new data or a combination of the two.
  • the incoming data may relate to a plurality of rows of data including rows to be updated and new rows to be added. Updating of rows can include modification of data in the rows or deletion of the rows.
  • the rows to be updated can be identified and the extents including those rows can be updated.
  • the new data may include data that will be stored in the database for the first time.
  • the new data may include new rows that can be either merged into an existing extent, thereby creating an updated extent, or a new extent can be created to store the new rows. This is performed based on the (key, value) pair of the new row of data. For example, if the primary key of the new row is less than the smallest primary key of a first extent or more than the largest primary key of a last extent in an authority file, the new extent may be created before the first extent or after the last extent of the authority file.
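  • The placement rule for a new row can be sketched as a classification over the key ranges of the existing extents. The `Placement` names and the `place_new_row` helper are hypothetical, and the key ranges used below are made-up values.

```python
from enum import Enum
from typing import List, Tuple

KeyRange = Tuple[int, int]  # (smallest primary key, largest primary key)


class Placement(Enum):
    NEW_EXTENT_BEFORE_FIRST = "new extent before the first extent"
    NEW_EXTENT_AFTER_LAST = "new extent after the last extent"
    UPDATE_EXISTING = "merge into the extent whose range covers the key"
    BETWEEN_EXTENTS = "merge with a neighbouring extent or create a new one"


def place_new_row(extents: List[KeyRange], key: int) -> Placement:
    """Classify where a new row belongs; `extents` is sorted and non-empty."""
    if key < extents[0][0]:
        return Placement.NEW_EXTENT_BEFORE_FIRST
    if key > extents[-1][1]:
        return Placement.NEW_EXTENT_AFTER_LAST
    if any(low <= key <= high for low, high in extents):
        return Placement.UPDATE_EXISTING
    return Placement.BETWEEN_EXTENTS


ranges = [(100, 130), (150, 250)]
for key in (50, 300, 220, 140):
    print(key, place_new_row(ranges, key).name)
```

  • The text says a new extent "may" be created at either end of the authority file; the sketch treats that as the default for simplicity.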
  • the updated authority file creation module 104 may generate an updated authority file, based on the determination of whether or not an extent is to be modified in an authority file.
  • The updated authority file may store the incoming data either as an updated extent or a new extent or partly in both. It will be understood that more than one extent may be updated or newly created and the term "an" includes one or more.
  • the incoming data may include update of data stored in an existing extent as well as new data.
  • the updated authority file may store the incoming data as the updated extent, the new extent, or a combination of the two.
  • the updated extent may include existing rows as well as new rows that are being merged in the database or previously existing rows that are being updated in the database.
  • the updated authority file creation module 104 may rewrite the unchanged existing rows of the extent and also write the new or updated rows in the same extent to generate the updated extent.
  • the updated authority file creation module 104 may write the new rows in the new extent without writing existing rows.
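  • The difference between generating an updated extent and a new extent can be sketched as below. Representing a deletion as a `None` value is an assumption of this sketch, as are the helper names `rewrite_extent` and `write_new_extent`.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, str]


def rewrite_extent(existing: List[Row],
                   incoming: Dict[int, Optional[str]]) -> List[Row]:
    """Build an updated extent from an existing extent and the incoming rows.

    A value of None marks a row for deletion (an assumption of this sketch).
    """
    rows: Dict[int, str] = dict(existing)  # unchanged rows are carried over
    for key, value in incoming.items():
        if value is None:
            rows.pop(key, None)            # deletion of an existing row
        else:
            rows[key] = value              # modified row or new row
    return sorted(rows.items())


def write_new_extent(incoming: Dict[int, str]) -> List[Row]:
    """Build a new extent from new rows only; no existing rows are read."""
    return sorted(incoming.items())


extent_002 = [(150, "C"), (180, "D"), (220, "E")]
updated = rewrite_extent(extent_002, {140: "X1", 220: "X2", 180: None})
print(updated)                       # [(140, 'X1'), (150, 'C'), (220, 'X2')]
print(write_new_extent({500: "Z"}))  # [(500, 'Z')]
```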
  • The delta authority files include the extents that have been affected by the incoming data and do not include existing extents that have not been affected. Details pertaining to the delta authority files are provided in conjunction with FIGS. 2 and 3. It may be noted that the delta authority files and an existing authority file may together be referred to as the updated authority file.
  • Once the updated authority file is created with the new extents or the updated extents, the data merging system 100 may defragment the extents so created. In an implementation, the defragmentation module 106 may identify the extents from the updated authority files for defragmentation. In an example, the defragmentation module 106 may identify the extents based on a plurality of parameters.
  • The plurality of parameters may include a number of valid rows in an extent, a minimum number of rows in the extent, and a maximum number of delta files supported. In one example, when the number of valid rows in an extent or the minimum number of rows in the extent falls below a pre-defined threshold value, the defragmentation module 106 may identify such extents for defragmentation.
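  • A minimal sketch of how extents might be identified for defragmentation from these parameters follows. The `ExtentStats` fields, the threshold names, and the choice to flag every extent once there are too many delta files are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExtentStats:
    extent_id: str
    total_rows: int
    deleted_rows: int  # rows marked for deletion

    @property
    def valid_rows(self) -> int:
        """Rows that are not marked for deletion."""
        return self.total_rows - self.deleted_rows


def defrag_candidates(extents: List[ExtentStats], delta_file_count: int,
                      min_valid_rows: int, max_delta_files: int) -> List[str]:
    """Return the ids of extents to be identified for defragmentation."""
    if delta_file_count > max_delta_files:
        # Assumption: with too many delta files, every extent is a candidate.
        return [e.extent_id for e in extents]
    return [e.extent_id for e in extents if e.valid_rows < min_valid_rows]


stats = [ExtentStats("001", 100, 10), ExtentStats("002", 100, 95)]
print(defrag_candidates(stats, 2, min_valid_rows=20, max_delta_files=5))
# ['002']
print(defrag_candidates(stats, 6, min_valid_rows=20, max_delta_files=5))
# ['001', '002']
```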
  • FIG. 2 illustrates a network environment 200 including the data merging system 100 according to another example of the present subject matter.
  • the data merging system 100 may be implemented in various computing systems, such as personal computers, servers, etc.
  • the data merging system 100 may be implemented on a network interfaced computing system.
  • the data merging system 100 may communicate with a plurality of devices 202-1, 202-2,... , 202-N over a network 204 and may receive the incoming data from the devices 202.
  • the devices 202-1, 202-2,..., 202-N can be collectively referred to as devices 202 and individually referred to as a device 202 hereinafter.
  • the devices 202 can include, but are not restricted to, desktop computers, laptops, data servers, and the like.
  • the data merging system 100 may initiate the backup of the at least one file stored in the devices 202.
  • the network 204 may be a wired network, a wireless network or a combination of a wired and wireless network.
  • the network 204 can also be a collection of individual networks, which may use different protocols for communication, interconnected with each other. Further, the network 204 can include various network elements, such as gateways, modems, routers; however, such details have been omitted for ease of understanding. In one example, the network 204 may be a private network, such as an enterprise network, or a public network, such as a cloud network, or a hybrid network.
  • the data merging system 100 includes the processor 102 and a memory 206 connected to the processor 102. Among other capabilities, the processor(s) 102 may fetch and execute computer-readable instructions stored in the memory 206.
  • The memory 206, communicatively coupled to the processor 102, can include any non-transitory computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
  • the data merging system 100 also includes interface(s) 208.
  • the interface(s) 208 may include a variety of interfaces, for example, interface(s) 208 for user device(s), such as the devices 202 and network devices of the network 204.
  • the interface(s) 208 may include data input and output devices, referred to as I/O devices.
  • the interface(s) 208 facilitate the communication of the data merging system 100 with various communication and computing devices and various communication networks, such as networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP) and Transmission Control Protocol/Internet Protocol (TCP/IP).
  • the data merging system 100 also includes modules 210.
  • the modules 210 may include a comparison module 212, and other module(s) 214.
  • the other module(s) 214 may include programs or coded instructions that supplement the applications or functions performed by the data merging system 100.
  • The modules 210 may be implemented as described in relation to FIG. 1.
  • the data merging system 100 includes data 216.
  • the data 216 may include incoming data 218, defragmentation data 220, and other data 222.
  • the other data 222 may include data generated and saved by the modules 210 for implementing various functionalities of the data merging system 100.
  • the comparison module 212 may receive a request to merge the incoming data in the authority file.
  • the incoming data may be new data for being added in the database or may be an update of the data already present in the database. In an example, the incoming data may be a combination of both the new data and the update data.
  • the merge request may be received as a part of the pipeline architecture of the non-relational database, such as the NoSQL database.
  • the data is passed through three stages, namely an ingest stage, a sort stage, and a merge stage.
  • the ingest stage of the pipeline architecture collects the incoming data from various sources and segregates them into batches of one or more unsorted data structures.
  • the ingested data may be sorted, for example, on the basis of (key, value) pairs.
  • the sorting of the data may result in creation of searchable tables.
  • The individual searchable tables are combined with one another as well as with existing tables to form the authority files.
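  • Combining sorted searchable tables can be sketched with a k-way merge using `heapq.merge` from the Python standard library. The `combine_tables` helper and the rule that the table listed last wins on duplicate keys are assumptions of this sketch.

```python
import heapq
from typing import List, Tuple

Row = Tuple[int, str]


def combine_tables(tables: List[List[Row]]) -> List[Row]:
    """Combine key-sorted tables into one key-sorted table.

    When a key appears in several tables, the row from the table listed
    last wins (assumption: later tables hold newer data).
    """
    # Tag each row with its table position so equal keys come out oldest first.
    tagged = (
        [(key, position, value) for key, value in table]
        for position, table in enumerate(tables)
    )
    combined: List[Row] = []
    for key, _position, value in heapq.merge(*tagged):
        if combined and combined[-1][0] == key:
            combined[-1] = (key, value)  # the newer row replaces the older one
        else:
            combined.append((key, value))
    return combined


existing_table = [(100, "A"), (220, "B")]
fresh_table = [(140, "X1"), (220, "X2")]
print(combine_tables([existing_table, fresh_table]))
# [(100, 'A'), (140, 'X1'), (220, 'X2')]
```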
  • the authority file represents a fully materialized view of the database and is searchable by the data merging system 100 in response to a user query.
  • the authority file may include at least one extent.
  • An extent may include a group of contiguous blocks of data stored as rows. Accordingly, one authority file may include multiple extents.
  • An example authority data file is depicted in Table 1 below:
  • In the merge stage, the data merging system 100 rewrites the modified extents into a new partial authority file, hereinafter referred to as a delta authority file.
  • The delta authority file may include a subset of the data of the authority file, for example, in the form of the modified extents.
  • the modified extents may include new extents and updated extents, based on the incoming data.
  • a new index file is also created.
  • the new index file may include pointers for multiple delta authority files, along with the names of the delta authority files and the offset of the extent in the delta authority files, and pointers for unmodified extents.
  • the various delta authority files and the authority file together form the updated authority file.
  • the comparison module 212 may compare the incoming data with an index file of the authority file.
  • the index file may include information, such as a range of primary keys stored in each extent and an offset of each extent.
  • the index file may include pointers to the extents of the authority file.
  • An example index file is depicted in Table 2 below:
  • The index file includes the minimum and maximum values of the primary key in each extent in the authority file.
  • The index file also points to the extent offsets of the authority file.
  • The index file therefore helps determine which primary key is stored in which extent.
  • the index file of the present subject matter may include pointers directed to extents of more than one delta authority file.
  • the comparison module 212 determines the extents in which the incoming data is to be merged.
  • the updated authority file creation module 104 may, based on the comparison, determine whether an extent is to be modified.
  • The delta authority file that is created may include the incoming data of that row either in an updated extent or in a new extent.
  • the updated extent may be created when the incoming data is merged with either of the two extents between which the primary key of the incoming data fits.
  • the incoming data may be accommodated in an existing extent for which the extent may be written to a delta authority file, along with the incoming data, as an updated extent.
  • a new extent may be created and the incoming data may be accommodated between the two extents in the new extent containing the primary key for which the data is to be stored.
  • the primary key of the incoming data may fit within the range of primary keys of an existing extent, in which case the incoming data will be accommodated in the extent by rewriting the extent along with the incoming data.
  • the delta authority file may store the incoming data as a new extent.
  • the delta authority file may include a plurality of extents including a combination of updated extents and new extents.
  • the delta authority file along with the previous generation authority file may be referred to as an updated authority file.
  • the previous generation authority file may itself include a delta authority file that was created when an earlier generation authority file was updated.
  • the incoming data may be stored with one of the two extents. Referring to Table 3 below, an example index file is depicted:
  • the index file of Table 3 represents the authority file as depicted in Table 1 above.
  • the incoming data corresponding to that row may be inserted either with extent 001 or with extent 002 or as a new extent between 001 and 002.
  • the determination of whether the incoming data is to be merged with an existing extent and with which existing extent is based on a penalty that may be associated with the merge operation.
  • the penalty may be understood as a loss that may be caused in terms of time consumed or resources utilized in performing an action.
  • the incoming data may be written in a new extent.
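  • A penalty-based choice between merging with a neighbouring extent and creating a new extent might look like the sketch below. Modelling the penalty as the number of existing rows that would be rewritten, and expressing the configurable size limit in rows, are assumptions of this sketch; `choose_target` is a hypothetical helper.

```python
from typing import Dict


def choose_target(left_rows: int, right_rows: int, incoming_rows: int,
                  size_limit: int) -> str:
    """Pick 'left', 'right', or 'new' for rows that fall between two extents.

    The penalty of merging into a neighbour is modelled as the number of
    existing rows that would have to be rewritten (an assumption).
    """
    penalties: Dict[str, int] = {}
    if left_rows + incoming_rows <= size_limit:
        penalties["left"] = left_rows
    if right_rows + incoming_rows <= size_limit:
        penalties["right"] = right_rows
    if not penalties:
        return "new"  # neither neighbour can absorb the rows within the limit
    return min(penalties, key=lambda name: penalties[name])


print(choose_target(10, 4, 1, size_limit=16))   # right
print(choose_target(16, 16, 1, size_limit=16))  # new
print(choose_target(3, 16, 1, size_limit=16))   # left
```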
  • the configurable size limit may be defined by a system administrator. Accordingly, the updated authority file creation module 104 may determine whether, based on the incoming data, the existing extents are to be modified or a new extent is to be created. Table 4 below depicts an example incoming data including two rows of data with corresponding key, value pairs:
  • the incoming data includes a new key, value pair (140, X1) as well as updated value of an existing key (220, X2).
  • the updated authority file creation module 104 may store the incoming data as the incoming data 218.
  • a new index file and a delta authority file may be created as depicted in Tables 5 and 6 respectively:
  • Index entries are included to indicate the new extent having the key 140 and the updated extent, which is written with the updated (key, value) pair of the key 220; the index entries of unchanged extents remain unchanged.
  • The delta authority file as shown in Table 6 includes the changed extents, i.e., the new extent and the updated extent. The delta authority file does not include the unchanged extents.
  • a new index file and a delta authority file may be created as depicted in Tables 7 and 8 respectively:
  • The delta authority file includes the earlier extent 002 with both the new data row having primary key 140 and the updated value corresponding to primary key 220.
  • The incoming data 218 may have multiple rows of data for which extents may be modified, new extents may be created, or a combination of both may be done. Accordingly, the delta authority file that is created may have multiple new and modified extents.
  • the updated authority file creation module 104 may generate an updated authority file.
  • the updated authority file may include at least one delta authority file and the authority file from which the delta authority file was created.
  • a plurality of delta files may be created. Such creation of the plurality of delta files may result in fragmentation of data, thereby resulting in slower query performance.
  • The defragmentation module 106 may help keep the fragmentation within an acceptable boundary.
  • The defragmentation module 106 may identify the extents for defragmentation. In an example, in one merge cycle the extents may be identified for defragmentation, and in the next merge cycle the defragmentation may be performed.
  • In an example, the identification of the extents for defragmentation may be based on a plurality of parameters.
  • The plurality of parameters may include a number of valid rows in the extent, a minimum number of rows in the extent, and a maximum number of delta files created for optimum performance. For example, when the number of valid rows in the extent or the minimum number of rows in the extent falls below a pre-defined threshold value, the defragmentation module 106 may identify the extents for defragmentation. In another example, when the number of delta files created exceeds the pre-defined threshold limit, the defragmentation module 106 may identify the extents for defragmentation.
  • In an implementation, the defragmentation module 106 may store identifiers for the identified extents as the defragmentation data 220.
  • the defragmentation data 220 may be stored by the defragmentation module 106 at the end of the new index file and may be marked as identified for defragmentation.
  • the defragmentation module 106 may read the defragmentation data 220 and the identified extents may be defragmented by the defragmentation module 106.
  • the defragmentation module 106 may also consider an optimum size of an extent, before defragmenting the same. In an example, the optimum size of the extent may be user defined based on various parameters, such as system configuration and frequency of defragmentation desirable.
  • The defragmentation module 106 may not initiate defragmentation while the size of the extent is inside a working interval. On the other hand, when the size of the extent goes outside the working interval, the defragmentation module 106 may decide whether to split the extent into two or to combine the extent with another extent, thereby defragmenting the extent.
  • As can be understood from the above description, by creation of delta authority files at the merge stage, the data merging system 100 may avoid a redundant rewrite of unchanged data whenever an update or a new record is ingested in the database.
  • The data merging system 100 may therefore utilize the system resources optimally and reduce the load on the resources that may be created due to reading or rewriting the cold data. Further, as described above, because only the updated or new extents are written, the rewrite time is reduced and therefore the data merging system 100 answers queries with fresh data. In addition, as the cold data rewrite is avoided, the data merging system 100 prevents duplication of the cold data and helps improve the disk space utilization.
  • Referring now to FIG. 3, a schematic diagram 300 depicting the extent based merge is illustrated, according to an example implementation of the present subject matter. As may be seen, a left block 302 may include an index file 304 that has pointers towards the extents of the delta files 306 that together make up an authority file.
  • the delta files 306 may include a plurality of extents 308, such as eight extents. Further, each extent may include a plurality of rows of data.
  • the index file 304 may be considered as a first or original index file which points towards eight extents 308.
  • the eight extents 308 may be distributed in three delta files 306 instead of being written as one authority file.
  • A middle block 310 illustrates how the index file 304 and the delta files 306 may have to be changed when the left block 302 is affected by incoming data 312. As depicted in FIG. 3, the incoming data 312 affects two existing records as well as adds new data in the index file 304 corresponding to a new extent to be created.
  • the incoming data may relate to a plurality of rows of data including rows to be updated and new rows to be added. Updating of rows can include modification or deletion of rows.
  • the rows to be updated can be identified and the extents including those rows can be updated.
  • the new rows can either be merged into an existing extent, thereby creating an updated extent, or a new extent can be created to store the new rows. This is done based on the (key, value) pair of the new row of data. For example, if the primary key of the new rows is accommodated either at the beginning of the extents or at the end of the extents in an authority file, the new extent may be created before the first extent or after the last extent of the authority file.
  • the new rows may be merged with one of the two extents or a new extent may be created between the two extents.
  • the right block 314 illustrates the result of the merge operation.
  • a new index file 316 is created.
  • a new delta file 318 is created.
  • the new delta file 318 contains the two updated extents as well as a new extent that was created as a result of the new data in the incoming data 312.
  • The extents that were updated by the incoming data 312 become orphans 320 in the delta files 306, as the new index file 316 does not point towards those extents.
  • the new index file 316 points towards the updated extents created in the new delta file 318.
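  • The effect of the merge on the index can be sketched as follows. Only the counts (eight extents in three delta files, two updated extents, one new extent) come from the description of FIG. 3; how the extents are spread over the files, which two are updated, and the file names are made up for illustration.

```python
from typing import Dict, List, Set, Tuple

Pointer = Tuple[str, int]  # (delta file name, extent position in that file)

# Original index: eight extents spread over three delta files.
old_index: List[Pointer] = [
    ("delta-1", 0), ("delta-1", 1), ("delta-1", 2),
    ("delta-2", 0), ("delta-2", 1), ("delta-2", 2),
    ("delta-3", 0), ("delta-3", 1),
]

# Hypothetical merge: two extents are updated and one new extent is added;
# all three are written to the new delta file "delta-4".
replaced: Dict[Pointer, Pointer] = {
    ("delta-1", 1): ("delta-4", 0),
    ("delta-2", 2): ("delta-4", 1),
}
# The new extent is appended here for simplicity; a real index would keep
# its entries in primary key order.
new_index: List[Pointer] = [replaced.get(ptr, ptr) for ptr in old_index]
new_index.append(("delta-4", 2))

# Extents that the new index no longer points to are orphans.
orphans: Set[Pointer] = set(old_index) - set(new_index)
print(sorted(orphans))  # [('delta-1', 1), ('delta-2', 2)]
print(len(new_index))   # 9
```

  • The six untouched extents keep their original pointers, so nothing in the older delta files is read or rewritten during the merge.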
  • The data merging system 100 rewrites the updated extents, writes the new extents, and creates the new index file. The data merging system 100 thereby reduces the time spent in rewriting the data and prevents duplication of data, because it rewrites the affected extents and not the entire authority file.
  • the previous generation authority file and its corresponding index are available to perform the search and provide the search results.
  • As the time taken to merge the data is substantially reduced, instances where the search result is provided based on old data are fewer.
  • FIGS. 4 and 5 illustrate methods 400 and 500 for merging incoming data in a database, according to example implementations of the present subject matter.
  • the order in which the methods 400 and 500 are described is not intended to be construed as a limitation, and some of the described method blocks can be combined in a different order to implement the methods 400 and 500, or an alternative method. Additionally, individual blocks may be deleted from the methods 400 and 500 without departing from the spirit and scope of the subject matter described herein.
  • the methods 400 and 500 may be implemented in a suitable hardware, computer-readable instructions, or combination thereof.
  • the methods 400 and 500 may be performed by either a computing device under the instruction of machine executable instructions stored on a computer readable medium or by dedicated hardware circuits, microcontrollers, or logic circuits.
  • some examples are also intended to cover computer readable medium, for example, digital data storage media, which are machine or computer readable and encode machine-executable or computer-executable instructions, where said instructions perform some or all of the steps of the described methods 400 and 500.
  • a request for merging incoming data in an authority file is received.
  • the authority file represents a fully materialized view of a non-relational database, such as a NoSQL database.
  • the incoming data may include update data or new data or a combination thereof.
  • the comparison module 212 may receive the request for merging the incoming data in the authority file.
  • each authority file may include at least one extent.
  • An extent may include a group of contiguous blocks of data.
  • the updated authority file creation module 104 may determine, based on the request, whether an extent is to be modified in the authority file. [0066] Further, at block 406, based on the determination, a delta authority file is created to store the incoming data as an updated extent, a new extent, or a combination of the two. In an implementation, the updated authority file creation module 104 may create a delta authority file based on the determination. Based on the incoming data, the delta authority file may include at least an updated extent or a new extent. [0067] Referring to FIG. 5, at block 502, a request to merge the incoming data in an authority file is received. The incoming data may include new data or update data or a combination thereof.
  • the authority file represents a fully materialized view of the database and may include a plurality of extents.
  • the request is received by the comparison module 212.
  • the incoming data is compared with an index file of the authority file.
  • the index file may include information, such as a range of primary keys stored in each extent and an offset of the extent.
  • the index file may include pointers to the extents of the authority file.
  • the comparison module 212 may compare the incoming data with the index file.
  • based on the comparison it is determined whether an extent is to be modified in the authority file.
  • the extents of the authority file may either be updated or new extents may be created or a combination of the two may be performed.
  • the updated authority file creation module 104 may determine whether an extent is to be created or updated in the authority file.
  • a delta authority file is created, based on the determination.
  • the delta authority file may store the incoming data as an updated extent or a new extent or a combination of the two.
  • the delta authority file may include the extents that are affected by the incoming data and may not include those extents that have not been modified by the incoming data.
  • the updated authority file creation module 104 may create the delta authority file, based on the determination.
  • extents from various delta authority files are identified for defragmentation.
  • the extents that may be defragmented may be identified, based on a plurality of parameters.
  • the plurality of parameters may include a number of valid rows in an extent, a minimum number of rows in the extent, and a maximum number of delta files for optimum performance. For example, when the number of valid rows in an extent and the minimum number of rows in the extent cross their respective pre-defined threshold values, the extents may be identified for defragmentation.
  • the defragmentation module 106 may identify the extents from the delta authority files for defragmentation.
  • the identified extents may be defragmented.
  • the defragmentation may be initiated at a subsequent request for merging.
  • the defragmentation module 106 may defragment the identified extents by consolidating the data in the identified extents into a lesser number of extents having an optimum number of rows of data to obtain the desired performance.
  • FIG.6 illustrates an example network environment 600 implementing a non-transitory computer readable medium 602 for merging incoming data in a database, according to an example implementation of the present subject matter.
  • the network environment 600 may be a public networking environment or a private networking environment.
  • the network environment 600 includes a processing resource 604 communicatively coupled to the non-transitory computer readable medium 602 through a communication link 606.
  • the processing resource 604 can be a processor of a computing system, such as the data merging system 100.
  • the non-transitory computer readable medium 602 can be, for example, an internal memory device or an external memory device.
  • the communication link 606 may be a direct communication link, such as one formed through a memory read/write interface.
  • the communication link 606 may be an indirect communication link, such as one formed through a network interface.
  • the processing resource 604 can access the non-transitory computer readable medium 602 through a network 608.
  • the network 608 may be a single network or a combination of multiple networks and may use a variety of communication protocols.
  • the processing resource 604 and the non-transitory computer readable medium 602 may also be communicatively coupled to data source 610 over the network 608.
  • the data source 610 can include, for example, databases and computing devices.
  • the data source 610 may be used by the database administrators and other users to communicate with the processing resource 604.
  • the non-transitory computer readable medium 602 includes a set of computer readable instructions, such as the updated authority file creation module 104 and the defragmentation module 106.
  • the set of computer readable instructions can be accessed by the processing resource 604 through the communication link 606 and subsequently executed to perform acts for merging incoming data in a database.
  • the execution of the instructions by the processing resource 604 has been described with reference to various components introduced earlier with reference to description of FIGS.1 and 2.
  • the comparison module 212 may, based on a request to merge incoming data in an authority file, compare the incoming data with an index file of the authority file.
  • the incoming data may include new data, update data, or a combination thereof.
  • the index file may include pointers directing towards the extents of the authority file.
  • the authority file represents a fully materialized view of data in the database and may include at least one extent. The at least one extent may include a group of contiguous blocks of data.
  • the updated authority file creation module 104 may create a delta authority file.
  • the delta authority file may store the incoming data as an updated extent or a new extent or a combination of both.
  • the authority file along with the delta authority file may be considered as an updated authority file.
  • an updated index file is also created.
  • the updated index file may include pointers to the updated extents and the new extents along with the pointers for the unchanged extents.
  • the defragmentation module 106 may defragment the updated authority file based on a plurality of parameters.
  • the plurality of parameters may include a number of valid rows in an extent, a minimum number of rows in the extent, and a maximum number of delta files for optimum performance.
  • the identification of the extents for defragmentation is done in one merge cycle and the defragmentation of the extents is initiated in the next merge cycle.
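As a rough illustration of the deferred defragmentation summarized in the items above, the following Python sketch records the extents identified in one merge cycle, hands them back for defragmentation in the next cycle, and consolidates them into fewer extents. The names DeferredDefragmenter, merge_cycle, consolidate, and target_rows are hypothetical and do not come from the disclosure. The sketch also assumes that the identified extents are adjacent in key order, so that re-chunking keeps the key ranges disjoint.

```python
class DeferredDefragmenter:
    """Hypothetical helper: identify in one merge cycle, defragment in the next."""

    def __init__(self):
        self._pending = set()  # extent ids identified during the previous cycle

    def merge_cycle(self, identified_now):
        """Return extent ids due for defragmentation in this cycle and
        remember the newly identified ones for the following cycle."""
        due = sorted(self._pending)
        # Extents being defragmented now are not queued again.
        self._pending = set(identified_now) - set(due)
        return due


def consolidate(extents, target_rows):
    """Merge the rows of the given extents into fewer extents.

    Each extent is a list of (primary_key, value) rows. Assumes the extents
    are adjacent in key order, so re-chunking keeps key ranges disjoint.
    """
    rows = sorted((row for ext in extents for row in ext), key=lambda r: r[0])
    return [rows[i:i + target_rows] for i in range(0, len(rows), target_rows)]
```

Here target_rows stands in for the "optimum number of rows" per extent; how that value is chosen is not specified in the text and would be a tuning decision.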

Abstract

Upon receipt of incoming data for being merged in an authority file, a delta authority file is created. The authority file represents a fully materialized view of data in a database and may include an extent having a group of contiguous blocks of data. The incoming data may include new data, update data, or a combination thereof. Further, the delta authority file may store the incoming data as an updated extent, a new extent, or a combination thereof.

Description

MERGING INCOMING DATA IN A DATABASE
BACKGROUND
[0001] Non-relational databases, such as NoSQL databases, generally employ a pipeline architecture that processes incoming data through multiple stages before storing the incoming data. The multiple stages may include an ingest stage, a sort stage, and a merge stage.
BRIEF DESCRIPTION OF DRAWINGS
[0002] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figures to reference like features and components:
[0003] FIG. 1 illustrates components of a data merging system, according to an example implementation of the present subject matter.
[0004] FIG. 2 illustrates a network implementation of the data merging system, according to another example implementation of the present subject matter.
[0005] FIG. 3 illustrates a schematic diagram depicting the extent based merge, according to another example implementation of the present subject matter.
[0006] FIG. 4 illustrates a method for merging incoming data in a database, according to an example implementation of the present subject matter.
[0007] FIG. 5 illustrates a method for merging incoming data in a database, according to another example of the present subject matter.
[0008] FIG. 6 illustrates a computer readable medium storing instructions for merging incoming data in a database, according to an example implementation of the present subject matter.
DETAILED DESCRIPTION
[0009] Non-relational databases, such as NoSQL databases, can store a fully materialized view of data in files referred to as authority files. A materialized view may be understood as a database object that may include the results of a query. A fully materialized view may be understood as a materialized view of the database with the complete data of the database, where the data may be sorted as per various fields based on the query to which the materialized view pertains. Further, the non-relational databases can include a plurality of authority files, where each authority file includes the complete data of the database, but is sorted as per different queries.
[0010] In one example, an authority file may store the data in the form of extents and can be indexed based on the extents for search and retrieval of data. An extent may include a group of contiguous blocks of data stored as rows. While storing the data, the non-relational databases employ a pipelined architecture in which incoming rows are passed through ingest, sort, and merge stages before the incoming data is stored in the database. In the ingest stage and the sort stage, the incoming data is received and sorted. During the merge stage, the sorted data is merged with existing data of the NoSQL database. In an example, the incoming data may be merged in accordance with a data format of the NoSQL database.
[0011] During the merge stage, rows of the authority file are first read and then the authority file is rewritten based on the incoming data. In a high volume dataset, the authority file may grow to a large size, such as of the order of a few gigabytes. In such cases, during the merge stage, rewriting the authority file, every time there is an update in the form of incoming data, may be resource consuming as the extents of the authority file that are cold, i.e., not accessed or modified for a long time, are also read and rewritten.
[0012] In cases where the authority file contains a large number of records, such as of the order of millions of records, rewriting the authority file can be further time consuming. In certain cases, such rewrite may take a few hours to get completed. Consequently, in situations where the database has multiple authority files, the time spent in rewriting the authority files may increase proportionally. Moreover, when a query is received during the rewrite, the non-relational database responds to the query based on a previous version of the authority file. This may result in reduced freshness of the results as the query is responded to on the basis of old or stale data. Furthermore, frequent rewrite of the authority files may cause a high amount of data duplication, thereby resulting in high disk footprint.
[0013] According to various examples, systems and methods for merging incoming data in a database, such as a NoSQL database, are disclosed. The present subject matter may include a data merging system. In one example, the data merging system may receive a request for merging the incoming data in an authority file. The incoming data is the data received by the data merging system for being stored in the database. As a non-relational database may store data as key-value pairs, in response to the request, the data merging system may determine where the incoming data is to be stored based on the key-value pairs. For example, the data merging system may determine whether an extent of the authority file is to be modified based on the incoming data or whether a new extent is to be created for the incoming data or both actions are to be done, i.e., a part of the incoming data is to be stored in a new extent and another part of the incoming data is to be stored in a modified extent.
[0014] In an example, to determine whether an extent is to be modified to store the incoming data, the data merging system may compare the incoming data with an index file of the authority file. The index file may include information, such as a range of primary keys stored in each extent and an offset of each extent. Based on the determination, the data merging system may create a delta authority file to store the incoming data in an updated extent or a new extent or partly in both. Further, upon creation of the delta authority file, a new index file is also created. The new index file includes pointers to the new and updated extents of the delta authority file along with the pointers for unmodified existing extents from the authority file. The delta authority file together with the unmodified existing extents forms the new authority file that can be searched using the new index file. Thus, the new authority file can be created and accessed without having to rewrite unmodified extents.
[0015] Further, as there may be a plurality of extents that may be created during each update of the database, the data merging system may also defragment the extents. Defragmentation refers to consolidation of fragmented contiguous data for better disk utilization and efficiency. The data merging system may identify the extents for defragmentation, based on a plurality of parameters. The plurality of parameters may include a number of valid rows in the extent, a minimum number of rows in the extent, and a maximum number of delta files. Valid rows may be understood as rows that are not marked for deletion. In one example, when the number of valid rows in the extent or the minimum number of rows in the extent becomes less than a pre-defined threshold value, the data merging system may identify those extents for defragmentation.
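The identification criteria of paragraph [0015] can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the disclosed implementation: the ExtentStats record, the identify_defrag_candidates function, and the way the maximum number of delta files is checked are all hypothetical, since the text names the parameters without specifying how they are combined.

```python
from dataclasses import dataclass


@dataclass
class ExtentStats:
    """Hypothetical per-extent bookkeeping used only for this sketch."""
    extent_id: str
    delta_file: str    # name of the file that currently holds the extent
    total_rows: int
    deleted_rows: int  # rows marked for deletion

    @property
    def valid_rows(self):
        # Valid rows are rows that are not marked for deletion.
        return self.total_rows - self.deleted_rows


def identify_defrag_candidates(stats, min_valid_rows, max_delta_files):
    """Return (candidate extent ids, whether the delta-file limit is exceeded).

    Assumption: an extent is a candidate when its valid-row count falls
    below min_valid_rows; the file-count check is a separate trigger.
    """
    candidates = [s.extent_id for s in stats if s.valid_rows < min_valid_rows]
    too_many_files = len({s.delta_file for s in stats}) > max_delta_files
    return candidates, too_many_files
```

Both thresholds are assumed to be configurable; the text only states that they are pre-defined.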
[0016] Accordingly, the data merging system of the present subject matter provides for better use of system resources by avoiding redundant rewrite of an authority file whenever there is a merge request. Further, as the extents having cold data are not read or rewritten, there is no duplication of such extents and hence the data merging system may facilitate better utilization of disk space. In addition, as the rewrite is performed for the updated extents or new extents, the merge time of the incoming data is reduced. Therefore, more queries can be responded to based on fresh data, and better search results can be provided.
[0017] The various systems and the methods are further described in conjunction with the following figures. It should be noted that the description and figures merely illustrate the principles of the present subject matter. Further, various arrangements may be devised that, although not explicitly described or shown herein, embody the principles of the present subject matter and are included within its scope.
[0018] The manner in which the systems and the methods for merging incoming data in a database are implemented is explained in detail with respect to FIG. 1, FIG. 2, FIG. 3, FIG. 4, FIG. 5, and FIG. 6. While aspects of described systems and methods for merging incoming data in a database can be implemented in a number of different computing systems, environments, and/or implementations, the examples and implementations are described in the context of the following system(s).
[0019] FIG. 1 illustrates the components of a data merging system 100, according to an example of the present subject matter. In one example, the data merging system 100 may be implemented as a computing system, such as a desktop, a laptop, a server, and the like. In an example, the data merging system 100 can be implemented in a network environment comprising a variety of network devices, including routers, bridges, servers, computing devices, storage devices, etc. While the data merging system 100 is shown as a separate system, it may be understood that the data merging system 100 can be a part of a computing system implementing a pipeline architecture for management of non-relational databases.
[0020] In one implementation, the data merging system 100 includes a processor 102. The processor 102 may be implemented as microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any other devices that manipulate signals and data based on computer-readable instructions. Further, functions of the various elements shown in FIG. 1, including functional blocks labeled as “processor(s)”, may be provided through the use of dedicated hardware as well as hardware capable of executing computer-readable instructions.
[0021] The data merging system 100 may include an updated authority file creation module 104 and a defragmentation module 106, coupled to the processor 102. The updated authority file creation module 104 and the defragmentation module 106, amongst other things, include routines, programs, objects, components, and data structures, which perform particular tasks or implement particular abstract data types. The updated authority file creation module 104 and the defragmentation module 106 may also be implemented as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions. Further, the updated authority file creation module 104 and the defragmentation module 106 can be implemented by hardware, by computer-readable instructions executed by a processing unit, or by a combination thereof.
[0022] Upon receipt of the incoming data, the updated authority file creation module 104 may determine whether an extent is to be modified in an authority file. In an example, the incoming data may be understood as data received by the data merging system 100 for being stored in the database, such as a NoSQL database, which stores data in (key, value) format. The incoming data may include update data or new data or a combination of the two. For example, the incoming data may relate to a plurality of rows of data including rows to be updated and new rows to be added. Updating of rows can include modification of data in the rows or deletion of the rows. Based on the (key, value) pair, the rows to be updated can be identified and the extents including those rows can be updated.
[0023] In addition, the new data may include data that will be stored in the database for the first time. The new data may include new rows that can be either merged into an existing extent, thereby creating an updated extent, or a new extent can be created to store the new rows. This is performed based on the (key, value) pair of the new row of data. For example, if the primary key of the new row is less than the smallest primary key of a first extent or more than the largest primary key of a last extent in an authority file, the new extent may be created before the first extent or after the last extent of the authority file. On the other hand, if the primary key of the new row can be accommodated between two extents, i.e., lies between the largest primary key of one extent and the smallest primary key of the next extent, the new rows may be merged with one of the two extents or a new extent may be created between the two extents.
[0024] In operation, the updated authority file creation module 104 may generate an updated authority file, based on the determination of whether or not an extent is to be modified in an authority file. In an example, the updated authority file may store the incoming data either as an updated extent or a new extent or partly in both. It will be understood that more than one extent may be updated or newly created and the term “an” includes one or more. In one example, the incoming data may include update of data stored in an existing extent as well as new data. In this case, the updated authority file may store the incoming data as the updated extent, the new extent, or a combination of the two. The updated extent may include existing rows as well as new rows that are being merged in the database or previously existing rows that are being updated in the database. The updated authority file creation module 104 may rewrite the unchanged existing rows of the extent and also write the new or updated rows in the same extent to generate the updated extent. In case the incoming data is to be stored in the new extent, the updated authority file creation module 104 may write the new rows in the new extent without writing existing rows.
[0025] In an implementation, the authority file, when updated with the incoming data, results in generation of delta authority files. The delta authority files include the extents that have been affected by the incoming data and do not include existing extents that have not been affected. Details pertaining to the delta authority files are provided in conjunction with FIGS. 2 and 3. It may be noted that the delta authority files and an existing authority file may together be referred to as the updated authority file.
[0026] Once the updated authority file is created with the new extents or the updated extents, the data merging system 100 may defragment the extents so created. In an implementation, the defragmentation module 106 may identify the extents from the updated authority files for defragmentation. In an example, the defragmentation module 106 may identify the extents based on a plurality of parameters. The plurality of parameters may include a number of valid rows in an extent, a minimum number of rows in the extent, and a maximum number of delta files supported. In one example, when the number of valid rows in an extent or the minimum number of rows in the extent crosses a pre-defined threshold value, the defragmentation module 106 may identify such extents for defragmentation.
[0027] FIG. 2 illustrates a network environment 200 including the data merging system 100 according to another example of the present subject matter. As mentioned previously, the data merging system 100 may be implemented in various computing systems, such as personal computers, servers, etc. The data merging system 100 may be implemented on a network interfaced computing system. In one example, the data merging system 100 may communicate with a plurality of devices 202-1, 202-2, …, 202-N over a network 204 and may receive the incoming data from the devices 202. The devices 202-1, 202-2, …, 202-N can be collectively referred to as devices 202 and individually referred to as a device 202 hereinafter. The devices 202 can include, but are not restricted to, desktop computers, laptops, data servers, and the like. In an implementation, the data merging system 100 may initiate the backup of the at least one file stored in the devices 202.
[0028] The network 204 may be a wired network, a wireless network or a combination of a wired and wireless network. The network 204 can also be a collection of individual networks, which may use different protocols for communication, interconnected with each other. Further, the network 204 can include various network elements, such as gateways, modems, routers; however, such details have been omitted for ease of understanding. In one example, the network 204 may be a private network, such as an enterprise network, or a public network, such as a cloud network, or a hybrid network.
[0029] In an implementation, the data merging system 100 includes the processor 102 and a memory 206 connected to the processor 102. Among other capabilities, the processor(s) 102 may fetch and execute computer-readable instructions stored in the memory 206. The memory 206, communicatively coupled to the processor 102, can include any non-transitory computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[0030] The data merging system 100 also includes interface(s) 208. The interface(s) 208 may include a variety of interfaces, for example, interface(s) 208 for user device(s), such as the devices 202 and network devices of the network 204. The interface(s) 208 may include data input and output devices, referred to as I/O devices. The interface(s) 208 facilitate the communication of the data merging system 100 with various communication and computing devices and various communication networks, such as networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP) and Transmission Control Protocol/Internet Protocol (TCP/IP).
[0031] The data merging system 100 also includes modules 210. Further, in addition to the updated authority file creation module 104 and the defragmentation module 106, the modules 210 may include a comparison module 212, and other module(s) 214. The other module(s) 214 may include programs or coded instructions that supplement the applications or functions performed by the data merging system 100. The modules 210 may be implemented as described in relation to FIG. 1.
[0032] In an example, the data merging system 100 includes data 216. The data 216 may include incoming data 218, defragmentation data 220, and other data 222. The other data 222 may include data generated and saved by the modules 210 for implementing various functionalities of the data merging system 100.
[0033] In an example, the comparison module 212 may receive a request to merge the incoming data in the authority file. The incoming data may be new data for being added in the database or may be an update of the data already present in the database. In an example, the incoming data may be a combination of both the new data and the update data. The merge request may be received as a part of the pipeline architecture of the non-relational database, such as the NoSQL database. In the pipeline architecture, the data is passed through three stages, namely an ingest stage, a sort stage, and a merge stage. The ingest stage of the pipeline architecture collects the incoming data from various sources and segregates them into batches of one or more unsorted data structures.
[0034] In the sort stage, the ingested data may be sorted, for example, on the basis of (key, value) pairs. The sorting of the data may result in creation of searchable tables. Further, in the merge stage, the individual searchable tables are combined with one another as well as with existing tables to form the authority files. In an example, the authority file represents a fully materialized view of the database and is searchable by the data merging system 100 in response to a user query. The authority file may include at least one extent. An extent may include a group of contiguous blocks of data stored as rows. Accordingly, one authority file may include multiple extents. An example authority data file is depicted in Table 1 below:
[Table 1: example authority data file; table image not reproduced in this text]
[0035] The data merging system 100, in the merge stage, rewrites the modified extents into a new partial authority file hereinafter referred to as a delta authority file. The delta authority file may include a subset of the data of the authority file, for example, in the form of the modified extents. The modified extents may include new extents and updated extents, based on the incoming data. At the merge stage, a new index file is also created. The new index file may include pointers for multiple delta authority files, along with the names of the delta authority files and the offset of the extent in the delta authority files, and pointers for unmodified extents. The various delta authority files and the authority file together form the updated authority file.
[0036] In one example, upon receiving the merge request, the comparison module 212 may compare the incoming data with an index file of the authority file. The index file may include information, such as a range of primary keys stored in each extent and an offset of each extent. In addition, the index file may include pointers to the extents of the authority file. An example index file is depicted in Table 2 below:
[Table 2: example index file; table image not reproduced in this text]
[0037] As may be seen from Table 2, the index file includes the minimum and maximum value of primary key in each extent in the authority file. The index file also points towards extent offsets of the authority file. The index file therefore facilitates determining which primary key is stored in which extent. The index file of the present subject matter may include pointers directed to extents of more than one delta authority file. From the comparison, the comparison module 212 determines the extents in which the incoming data is to be merged.
[0038] The updated authority file creation module 104 may, based on the comparison, determine whether an extent is to be modified. In an example, when the updated authority file creation module 104 determines that the primary key of a row of incoming data fits between the primary keys of two extents of the authority file, the delta authority file that is created may include the incoming data of that row in either an updated extent or a new extent. For instance, the updated extent may be created when the incoming data is merged with either of the two extents between which the primary key of the incoming data fits. In this case, the incoming data may be accommodated in an existing extent for which the extent may be written to a delta authority file, along with the incoming data, as an updated extent. In another instance, a new extent may be created and the incoming data may be accommodated between the two extents in the new extent containing the primary key for which the data is to be stored.
[0039] In another example, the primary key of the incoming data may fit within the range of primary keys of an existing extent, in which case the incoming data will be accommodated in the extent by rewriting the extent along with the incoming data.
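The comparison against the index file described in paragraphs [0037] to [0039] can be illustrated with a short Python sketch. It is only a sketch under assumptions: the IndexEntry fields, the classify and build_new_index functions, the file names, and the key values are hypothetical and are not taken from the tables of the disclosure. The index is assumed to be sorted by minimum primary key with non-overlapping ranges.

```python
from bisect import bisect_right
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """Hypothetical index row: key range of an extent plus where it is stored."""
    min_key: int
    max_key: int
    file_name: str
    offset: int


def classify(index, key):
    """Decide how a row with the given primary key relates to existing extents.

    Returns ("update", i) if the key falls inside extent i, ("between", i)
    if it lies between extents i and i + 1, or ("new", None) if it lies
    before the first extent or after the last one.
    """
    mins = [e.min_key for e in index]
    pos = bisect_right(mins, key) - 1  # last extent whose min_key <= key
    if pos >= 0 and key <= index[pos].max_key:
        return ("update", pos)
    if pos < 0 or pos == len(index) - 1:
        return ("new", None)
    return ("between", pos)


def build_new_index(old_index, updated, created):
    """Build the new index: rewritten extents point at the delta file,
    untouched extents keep their old pointers, new extents are added."""
    entries = [updated.get(i, e) for i, e in enumerate(old_index)]
    return sorted(entries + list(created), key=lambda e: e.min_key)
```

In the "between" case the caller still has to choose between merging the row into one of the two neighbouring extents and creating a new extent; the sketch deliberately leaves that choice open.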
[0040] In yet another example, when the primary key of the incoming data does not fit between two extents of the authority file or within the primary key range of an existing extent, the delta authority file may store the incoming data as a new extent. [0041] As mentioned above, the delta authority file may include a plurality of extents including a combination of updated extents and new extents. In an example, the delta authority file along with the previous generation authority file may be referred to as an updated authority file. The previous generation authority file may itself include a delta authority file that was created when an earlier generation authority file was updated. [0042] In one example, as discussed, if the primary key of the incoming data fits between two extents of the authority file, the incoming data may be stored with one of the two extents. Referring to Table 3 below, an example index file is depicted:
Figure imgf000014_0001
[0043] As may be seen, the index file of Table 3 represents the authority file as depicted in Table 1 above. When a row of incoming data has a primary key of 140, the incoming data corresponding to that row may be inserted with extent 001, with extent 002, or as a new extent between 001 and 002. In an implementation, the determination of whether the incoming data is to be merged with an existing extent, and with which existing extent, is based on a penalty that may be associated with the merge operation. The penalty may be understood as a loss in terms of the time consumed or the resources utilized in performing an action. For example, if the extent 001 contains 100 rows and the extent 002 contains 50 rows, the penalty of rewriting 100 rows will be higher, in terms of the time taken and resources consumed, than the penalty of rewriting 50 rows.
[0044] In another example, when the existing extents cross the configurable size limit in the authority file, the incoming data may be written in a new extent. The configurable size limit may be defined by a system administrator. Accordingly, the updated authority file creation module 104 may determine whether, based on the incoming data, the existing extents are to be modified or a new extent is to be created. Table 4 below depicts example incoming data including two rows of data with corresponding (key, value) pairs:
Figure imgf000015_0001
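The penalty-based choice described in paragraph [0043], together with the size limit of paragraph [0044], might be sketched in Python as follows. This is only a sketch under stated assumptions: the penalty is taken to be the number of rows that would have to be rewritten, the configurable size limit is expressed as a row count, and the names ExtentStats, rewrite_penalty, and choose_target are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtentStats:
    """Hypothetical per-extent statistics used to estimate the penalty."""
    extent_id: str
    row_count: int


def rewrite_penalty(extent):
    # Assumption: the penalty grows with the number of rows rewritten.
    return extent.row_count


def choose_target(neighbours, max_rows_per_extent):
    """Pick the neighbouring extent that is cheapest to rewrite.

    Returns the chosen extent_id, or None when every neighbour has
    already reached the size limit and a new extent should be created.
    """
    candidates = [e for e in neighbours if e.row_count < max_rows_per_extent]
    if not candidates:
        return None
    return min(candidates, key=rewrite_penalty).extent_id
```

Using the row counts from paragraph [0043], choose_target([ExtentStats("001", 100), ExtentStats("002", 50)], 200) returns "002", because rewriting 50 rows is cheaper than rewriting 100. With a limit of 40 rows, both extents are considered full and the function returns None, which corresponds to writing the incoming row into a new extent.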
[0045] As may be seen from Table 4, the incoming data includes a new key, value pair (140, X1) as well as updated value of an existing key (220, X2). In an example, the updated authority file creation module 104 may store the incoming data as the incoming data 218. [0046] In one example, when the new key value pair in the incoming data 218 is accommodated as a new extent and the existing key value pair is updated, a new index file and a delta authority file may be created as depicted in Tables 5 and 6 respectively:
Figure imgf000016_0002
Figure imgf000016_0001
[0047] As can be seen, the new index file in Table 5 includes index entries to indicate the new extent having the key 140 and the updated extent, which is written with the updated key value pair of the key 220. The index entries of unchanged extents remain unchanged. Further, the delta authority file as shown in Table 6 includes the changed extents, i.e., the new extent and the updated extent. The delta authority file does not include the unchanged extents.
[0048] In another example, when the new data in the incoming data 218 is accommodated in the existing extent 002 in which the update of existing data is also to be accommodated, a new index file and a delta authority file may be created as depicted in Tables 7 and 8 respectively:
Figure imgf000017_0001
Figure imgf000017_0002
[0049] As can be seen, in the new index file, the entry corresponding to earlier extent 002 has changed while the entries corresponding to unchanged extents remain the same. Further, the delta authority file includes the earlier extent 002 with both the new data row having primary key 140 and the updated value of the value corresponding to primary key 220. [0050] It will be understood that the incoming data 218 may have multiple rows of data for which extents may be modified, new extents may be created, or a combination of both may be done. Accordingly the delta authority file that is created may have multiple new and modified extents. [0051] Thus, the updated authority file creation module 104 may generate an updated authority file. In an example, the updated authority file may include at least one delta authority file and the authority file from which the delta authority file was created. [0052] Further, due to changes or updates in the database over a period of time, a plurality of delta files may be created. Such creation of the plurality of delta files may result in fragmentation of data, thereby resulting in slower query performance. The defragmentation module 106 may facilitate in keeping the fragmentation within an acceptable boundary. The defragmentation module 106 may identify the extents for defragmentation. In an example, in one merge cycle the extents may be identified for defragmentation, and in the next merge cycle the defragmentation may be performed. [0053] In an example, the identification of the extents for defragmentation may be based on a plurality of parameters. For example, the plurality of parameters may include a number of valid rows in the extent, a minimum number of rows in the extent, and a maximum number of delta files created for optimum performance. 
For example, when the number of valid rows in the extent or the minimum number of rows in the extent crosses a pre-defined threshold value, the defragmentation module 106 may identify the extents for defragmentation. In another example, when the maximum number of delta files created exceeds the pre-defined threshold limit, the defragmentation module 106 may identify the extents for defragmentation. [0054] In an implementation, the defragmentation module 106 may store identifiers for the identified extents as the defragmentation data 220. In an example, the defragmentation data 220 may be stored by the defragmentation module 106 at the end of the new index file and may be marked as identified for defragmentation. In a next merge cycle, the defragmentation module 106 may read the defragmentation data 220 and the identified extents may be defragmented by the defragmentation module 106. [0055] Further, the defragmentation module 106 may also consider an optimum size of an extent, before defragmenting the same. In an example, the optimum size of the extent may be user defined based on various parameters, such as system configuration and frequency of defragmentation desirable. For example, if a working interval for an extent is from 70% to 130% of the defined extent size, such as, 64KB, the defragmentation module 106 may not initiate defragmentation while the size of the extent is inside this interval. On the other hand, when the size of the extent goes outside the working interval, the defragmentation module 106 may decide whether to split the extent into two or to combine the extent with another extent, thereby defragmenting the extent. [0056] As can be understood from the above description, by creation of delta authority files at the merge stage, the data merging system 100 may avoid redundant rewrite of unchanged data whenever an update or a new record is ingested in the database. 
The data merging system 100 may therefore utilize the system resources optimally and reduce the load on the resources that would otherwise be created by reading or rewriting the cold data. Further, as described above, because only the updated or new extents are written, the rewrite time is reduced and the data merging system 100 can therefore provide fresh output in response to a query. In addition, as the cold data rewrite is avoided, the data merging system 100 prevents duplication of the cold data and helps improve disk space utilization.
[0057] Referring now to FIG. 3, a schematic diagram 300 depicting the extent based merge is illustrated, according to an example implementation of the present subject matter. As may be seen, a left block 302 may include an index file 304 that has pointers towards the extents of the delta files 306 that together make up an authority file. The delta files 306 may include a plurality of extents 308, such as eight extents. Further, each extent may include a plurality of rows of data. The index file 304 may be considered as a first or original index file which points towards eight extents 308. The eight extents 308 may be distributed in three delta files 306 instead of being written as one authority file. Further, a middle block 310 illustrates how the index file 304 and the delta files 306 may have to be changed when the left block 302 is affected by incoming data 312. As depicted in FIG. 3, the incoming data 312 affects two existing records as well as adds new data in the index file 304 corresponding to a new extent to be created.
[0058] In an example, the incoming data may relate to a plurality of rows of data including rows to be updated and new rows to be added. Updating of rows can include modification or deletion of rows. Based on the (key, value) pair, the rows to be updated can be identified and the extents including those rows can be updated.
The new rows can either be merged into an existing extent, thereby creating an updated extent, or a new extent can be created to store the new rows. This is done based on the (key, value) pair of the new row of data. For example, if the primary key of the new rows falls before the first extent or after the last extent in an authority file, the new extent may be created before the first extent or after the last extent of the authority file. On the other hand, if the primary key of the new rows can be accommodated between two extents, the new rows may be merged with one of the two extents or a new extent may be created between the two extents.
[0059] The right block 314 illustrates the result of the merge operation. As a result of the incoming data 312, a new index file 316 is created. Further, a new delta file 318 is created. The new delta file 318 contains the two updated extents as well as a new extent that was created as a result of the new data in the incoming data 312. In an example, the extents that were updated by the incoming data 312 become orphan extents 320 in the delta files 306, as the new index file 316 does not point to those extents. The new index file 316 points to the updated extents created in the new delta file 318.
[0060] Accordingly, the data merging system 100 rewrites the updated extents or writes the new extents and creates the new index file. The data merging system 100 thereby reduces the time spent in rewriting the data and prevents duplication of data by rewriting the affected extents and not the entire authority file.
[0061] Further, in case a search query is received while the data is being merged, the previous generation authority file and its corresponding index are available to perform the search and provide the search results. Moreover, as the time taken to merge the data is substantially reduced, instances where the search result is based on old data are fewer.
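The extent based merge of FIG. 3 can be illustrated with a small in-memory Python sketch. It is not the described implementation: files are modelled as lists of extents, each extent as a dictionary of (key, value) rows, and an extent offset as a list position. The names IndexEntry and merge are hypothetical, and for simplicity every new key that falls outside all existing key ranges is written as its own new extent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index row: key range plus location of the extent."""
    min_key: int
    max_key: int
    file_name: str
    offset: int  # position of the extent inside the file (list index here)


def merge(index, files, incoming, delta_name):
    """Return (new_index, delta_extents) without modifying existing files."""
    touched = {}       # position in the old index -> rewritten rows
    new_extents = []   # rows that fit no existing key range
    for key, value in sorted(incoming.items()):
        for pos, entry in enumerate(index):
            if entry.min_key <= key <= entry.max_key:
                if pos not in touched:
                    # Copy the old extent so the original stays intact.
                    touched[pos] = dict(files[entry.file_name][entry.offset])
                touched[pos][key] = value
                break
        else:
            new_extents.append({key: value})

    delta, new_index = [], []
    for pos, entry in enumerate(index):
        if pos in touched:
            rows = touched[pos]
            # Point the new index at the rewritten extent in the delta file;
            # the old extent is no longer referenced and becomes an orphan.
            new_index.append(IndexEntry(min(rows), max(rows), delta_name, len(delta)))
            delta.append(rows)
        else:
            new_index.append(entry)  # unchanged extents keep their pointer
    for rows in new_extents:
        new_index.append(IndexEntry(min(rows), max(rows), delta_name, len(delta)))
        delta.append(rows)
    new_index.sort(key=lambda e: e.min_key)
    return new_index, delta
```

Because the existing files are never modified, the previous index continues to describe a consistent previous generation while the merge is in progress, which matches the behaviour described in paragraph [0061]. Only the touched and new extents are written to the delta file, so unchanged (cold) extents are not rewritten.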
[0062] FIGS.4 and 5 illustrate methods 400 and 500 for merging incoming data in a database, according to example implementations of the present subject matter. The order in which the methods 400 and 500 are described is not intended to be construed as a limitation, and some of the described method blocks can be combined in a different order to implement the methods 400 and 500, or an alternative method. Additionally, individual blocks may be deleted from the methods 400 and 500 without departing from the spirit and scope of the subject matter described herein. Furthermore, the methods 400 and 500 may be implemented in a suitable hardware, computer-readable instructions, or combination thereof.
[0063] The methods 400 and 500 may be performed by either a computing device under the instruction of machine executable instructions stored on a computer readable medium or by dedicated hardware circuits, microcontrollers, or logic circuits. Herein, some examples are also intended to cover computer readable medium, for example, digital data storage media, which are machine or computer readable and encode machine-executable or computer-executable instructions, where said instructions perform some or all of the steps of the described methods 400 and 500.
[0064] With reference to method 400 as depicted in FIG. 4, at block 402, a request for merging incoming data in an authority file is received. In an example, the authority file represents a fully materialized view of a non-relational database, such as a NoSQL database. Further, the incoming data may include update data or new data or a combination thereof. In an implementation, the comparison module 212 may receive the request for merging the incoming data in the authority file.
[0065] At block 404, in response to the request, it is determined whether an extent is to be modified in the authority file. In an example, each authority file may include at least one extent. An extent may include a group of contiguous blocks of data. In an implementation, the updated authority file creation module 104 may determine, based on the request, whether an extent is to be modified in the authority file.
[0066] Further, at block 406, based on the determination, a delta authority file is created to store the incoming data as an updated extent, a new extent, or a combination of the two. In an implementation, the updated authority file creation module 104 may create a delta authority file based on the determination. Based on the incoming data, the delta authority file may include at least an updated extent or a new extent.
[0067] Referring to FIG. 5, at block 502, a request to merge the incoming data in an authority file is received. The incoming data may include new data or update data or a combination thereof. Further, the authority file represents a fully materialized view of the database and may include a plurality of extents. In an implementation, the request is received by the comparison module 212.
[0068] At block 504, the incoming data is compared with an index file of the authority file. The index file may include information, such as a range of primary keys stored in each extent and an offset of the extent.
In addition, the index file may include pointers to the extents of the authority file. In an implementation, the comparison module 212 may compare the incoming data with the index file. [0069] At block 506, based on the comparison it is determined whether an extent is to be modified in the authority file. As the incoming data may include the new data or the update data or the combination thereof, the extents of the authority file may either be updated or new extents may be created or a combination of the two may be performed. In an implementation, the updated authority file creation module 104 may determine whether an extent is to be created or updated in the authority file. [0070] At block 508, a delta authority file is created, based on the determination. The delta authority file may store the incoming data as an updated extent or a new extent or a combination of the two. In an example, the delta authority file may include the extents that are affected by the incoming data and may not include those extents that have not been modified by the incoming data. In an implementation, the updated authority file creation module 104 may create the delta authority file, based on the determination. [0071] At block 510, extents from various delta authority files are identified for defragmentation. As the creation of delta files can result in fragmentation of the data in the database, the extents that may be defragmented may be identified, based on a plurality of parameters. In an example, the plurality of parameters may include a number of valid rows in an extent, a minimum number of rows in the extent, and a maximum number of delta files for optimum performance. For example, when the number of valid rows in an extent and the minimum number of rows in the extent crosses respective pre-defined threshold values, the extents may be identified for defragmentation. 
Also, if the maximum number of delta files supported for desired performance, such as speed of search and retrieval, and disk space utilization, of a system crosses a threshold value, the extents are identified for defragmentation. In an implementation, the defragmentation module 106 may identify the extents from the delta authority files for defragmentation. [0072] At block 512, the identified extents may be defragmented. In an example, the defragmentation may be initiated at a subsequent request for merging. In an implementation, the defragmentation module 106 may defragment the identified extents by consolidating the data in the identified extents into a lesser number of extents having an optimum number of rows of data to obtain the desired performance. In another example, the defragmentation module 106 may divide an extent into two in case the size of the extent crosses a threshold limit as discussed earlier. [0073] FIG.6 illustrates an example network environment 600 implementing a non-transitory computer readable medium 602 for merging incoming data in a database, according to an example implementation of the present subject matter. The network environment 600 may be a public networking environment or a private networking environment. In one implementation, the network environment 600 includes a processing resource 604 communicatively coupled to the non-transitory computer readable medium 602 through a communication link 606.
[0074] For example, the processing resource 604 can be a processor of a computing system, such as the data merging system 100. The non-transitory computer readable medium 602 can be, for example, an internal memory device or an external memory device. In one implementation, the communication link 606 may be a direct communication link, such as one formed through a memory read/write interface. In another implementation, the communication link 606 may be an indirect communication link, such as one formed through a network interface. In such a case, the processing resource 604 can access the non-transitory computer readable medium 602 through a network 608. The network 608 may be a single network or a combination of multiple networks and may use a variety of communication protocols.
[0075] The processing resource 604 and the non-transitory computer readable medium 602 may also be communicatively coupled to data source 610 over the network 608. The data source 610 can include, for example, databases and computing devices. The data source 610 may be used by the database administrators and other users to communicate with the processing resource 604.
[0076] In one implementation, the non-transitory computer readable medium 602 includes a set of computer readable instructions, such as the updated authority file creation module 104 and the defragmentation module 106. The set of computer readable instructions, referred to as instructions hereinafter, can be accessed by the processing resource 604 through the communication link 606 and subsequently executed to perform acts for merging incoming data in a database.
[0077] For discussion purposes, the execution of the instructions by the processing resource 604 has been described with reference to various components introduced earlier with reference to the description of FIGS. 1 and 2.
[0078] In an example, on execution by the processing resource 604, the comparison module 212 may, based on a request to merge incoming data in an authority file, compare the incoming data with an index file of the authority file. The incoming data may include new data, update data, or a combination thereof. The index file may include pointers to the extents of the authority file. Further, the authority file represents a fully materialized view of data in the database and may include at least one extent. The at least one extent may include a group of contiguous blocks of data.
[0079] Based on the comparison, the updated authority file creation module 104 may create a delta authority file. The delta authority file may store the incoming data as an updated extent or a new extent or a combination of both. In addition, the authority file along with the delta authority file may be considered as an updated authority file. In an implementation, whenever the delta authority file is created, an updated index file is also created. The updated index file may include pointers to the updated extents and the new extents along with the pointers for the unchanged extents.
[0080] As the delta authority files may result in the fragmentation of the data into multiple extents, the defragmentation module 106 may defragment the updated authority file based on a plurality of parameters. In an example, the plurality of parameters may include a number of valid rows in an extent, a minimum number of rows in the extent, and a maximum number of delta files for optimum performance. In an example, the identification of the extents for defragmentation is done in one merge cycle and the defragmentation of the extents is initiated in the next merge cycle. [0081] Although implementations of merging incoming data in a database have been described in language specific to structural features and/or methods, it is to be understood that the present subject matter is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed and explained in the context of a few implementations for merging incoming data in a database.
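The two-cycle defragmentation described in paragraphs [0052] to [0055] and [0080] might be sketched in Python as follows. The 64KB extent size and the 70% to 130% working interval come from the example in paragraph [0055]. Everything else is an assumption made for illustration: the function names, the choice to split oversized extents and combine undersized ones, and the use of extent size as the only parameter (the row-count and delta-file-count parameters are omitted).

```python
EXTENT_SIZE = 64 * 1024          # defined extent size from the example (64KB)
LOW_PCT, HIGH_PCT = 70, 130      # working interval, in percent of EXTENT_SIZE


def plan_action(size_bytes, extent_size=EXTENT_SIZE):
    """Decide what to do with one extent based on its size."""
    if size_bytes * 100 < LOW_PCT * extent_size:
        return "combine"   # assumption: undersized extents are combined
    if size_bytes * 100 > HIGH_PCT * extent_size:
        return "split"     # assumption: oversized extents are split in two
    return "keep"          # inside the working interval: leave untouched


def identify(extent_sizes):
    """Return the ids of extents to mark for the next merge cycle."""
    return sorted(eid for eid, size in extent_sizes.items()
                  if plan_action(size) != "keep")


def merge_cycle(marked_previously, extent_sizes):
    """One merge cycle: act on earlier marks, then mark new candidates.

    Returns (actions, marked_now). actions maps extent id to "split" or
    "combine" for extents marked in the previous cycle that still fall
    outside the working interval; marked_now would be stored at the end
    of the new index file for the following cycle.
    """
    actions = {}
    for eid in marked_previously:
        if eid in extent_sizes:
            action = plan_action(extent_sizes[eid])
            if action != "keep":
                actions[eid] = action
    return actions, identify(extent_sizes)
```

In a first cycle, merge_cycle([], sizes) performs no defragmentation and only returns the extents to mark. Passing that list into the next cycle yields the split or combine actions, which mirrors the identify-then-defragment ordering described in the text. The sketch does not actually move any data, so the sizes stay the same between cycles.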

Claims

I/We claim:
1. A method comprising:
receiving a request for merging incoming data in an authority file, wherein the authority file represents a fully materialized view of data in a database and comprises extents, each extent comprising a group of contiguous blocks of data, and wherein the incoming data comprises one of new data, update data, and a combination thereof;
in response to the request, determining whether an extent is to be modified in the authority file; and
creating a delta authority file based on the determining, wherein the delta authority file stores the incoming data as one of an updated extent, a new extent, and a combination thereof.
2. The method as claimed in claim 1, wherein the determining comprises comparing the incoming data with an index file of the authority file, and wherein the index file comprises pointers to the extents of the authority file.
3. The method as claimed in claim 2, wherein the index file further comprises a range of primary keys stored in each extent of the authority file and an offset of each extent.
4. The method as claimed in claim 1 further comprising, when the incoming data is the new data, storing the incoming data by one of: creating the updated extent to store the incoming data with one of the extents of the authority file and creating the new extent with the incoming data.
5. The method as claimed in claim 4, wherein the incoming data is stored with one of the extents of the authority file based on a penalty associated with creating the updated extent.
6. The method as claimed in claim 1, wherein the updated extent is created when the incoming data is the update data.
7. The method as claimed in claim 1 further comprising defragmenting identified extents of the delta authority file, wherein the defragmentation is initiated at a subsequent request for merging.
8. The method as claimed in claim 7, wherein the defragmenting comprises identifying the extents for defragmentation based on a plurality of parameters, wherein the plurality of parameters comprise a number of valid rows in the extent, a minimum number of rows in the extent, and a maximum number of delta files supported.
9. The method as claimed in claim 8, wherein the defragmenting is initiated when one of the number of valid rows in the extent and the minimum number of rows in the extent crosses respective pre-defined threshold values.
10. A data merging system comprises:
a processor;
an updated authority file creation module, executable by the processor, to determine whether an extent is to be modified in an authority file upon receipt of incoming data, wherein the authority file represents a fully materialized view of data in a database and comprises at least one extent, wherein the extent comprises a group of contiguous blocks of data; and based on the determination, generate an updated authority file, wherein the updated authority file stores the incoming data as one of an updated extent, a new extent, and a combination thereof;
a defragmentation module, executable by the processor, to,
identify the extents for defragmentation, wherein the identification is based on a plurality of parameters; and
defragment the identified extents when one of the plurality of parameters crosses respective pre-defined threshold value.
11. The data merging system as claimed in claim 10 further comprises a comparison module, executable by the processor, to,
receive a request to merge the incoming data in the authority file, wherein the authority file comprises at least one extent; and compare the incoming data with an index file of the authority file, wherein the index file comprises pointers to the extents of the authority file.
12. The data merging system as claimed in claim 10, wherein the updated authority file comprises unchanged extents of the authority file and the extents having the incoming data.
13. The data merging system as claimed in claim 10, wherein the plurality of parameters comprise a number of valid rows in the extent, a minimum number of rows in the extent, and a maximum number of delta files supported.
14. A non-transitory computer-readable medium having a set of computer readable instructions that, when executed, cause a processing resource to,
compare incoming data with an index file of a previous generation based on a request to merge the incoming data in an authority file, the index file comprising pointers to extents of the authority file,
wherein the authority file represents a fully materialized view of data in a database and comprises an extent having a group of contiguous blocks of data, the incoming data comprises one of a new data and an update data;
based on the comparison, create a delta authority file, wherein the delta authority file stores the incoming data as one of an updated extent, a new extent, and a combination thereof, and wherein the delta authority file and the authority file together form an updated authority file; and
defragment the delta authority file based on a plurality of parameters, wherein the defragmentation is initiated at a subsequent request for merging.
15. The non-transitory computer-readable medium as claimed in claim 14, wherein an updated index file is created when the delta authority file is created, and wherein the updated index file comprises pointers to the updated extent, the new extent, and the unchanged extents of the authority file.
PCT/US2015/047662 2015-04-29 2015-08-31 Merging incoming data in a database WO2016175880A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN2200/CHE/2015 2015-04-29
IN2200CH2015 2015-04-29

Publications (1)

Publication Number Publication Date
WO2016175880A1 true WO2016175880A1 (en) 2016-11-03

Family

ID=57199630

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/047662 WO2016175880A1 (en) 2015-04-29 2015-08-31 Merging incoming data in a database

Country Status (1)

Country Link
WO (1) WO2016175880A1 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15891008

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15891008

Country of ref document: EP

Kind code of ref document: A1