WO2017114263A1 - System, apparatus and method for avoiding data loss on recovery of bulk load - Google Patents

System, apparatus and method for avoiding data loss on recovery of bulk load Download PDF

Info

Publication number
WO2017114263A1
WO2017114263A1 PCT/CN2016/111335 CN2016111335W WO2017114263A1 WO 2017114263 A1 WO2017114263 A1 WO 2017114263A1 CN 2016111335 W CN2016111335 W CN 2016111335W WO 2017114263 A1 WO2017114263 A1 WO 2017114263A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
bulk
database
data
redo
Prior art date
Application number
PCT/CN2016/111335
Other languages
French (fr)
Inventor
Kalyan Sivakumar
Prasanna Venkatesh RAMAMURTHI
Mahesh Kumar BEHERA
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Publication of WO2017114263A1 publication Critical patent/WO2017114263A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1471Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1474Saving, restoring, recovering or retrying in transactions

Definitions

  • the present subject matter described herein in general, relates to bulk loading of databases, and more particularly, to systems, apparatuses and methods for performing an efficient bulk load into a database by avoiding data loss after a bulk load when checkpoint was not successful.
  • Databases are widely used technical means for storing and handling large amounts of data. Due to the increasing complications of databases the efficiency and reliability of the databases are of prime importance. An unexpected crash or inconsistency in a database can result in enormous losses not only because of the time necessary to restore the database from a backup, but in the worst case also in the loss of data.
  • the applications generally use two commonly known techniques of inserting data into the database.
  • One of the technique is using a traditional sequential query language (SQL) model, wherein the SQL insert queries are used for inserting the data.
  • the other technique is using a bulk load utility/operation provided by many of the database vendors, wherein the data is directly loaded from delimiter separated value files, for e.g. comma separated values (csv) .
  • Bulk load or bulk insert is a common mechanism of automating the process of loading a large amount of data into the database.
  • the bulk insert is a process or method provided by a database management system to load multiple rows of data into a database table.
  • the bulk load is widely used technique because most of the databases have tuned the bulk load process/operation to make it fast and efficient by adapting to various operational techniques such as no redo logging, parallel loading, online index update, etc.
  • Redo logging is one of the most crucial structure for recovery operations used in database technologies.
  • the redo log consists of two or more preallocated files that store all changes made to the database as they occur. Every instance of the database has an associated redo log to protect the database in case of an instance failure.
  • redo logging is generally avoided. Avoiding redo makes the bulk load operation faster. Since, we expect a large number of records to be loaded into the database through a bulk load, writing redo for each operation would make it slow and affect the throughput.
  • the data is loaded into the database and upon recovery or crash, the persistence of the data mostly depends on the check pointing of the data loaded.
  • the dependency on checkpoint for bulk load operation is very risky and may lead to data loss, for the reasons that includes but not limited to:
  • the checkpoint could be very slow, based on the number of pages dirtied, so there exists a possibility that all the data pages are dirted during bulk load has not been flushed;
  • a main objective of the present invention is to solve the problem of data loss after bulk load when checkpoint was not successful by avoiding the drawbacks of the conventional techniques as stated in the preceding section.
  • Another objective of the present invention is to provide a system, an apparatus and a method to overcome the loss of data during recovery of databases involving bulk load, where all data was not check pointed.
  • Another objective of the present invention is to provide a system, an apparatus and a method for performing an efficient bulk load into a database by avoiding data loss after a bulk load when checkpoint was not successful.
  • Another objective of the present invention is to a system, an apparatus and a method enables usage of local file system as a temporary cache in case of the csv files for bulk load and database server is in same machine.
  • Another objective of the present invention is to a system, an apparatus and a method enables usage of a distributed file system as a temporary cache of the csv file if the database server and the csv files are in different machines.
  • Yet another objective of the present invention is to provide a system, an apparatus and a method that provides minimalistic logging of the data path of the file during bulk load.
  • Yet another objective of the present invention is to provide a system, an apparatus and a method to enable reloading of the data into the database to be replayed during recovery if needed.
  • Yet another objective of the present invention is to provide a system, an apparatus and a method that enables the system views to support databases’dependency on the csv files.
  • Still another objective of the present invention is to a system, an apparatus and a method that avoids data loss when loading data in a scenario where unexpected crash or urgent shutdown of the machine/database management system (DBMS) is necessary for the database once the bulk load is completed.
  • DBMS machine/database management system
  • the present invention provides a mechanism to avoid data loss during recovery when the checkpoint was not successful i.e., when checkpoint failing, where the failed checkpoint was responsible for flushing the pages of bulk load to disk.
  • the present invention enables writing a redo log, for the bulk load, that contains the details of the file name, file path and other information. The user can keep the file in the location until the recovery operation or bulk load operation no longer requires the file. In case the redo log needs to be replayed, then the bulk load is redone, by the system or server automatically.
  • the present invention provides an apparatus for bulk loading the data into at least one database.
  • the apparatus comprises a processor, and a memory coupled to the processor for executing a plurality of modules present in the memory.
  • the plurality of modules comprises a bulk load module, and a recovery module.
  • the bulk load module is configured to receive at least one file associated information, the file stored in the memory from which the data is to be fetched for bulk loading; generate at least one redo log based on the file received, the redo log contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information; update system view of the file dependency based on the file received for bulk load; ensure persistence of the redo log flushed; initiate the bulk load into the database by allocating at least one fresh data page to store the records from the file received for bulk loading, thereby add the fresh data page allocated to a dirty queue list, wherein the fresh data page is allocated with at least a start marker and/or at least an end marker.
  • the recovery module configured to check consistency by checking the start marker and the end marker to ensure that the fresh data page/shave been flushed into the database, and thereby remove, if all the fresh data page/sare flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or delete, if all the fresh data page/sare not flushed, the fresh data page/sstarting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load.
  • the present invention provides a system for bulk loading the data into at least one database recited in at least one apparatus.
  • the system comprises a processor, and a memory coupled to the processor for executing a plurality of modules present in the memory.
  • the plurality of modules comprises a bulk load module and a recovery module.
  • the bulk load module is configured to transmit at least one file for storage in the apparatus, and thereby trigger the bulk load of the data from the file received into the database; generate at least one redo log based on the file transmitted, and transmit the redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information, to the apparatus; update system view of the file dependency based on the file received for bulk load, this is generally done by having a call back from the redo module based on the REDO type, in which the system view is updated; ensure persistence of the redo log flushed; initiate the bulk load into the database by allocating at least one fresh data page to store the records from the file received for bulk loading, thereby add the fresh data page generated to a dirty queue list, wherein the fresh data page is allocated with at least a start marker and/or at least an end marker.
  • the recovery module is configured to check consistency by checking the start marker and the end marker to ensure that the fresh data page/shave been flushed into the database, and thereby remove, if all the fresh data page/sare flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or delete, if all the fresh data page/sare not flushed, the fresh data page/sstarting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load.
  • the present invention provides a method for bulk loading the data into a database.
  • the method comprises:
  • the redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information;
  • the present invention eliminates the chances of the database data loss by removing the dependency on checkpoints.
  • the present invention has only one copy of the file foo. csv which is machine M. Further, the server in D can access the file foo. csv during recovery as well. Hence need to copy foo.
  • the present invention ensures that there is no explicit redo written for tuple loading.
  • the present invention enables to avoid explicit redo log writing for each tuple in the csv file.
  • the present invention enables a single redo log, thus making it very simple and covers all the data within the file. Further, it may be understood by the person skilled in that art that, the single redo written for the operation, contains many details for the file, that will help the database server to verify that the file has not been changed or tampered during the restart operation. These data could include, but not limited to, file length, CRC, last modified date etc.
  • the database server reads the redo log which contains these integrity check values, and then compares each of the values to those in the file now. If the values on the physical file are same as the one in the redo log, the file is considered genuine and same the one which was loaded earlier. If there is mismatch if even one of the values, the file is rendered tampered and the operation will terminate (or any suitable action could be taken) .
  • the present invention ensures that there is consistency and security check could be ensured to protect corrupt data load on restart.
  • Figure 1 illustrates an apparatus for bulk loading the data into at least one database, in accordance with an embodiment of the present subject matter.
  • Figure 2 illustrates a system for bulk loading the data into at least one database recited in at least one apparatus, in accordance with an embodiment of the present subject matter.
  • Figure 3 illustrates a method for bulk loading the data into a database, in accordance with an embodiment of the present subject matter.
  • Figure 4 illustrates a flow chart during the bulk load operation, in accordance with an embodiment of the present subject matter.
  • FIG. 5 illustrates a recovery flow, in accordance with an embodiment of the present subject matter.
  • the invention can be implemented in numerous ways, as a process, an apparatus, a system, a composition of matter, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links.
  • these implementations, or any other form that the invention may take, may be referred to as techniques.
  • the order of the steps of disclosed processes may be altered within the scope of the invention.
  • the present invention provides the present invention provides a mechanism to avoid data loss during recovery when the checkpoint was not successful i.e., when checkpoint failing, where the failed checkpoint was responsible for flushing the pages of bulk load to disk.
  • the present invention enables writing a redo log, for the bulk load, that contains the details of the file name, file path and other information. The user can keep the file in the location until the recovery operation or bulk load operation no longer requires the file. In case the redo log needs to be replayed, then the bulk load is redone, by the system or server automatically.
  • the present invention provides an apparatus 102 for bulk loading the data into at least one database 114, the apparatus that comprises a processor 104, and a memory 108 coupled to the processor for executing a plurality of modules stored in said memory 108.
  • the apparatus 102 may also be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, and the like. It will be understood that the apparatus 102 may be accessed by multiple users, or applications residing on the apparatus 102. Examples of the apparatus 102 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, sensors, routers, gateways and a workstation. The apparatus 102 are communicatively coupled to other devices or a nodes or apparatuses to form a network (not shown) .
  • the network may be a wireless network, a wired network or a combination thereof.
  • the network can be implemented as one of the different types of networks, such as GSM, CDMA, LTE, UMTS, intranet, local area network (LAN) , wide area network (WAN) , the internet, and the like.
  • the network may either be a dedicated network or a shared network.
  • the shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP) , Transmission Control Protocol/Internet Protocol (TCP/IP) , Wireless Application Protocol (WAP) , and the like, to communicate with one another.
  • HTTP Hypertext Transfer Protocol
  • TCP/IP Transmission Control Protocol/Internet Protocol
  • WAP Wireless Application Protocol
  • the network may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
  • the apparatus 102 may include at least one processor 104, an interface 106, and a memory 108.
  • the at least one processor 104 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions.
  • the at least one processor 104 is configured to fetch and execute computer-readable instructions or modules stored in the memory 108.
  • the interface 106 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like.
  • the I/O interface 106 may allow the apparatus 102 to interact with a user directly. Further, the I/O interface 106 may enable the apparatus 102 to communicate with other devices or a nodes, computing devices, such as web servers and external data servers (not shown) .
  • the I/O interface 106 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, GSM, CDMA, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite.
  • the I/O interface 106 may include one or more ports for connecting a number of devices to one another or to another server.
  • the I/O interface 106 may provide interaction between the user and the apparatus 102 via, a screen provided for the interface 106.
  • the memory 108 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM) , and/or non-volatile memory, such as read only memory (ROM) , erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
  • volatile memory such as static random access memory (SRAM) and dynamic random access memory (DRAM)
  • non-volatile memory such as read only memory (ROM) , erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
  • ROM read only memory
  • erasable programmable ROM erasable programmable ROM
  • flash memories such as compact flash drives, etc.
  • the plurality of modules may include but not limited to at least one bulk load module 110, and at least one recovery module 112.
  • the apparatus 102 comprises at least one bulk load module 110, and at least one recovery module 112.
  • the bulk load module is configured to transmit at least one file for storage in the apparatus, and thereby trigger the bulk load of the data from the file received into the database; generate at least one redo log based on the file transmitted, and transmit the redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information, to the apparatus; update system view of the file dependency based on the file received for bulk load, this is generally done by having a call back from the redo module based on the REDO type, in which the system view is updated; ensure persistence of the redo log flushed; initiate the bulk load into the database by allocating at least one fresh data page to store the records from the file received for bulk loading, thereby add the fresh data page generated to a dirty queue list, wherein the fresh data page is allocated with at least a start marker and/or at
  • the recovery module is configured to check consistency by checking the start marker and the end marker to ensure that the fresh data page/shave been flushed into the database, and thereby remove, if all the fresh data page/sare flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or delete, if all the fresh data page/sare not flushed, the fresh data page/sstarting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load.
  • a system 202 for bulk loading the data into at least one database 114 recited in at least one apparatus 102 comprises a processor 204, and a memory 208 coupled to the processor 204 for executing a plurality of modules present in the memory 208.
  • the plurality of modules comprises a bulk load module 210 and a recovery module 212.
  • the bulk load module is configured to transmit at least one file for storage in the apparatus, and thereby trigger the bulk load of the data from the file received into the database; generate at least one redo log based on the file transmitted, and transmit the redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information, to the apparatus; update system view of the file dependency based on the file received for bulk load, this is generally done by having a call back from the redo module based on the REDO type, in which the system view is updated; ensure persistence of the redo log flushed; initiate the bulk load into the database by allocating at least one fresh data page to store the records from the file received for bulk loading, thereby add the fresh data page generated to a dirty queue list, wherein the fresh data page is allocated with at least a start marker and/or at least an end marker.
  • the recovery module is configured to check consistency by checking the start marker and the end marker to ensure that the fresh data page/shave been flushed into the database, and thereby remove, if all the fresh data page/sare flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or delete, if all the fresh data page/sare not flushed, the fresh data page/sstarting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load.
  • processor 204 the memory 208, the interface 206 may have similar technical details equivalent to the processor 104, the memory 108, and the interface 106 explained in the preceding section.
  • the repetition of the details for the processor 204, the memory 208, and the interface 206 is avoided in this section. However, it may be understood that this should not limit the protection scope of the present invention.
  • the file is at least one comma separated values (csv) file or a file containing the database table.
  • csv comma separated values
  • At least one local file system residing in the apparatus 102 or the system 202 may be configured to operate as a temporary cache in case of the file for bulk load and the database recites in same machine.
  • At least one distributed file system in/or in the apparatus 102 or the system 202, may be configured to operate as a temporary cache in case of the file for bulk load and the database recites in different machines.
  • the bulk load module 110 /210 generates one or more fresh data pages to store the redo data from the redo log, and adds the fresh data pages generated to the dirty queue list at the end of the bulk load, thereby enables the apparatus to achieve an exclusivity in the bulk loading and the check pointing to ensure the bulk load changes are ready for checkpoint at a same instance.
  • the start marker and the end marker for the page/sbelonging to the bulk load are persisted as a part of checkpoint meta-data or at least one data structure, the start marker, the end marker, and the data structure enables to identify the flushing of the page/sduring a recovery operation.
  • start and end markers are like a list of page numbers, however not limited to a list, but any similar data structure.
  • These data pages allocated for storing are managed and tracked by the start and end marker.
  • This start and end marker is also flushed by the checkpoint (already covered in this doc) .
  • the checkpoint (already covered in this doc) .
  • the start and end marker is checked to ensure that all the pages have been flushed to disk and hence deduce that checkpoint was successful. If all pages have been flushed, the dependency on the bulk load file is removed and the system view is updated accordingly.
  • recovery module may assume that the checkpoint was not successfully. In such case recovery module removes the pages (starting from start marker) ; check the consistency (and other security aspect) of the csv file by using the redo data; update the dependency of the csv file, and redo the bulk load.
  • Figure 3 illustrates method for bulk loading the data into a database, in accordance with an embodiment of the present subject matter.
  • the method may be described in the general context of computer executable instructions.
  • computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types.
  • the method may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network.
  • computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
  • the order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method or alternate methods. Additionally, individual blocks may be deleted from the method without departing from the protection scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof. However, for ease of explanation, in the embodiments described below, the method may be considered to be implemented in the above described apparatus 102 and/or the system 202.
  • the file associated information is received for storage in the storage means, and thereby bulk load the data from the file received into the database.
  • the bulk load utility generally may receive the filename as one of the inputs from the user. This filename (or file path) , along with other integrity details are sent to the DBMS.
  • the storage means may be the memory 108 or 208.
  • the redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information.
  • system view of the file dependency is updated based on the file received for bulk load. This is generally done by having a call back from the redo module based on the REDO type, in which the system view is updated;
  • the bulk loading into the database is initiated by allocating at least one fresh data page to store the records from the file received for bulk loading.
  • the fresh data page generated are added to a dirty queue list, wherein said fresh data page is allocated with at least a start marker and/or at least an end marker is added. In one implementation, there more than two fresh data pages are generated and assigned with start marker and/or end marker.
  • recovery module will assume that the checkpoint did not happen successfully. In this case it will remove the pages (starting from start marker) ; check the consistency (and other security aspect) of the csv file by using the redo data; update the dependency of the csv file, and redo the bulk load.
  • one or more fresh data pages are generated to store the redo data from the redo log, and adding the fresh data pages generated to the dirty queue list at the end of the bulk load, thereby achieving exclusivity in the bulk loading and the check pointing to ensure the bulk load changes are ready for checkpoint at a same instance.
  • start and end markers are like a list of page numbers, however not limited to a list, but any similar data structure.
  • These data pages allocated for storing are managed and tracked by the start and end marker.
  • This start and end marker is also flushed by the checkpoint (already covered in this doc) .
  • the checkpoint (already covered in this doc) .
  • the start and end marker is checked to ensure that all the pages have been flushed to disk and hence deduce that checkpoint was successful. If all pages have been flushed, the dependency on the bulk load file is removed and the system view is updated accordingly.
  • recovery module may assume that the checkpoint was not successfully. In such case recovery module removes the pages (starting from start marker) ; check the consistency (and other security aspect) of the csv file by using the redo data; update the dependency of the csv file, and redo the bulk load.
  • FIG 4 illustrates a flow chart during the bulk load operation, in accordance with an embodiment of the present subject matter.
  • the application responsible for generating csv file generates the csv file and stores the csv file in the local or distributed file system, for bulking the csv file into the database. It may be understood by the person skilled in the art that, the above mentioned steps are known in the prior-art.
  • the present invention when the csv file is loaded in the database, there is a possibility the checkpoint was not successful causing data loss during recovery. Thus, to avoid this when checkpoint fails, and where the failed checkpoint was responsible for flushing the pages of bulk load to disk, the present invention write a redo log, for the bulk load, that contains the details of the file name, file path and other information. It may be understood by the person skilled in that art that, the file properties that helps to identify the tampering of file are written, and accordingly updates the system views of the bulk file dependencies.
  • checkpoint may have been triggered in the database due to reasons that may include but not limited to expiry of timeout interval, user triggered checkpoint, checkpoint triggered by another bulk load operation, and the like.
  • the present invention provides a mechanism to change the page allocation strategy and the time when the pages get added to the dirty list.
  • there may be two ways by using which the pages could be used for “putting” the records in a page One is by using use an existing page, which has sufficient space for the record and the other is to allocate a fresh page and use it exclusively for the records of the bulk load.
  • the present invention may use the second one however, shall not be restricted to the usage of the same.
  • a key advantage of the second method is that, the existing pages are kept intact and thus are safe from bulk load changes, and hence may be preferably used. Also, by using fresh pages we get more flexibility to treat the pages as one unit and can tune checkpoint system to handle these pages separately/specially.
  • the present invention enables all the bulk loads to use a fresh data page for storing the tuples.
  • the fresh pages used for bulk load will be added to the dirty list only at the end of the bulk load operation (in one shot) .
  • the present invention enable to achieve the exclusivity that the bulk loading and the checkpoint and all the bulk load changes are ready for checkpoint at the same time.
  • a start and end markers for the set of pages (fresh pages) belonging to a bulk load be persisted are allocated as a part of checkpoint meta-data or any other data structure.
  • the meta data or data structure may be referred to an infrastructure through which the recovery module understands the state of the system when checkpoint happened. The data structure and marker helps to identify if all the pages of bulk load has been flushed, which will be evaluated during recovery.
  • the file consistency and availability is checked if the data needs to be loaded from the csv file.
  • the start and end markers flushed by the checkpoint is used to validate if the checkpoint is successful. If the start and end markers are present, the system view relevant to bulk load is updated, and the bulk data is flushed completely in the database. If the start and end markers are not present, the pages (related to bulk load) from start mark (to end marker) is deleted and the operation is redone.
  • the present invention avoids the data loss when loading data in a scenario where unexpected crash or urgent shutdown of the DBMS is necessary for the database once the bulk load is completed.
  • the present invention eliminate the chances of database data loss by removing the dependency on checkpoint.
  • the present invention enables to achieve no redundancy of data files needed.
  • the present invention enables to achieve no explicit redo written for tuple loading.
  • the present invention enables to achieve consistency and security check could be ensured to protect corrupt data load on restart.
  • the present invention enables the system view or similar interfaces/framework exposed from the server shows the mechanism dependency on the csv file.
  • the present invention enables to remove the files from the said location, or make it in-accessible to the server during the server restart.
  • the present invention enables to replace the file with corrupt data and check for data reload
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the described apparatus embodiment is merely exemplary.
  • the unit division is merely logical function division and may be other division in actual implementation.
  • a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces.
  • the indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
  • the functions When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or a part of the technical solutions may be implemented in a form of a software product.
  • the computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, or a network device) to perform all or a part of the steps of the methods described in the embodiment of the present invention.
  • the foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM) , a random access memory (Random Access Memory, RAM) , a magnetic disk, or an optical disc.
  • program code such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM) , a random access memory (Random Access Memory, RAM) , a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A mechanism to avoid data loss during recovery when the checkpoint was not successful i.e., when checkpoint failing, where the failed checkpoint was responsible for flushing the pages of bulk load to disk is provided. To achieve the avoiding of the data loss during recovery, the above mechanism enables writing a redo log, for the bulk load, that contains the details of the file name, file path and other information. The user can keep the file in the location until the recovery operation or bulk load operation no longer requires the file. In case the redo log needs to be replayed, then the bulk load is redone, by the system or server automatically.

Description

SYSTEM, APPARATUS AND METHOD FOR AVOIDING DATA LOSS ON RECOVERY OF BULK LOAD
This application is a continuation of International Application IN6989/CHE/2015, filed on December 28, 2015. The disclosures of the aforementioned application are hereby incorporated by reference in their entireties.
TECHNICAL FIELD
The present subject matter described herein, in general, relates to bulk loading of databases, and more particularly, to systems, apparatuses and methods for performing an efficient bulk load into a database by avoiding data loss after a bulk load when checkpoint was not successful.
BACKGROUND
Databases are widely used technical means for storing and handling large amounts of data. Due to the increasing complications of databases the efficiency and reliability of the databases are of prime importance. An unexpected crash or inconsistency in a database can result in enormous losses not only because of the time necessary to restore the database from a backup, but in the worst case also in the loss of data.
The applications generally use two commonly known techniques of inserting data into the database. One of the technique is using a traditional sequential query language (SQL) model, wherein the SQL insert queries are used for inserting the data. The other technique is using a bulk load utility/operation provided by many of the database vendors, wherein the data is directly loaded from delimiter separated value files, for e.g. comma separated values (csv) . Bulk load or bulk insert is a common mechanism of automating the process of loading a large amount of data into the database. The bulk insert is a process or method provided by a database management system to load multiple rows of data into a  database table. Out of the two techniques, the bulk load is widely used technique because most of the databases have tuned the bulk load process/operation to make it fast and efficient by adapting to various operational techniques such as no redo logging, parallel loading, online index update, etc.
Redo logging is one of the most crucial structure for recovery operations used in database technologies. The redo log consists of two or more preallocated files that store all changes made to the database as they occur. Every instance of the database has an associated redo log to protect the database in case of an instance failure. However, to make the bulk load operations fast and efficient, redo logging is generally avoided. Avoiding redo makes the bulk load operation faster. Since, we expect a large number of records to be loaded into the database through a bulk load, writing redo for each operation would make it slow and affect the throughput.
Further, in the conventional techniques, the data is loaded into the database and upon recovery or crash, the persistence of the data mostly depends on the check pointing of the data loaded. The dependency on checkpoint for bulk load operation is very risky and may lead to data loss, for the reasons that includes but not limited to:
i. the checkpoint could be very slow, based on the number of pages dirtied, so there exists a possibility that all the data pages are dirted during bulk load has not been flushed;
ii. replay of operations after bulk load cannot be replayed if bulk load is not successfully check pointed;
iii. as redo logs are not written and hence there are no logs to replay during recovery, there is a high possibility that the bulk load operation performed prior to the crash/shutdown cannot be recovered;
iv. in cases where the redo logs are written (and flushed) , the recovery (and in turn the amount of data loss) depends on the redo logs available rather than on the checkpoint. However, in bulk load cases, the redo is not written,  and thus there is a heavy dependency that checkpoint happens for all the dirtied pages (of bulk load) . If the checkpoint does not happen, it will result in loss of data/inconsistency of the data loaded.
SUMMARY
This summary is provided to introduce concepts related to system, apparatus and method for avoiding data loss upon recovery of bulk load, the same are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.
A main objective of the present invention is to solve the problem of data loss after bulk load when checkpoint was not successful by avoiding the drawbacks of the conventional techniques as stated in the preceding section.
Another objective of the present invention is to provide a system, an apparatus and a method to overcome the loss of data during recovery of databases involving bulk load, where all data was not check pointed.
Another objective of the present invention is to provide a system, an apparatus and a method for performing an efficient bulk load into a database by avoiding data loss after a bulk load when checkpoint was not successful.
Another objective of the present invention is to a system, an apparatus and a method enables usage of local file system as a temporary cache in case of the csv files for bulk load and database server is in same machine.
Another objective of the present invention is to a system, an apparatus and a method enables usage of a distributed file system as a temporary  cache of the csv file if the database server and the csv files are in different machines.
Yet another objective of the present invention is to provide a system, an apparatus and a method that provides minimalistic logging of the data path of the file during bulk load.
Yet another objective of the present invention is to provide a system, an apparatus and a method to enable reloading of the data into the database to be replayed during recovery if needed.
Yet another objective of the present invention is to provide a system, an apparatus and a method that enables the system views to support databases’dependency on the csv files.
Still another objective of the present invention is to a system, an apparatus and a method that avoids data loss when loading data in a scenario where unexpected crash or urgent shutdown of the machine/database management system (DBMS) is necessary for the database once the bulk load is completed.
In order to provide a technical solution to the technical problems mentioned in the preceding section, the present invention provides a mechanism to avoid data loss during recovery when the checkpoint was not successful i.e., when checkpoint failing, where the failed checkpoint was responsible for flushing the pages of bulk load to disk. To achieve the avoiding of the data loss during recovery, the present invention enables writing a redo log, for the bulk load, that contains the details of the file name, file path and other information. The user can keep the file in the location until the recovery operation or bulk load operation no longer requires the file. In case the redo log needs to be replayed, then the bulk load is redone, by the system or server automatically.
Accordingly, in one implementation, the present invention provides an apparatus for bulk loading the data into at least one database. The apparatus comprises a processor, and a memory coupled to the processor for executing a plurality of modules present in the memory. The plurality of modules comprises a bulk load module, and a recovery module. The bulk load module is configured to receive at least one file associated information, the file stored in the memory from which the data is to be fetched for bulk loading; generate at least one redo log based on the file received, the redo log contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information; update system view of the file dependency based on the file received for bulk load; ensure persistence of the redo log flushed; initiate the bulk load into the database by allocating at least one fresh data page to store the records from the file received for bulk loading, thereby add the fresh data page allocated to a dirty queue list, wherein the fresh data page is allocated with at least a start marker and/or at least an end marker. The recovery module configured to check consistency by checking the start marker and the end marker to ensure that the fresh data page/shave been flushed into the database, and thereby remove, if all the fresh data page/sare flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or delete, if all the fresh data page/sare not flushed, the fresh data page/sstarting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load.
In one implementation, the present invention provides a system for bulk loading the data into at least one database recited in at least one apparatus. The system comprises a processor, and a memory coupled to the processor for executing a plurality of modules present in the memory. The plurality of modules comprises a bulk load module and a recovery module. The bulk load module is  configured to transmit at least one file for storage in the apparatus, and thereby trigger the bulk load of the data from the file received into the database; generate at least one redo log based on the file transmitted, and transmit the redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information, to the apparatus; update system view of the file dependency based on the file received for bulk load, this is generally done by having a call back from the redo module based on the REDO type, in which the system view is updated; ensure persistence of the redo log flushed; initiate the bulk load into the database by allocating at least one fresh data page to store the records from the file received for bulk loading, thereby add the fresh data page generated to a dirty queue list, wherein the fresh data page is allocated with at least a start marker and/or at least an end marker. The recovery module is configured to check consistency by checking the start marker and the end marker to ensure that the fresh data page/shave been flushed into the database, and thereby remove, if all the fresh data page/sare flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or delete, if all the fresh data page/sare not flushed, the fresh data page/sstarting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load.
In one implementation, the present invention provides a method for bulk loading the data into a database. The method comprises:
● receiving at least one file associated information, the file stored in the memory from which the data is to be fetched for bulk loading;
● generating at least one redo log based on the file received, the redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information;
● updating system view of the file dependency based on the file received for bulk load, this is generally done by having a call back from the redo module based on the REDO type, in which the system view is updated;
● ensuring persistence of the redo log flushed;
● initiating the bulk load into the database by allocating at least one fresh data page to store the records from the file received for bulk loading, thereby
● adding the fresh data page generated to a dirty queue list, wherein said fresh data page is allocated with at least a start marker and/or at least an end marker;
● checking consistency by checking the start marker and the end marker to ensure that the fresh data page/shave been flushed into the database, thereby:
● removing, if all the fresh data page/sare flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or
● delete, if all the fresh data page/sare not flushed, the fresh data page/sstarting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load.
In contrast to the prior-art techniques available, the present invention eliminates the chances of the database data loss by removing the dependency on checkpoints. The present invention ensures that there is no redundancy of data files (the csv file that was loaded by bulk load) needed. For example, if a file foo. csv is to be loaded from a particular machine M (Let database server run in machine D. It is not necessary for the present invention that D==M or D! =M. works in any case) . As compared to the prior-art, the present invention has only one copy of the file foo. csv which is machine M. Further, the server in D can access the file foo. csv during recovery as well. Hence need to copy foo. csv to machine D is avoided. The present invention ensures that there is  no explicit redo written for tuple loading. The present invention enables to avoid explicit redo log writing for each tuple in the csv file. The present invention enables a single redo log, thus making it very simple and covers all the data within the file. Further, it may be understood by the person skilled in that art that, the single redo written for the operation, contains many details for the file, that will help the database server to verify that the file has not been changed or tampered during the restart operation. These data could include, but not limited to, file length, CRC, last modified date etc. During recovery, the database server reads the redo log which contains these integrity check values, and then compares each of the values to those in the file now. If the values on the physical file are same as the one in the redo log, the file is considered genuine and same the one which was loaded earlier. If there is mismatch if even one of the values, the file is rendered tampered and the operation will terminate (or any suitable action could be taken) . The present invention ensures that there is consistency and security check could be ensured to protect corrupt data load on restart.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit (s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer like features and components.
Figure 1 illustrates an apparatus for bulk loading the data into at least one database, in accordance with an embodiment of the present subject matter.
Figure 2 illustrates a system for bulk loading the data into at least one database recited in at least one apparatus, in accordance with an embodiment of the present subject matter.
Figure 3 illustrates a method for bulk loading the data into a database, in accordance with an embodiment of the present subject matter.
Figure 4 illustrates a flow chart during the bulk load operation, in accordance with an embodiment of the present subject matter.
Figure 5 illustrates a recovery flow, in accordance with an embodiment of the present subject matter.
It is to be understood that the attached drawings are for purposes of illustrating the concepts of the invention and may not be to scale.
DETAILED DESCRIPTION OF THE PRESENT INVENTION
The following clearly describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are merely a part rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.
The invention can be implemented in numerous ways, as a process, an apparatus, a system, a composition of matter, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
Systems, apparatuses, and methods for avoiding data loss upon recovery of bulk load are disclosed.
While aspects are described for system, apparatus and method for avoiding data loss upon recovery of bulk load, the present invention may be implemented in any number of different computing systems, environments, and/or configurations, the embodiments are described in the context of the following exemplary systems, apparatus, and methods.
In one implementation, the present invention provides the present invention provides a mechanism to avoid data loss during recovery when the checkpoint was not successful i.e., when checkpoint failing, where the failed checkpoint was responsible for flushing the pages of bulk load to disk. To achieve the avoiding of the data loss during recovery, the present invention enables writing a redo log, for the bulk load, that contains the details of the file name, file path and other information. The user can keep the file in the location until the recovery operation or bulk load operation no longer requires the file. In case the  redo log needs to be replayed, then the bulk load is redone, by the system or server automatically.
Referring now to figure 1, an apparatus for bulk loading the data into at least one database, is illustrated, in accordance with an embodiment of the present subject matter. In one implementation, the present invention provides an apparatus 102 for bulk loading the data into at least one database 114, the apparatus that comprises a processor 104, and a memory 108 coupled to the processor for executing a plurality of modules stored in said memory 108.
Although the present subject matter is explained considering that the present invention is implemented as apparatus 102, it may be understood that the apparatus 102 may also be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, and the like. It will be understood that the apparatus 102 may be accessed by multiple users, or applications residing on the apparatus 102. Examples of the apparatus 102 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, sensors, routers, gateways and a workstation. The apparatus 102 are communicatively coupled to other devices or a nodes or apparatuses to form a network (not shown) .
In one implementation, the network (not shown) may be a wireless network, a wired network or a combination thereof. The network can be implemented as one of the different types of networks, such as GSM, CDMA, LTE, UMTS, intranet, local area network (LAN) , wide area network (WAN) , the internet, and the like. The network may either be a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP) , Transmission Control Protocol/Internet Protocol (TCP/IP) , Wireless Application Protocol (WAP) , and the like, to communicate with one another.  Further the network may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
The apparatus 102 as illustrated in accordance with an embodiment of the present subject matter, may include at least one processor 104, an interface 106, and a memory 108. The at least one processor 104 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the at least one processor 104 is configured to fetch and execute computer-readable instructions or modules stored in the memory 108.
The interface 106 (I/O interface) may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface 106 may allow the apparatus 102 to interact with a user directly. Further, the I/O interface 106 may enable the apparatus 102 to communicate with other devices or a nodes, computing devices, such as web servers and external data servers (not shown) . The I/O interface 106 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, GSM, CDMA, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. The I/O interface 106 may include one or more ports for connecting a number of devices to one another or to another server. The I/O interface 106 may provide interaction between the user and the apparatus 102 via, a screen provided for the interface 106.
The memory 108 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM) , and/or non-volatile memory, such as read only memory (ROM) , erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 108 may include plurality of instructions or modules or applications to perform  various functionalities. The memory 108 includes routines, programs, objects, components, data structures, etc., which perform particular tasks or implement particular abstract data types.
In one implementation, the plurality of modules may include but not limited to at least one bulk load module 110, and at least one recovery module 112.
In one implementation, the apparatus 102 is disclosed. The apparatus 102 comprises at least one bulk load module 110, and at least one recovery module 112. The bulk load module is configured to transmit at least one file for storage in the apparatus, and thereby trigger the bulk load of the data from the file received into the database; generate at least one redo log based on the file transmitted, and transmit the redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information, to the apparatus; update system view of the file dependency based on the file received for bulk load, this is generally done by having a call back from the redo module based on the REDO type, in which the system view is updated; ensure persistence of the redo log flushed; initiate the bulk load into the database by allocating at least one fresh data page to store the records from the file received for bulk loading, thereby add the fresh data page generated to a dirty queue list, wherein the fresh data page is allocated with at least a start marker and/or at least an end marker. The recovery module is configured to check consistency by checking the start marker and the end marker to ensure that the fresh data page/shave been flushed into the database, and thereby remove, if all the fresh data page/sare flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or delete, if all the fresh data page/sare not flushed, the fresh data page/sstarting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load. 
Referring now to figure 2, a system 202 for bulk loading the data into at least one database 114 recited in at least one apparatus 102. The system 202 comprises a processor 204, and a memory 208 coupled to the processor 204 for executing a plurality of modules present in the memory 208. The plurality of modules comprises a bulk load module 210 and a recovery module 212. The bulk load module is configured to transmit at least one file for storage in the apparatus, and thereby trigger the bulk load of the data from the file received into the database; generate at least one redo log based on the file transmitted, and transmit the redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information, to the apparatus; update system view of the file dependency based on the file received for bulk load, this is generally done by having a call back from the redo module based on the REDO type, in which the system view is updated; ensure persistence of the redo log flushed; initiate the bulk load into the database by allocating at least one fresh data page to store the records from the file received for bulk loading, thereby add the fresh data page generated to a dirty queue list, wherein the fresh data page is allocated with at least a start marker and/or at least an end marker. The recovery module is configured to check consistency by checking the start marker and the end marker to ensure that the fresh data page/shave been flushed into the database, and thereby remove, if all the fresh data page/sare flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or delete, if all the fresh data page/sare not flushed, the fresh data page/sstarting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load.
It may be understood by the person skilled in that art that the processor 204, the memory 208, the interface 206 may have similar technical details equivalent to the processor 104, the memory 108, and the interface 106  explained in the preceding section. In order to avoid confusion and complexity in understanding the core invention, the repetition of the details for the processor 204, the memory 208, and the interface 206 is avoided in this section. However, it may be understood that this should not limit the protection scope of the present invention.
In one implementation, the file is at least one comma separated values (csv) file or a file containing the database table. However, it may understood by the person skilled in that art that any existing mechanism of loading data using a file may be used in the present invention.
In one implementation, at least one local file system residing in the apparatus 102 or the system 202 may be configured to operate as a temporary cache in case of the file for bulk load and the database recites in same machine.
In one implementation, at least one distributed file system, in/or in the apparatus 102 or the system 202, may be configured to operate as a temporary cache in case of the file for bulk load and the database recites in different machines.
In one implementation, the bulk load module 110 /210 generates one or more fresh data pages to store the redo data from the redo log, and adds the fresh data pages generated to the dirty queue list at the end of the bulk load, thereby enables the apparatus to achieve an exclusivity in the bulk loading and the check pointing to ensure the bulk load changes are ready for checkpoint at a same instance.
In one implementation, the start marker and the end marker for the page/sbelonging to the bulk load are persisted as a part of checkpoint meta-data or at least one data structure, the start marker, the end marker, and the data structure enables to identify the flushing of the page/sduring a recovery operation. 
The person skilled in the art may understand that the start and end markers are like a list of page numbers, however not limited to a list, but any similar data structure. In one implementation, there may be a requirement of a plurality of data pages to store the records from the bulk load. These data pages allocated for storing are managed and tracked by the start and end marker. This start and end marker is also flushed by the checkpoint (already covered in this doc) . Upon recover operation of the database, when needed/as per settings, the start and end marker is checked to ensure that all the pages have been flushed to disk and hence deduce that checkpoint was successful. If all pages have been flushed, the dependency on the bulk load file is removed and the system view is updated accordingly. However, if all the pages are not flushed or partially flushed, then the recovery module may assume that the checkpoint was not successfully. In such case recovery module removes the pages (starting from start marker) ; check the consistency (and other security aspect) of the csv file by using the redo data; update the dependency of the csv file, and redo the bulk load.
Figure 3 illustrates method for bulk loading the data into a database, in accordance with an embodiment of the present subject matter. The method may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.
The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method or alternate methods. Additionally, individual blocks may be deleted from the method without departing from the protection scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof. However, for ease of explanation, in the embodiments described below, the method may be considered to be implemented in the above described apparatus 102 and/or the system 202.
At block 302, the file associated information is received for storage in the storage means, and thereby bulk load the data from the file received into the database. In one implementation, the bulk load utility generally may receive the filename as one of the inputs from the user. This filename (or file path) , along with other integrity details are sent to the DBMS. In one implementation the storage means may be the  memory  108 or 208.
At block 304, at least one redo log, based on the file received, is generated. The redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information.
At block 306, system view of the file dependency is updated based on the file received for bulk load. This is generally done by having a call back from the redo module based on the REDO type, in which the system view is updated;
At block 308, persistence of the redo log flushed is ensured.
At block 310, the bulk loading into the database is initiated by allocating at least one fresh data page to store the records from the file received for bulk loading.
At block 312, the fresh data page generated are added to a dirty queue list, wherein said fresh data page is allocated with at least a start marker and/or at least an end marker is added. In one implementation, there more than two fresh data pages are generated and assigned with start marker and/or end marker.
At block 314, upon recover, when needed, the start and end marker is checked to ensure that all the pages have been flushed to disk and hence deduce that checkpoint was successful.
At block 316, if all pages have been flushed, then remove the dependency on the bulk load file and update the system view, or alternatively
At block 318, if all the pages part of start and end marker is not flushed, then recovery module will assume that the checkpoint did not happen successfully. In this case it will remove the pages (starting from start marker) ; check the consistency (and other security aspect) of the csv file by using the redo data; update the dependency of the csv file, and redo the bulk load.
In one implementation, one or more fresh data pages are generated to store the redo data from the redo log, and adding the fresh data pages generated to the dirty queue list at the end of the bulk load, thereby achieving exclusivity in the bulk loading and the check pointing to ensure the bulk load changes are ready for checkpoint at a same instance.
The person skilled in the art may understand that the start and end markers are like a list of page numbers, however not limited to a list, but any  similar data structure. In one implementation, there may be a requirement of a plurality of data pages to store the records from the bulk load. These data pages allocated for storing are managed and tracked by the start and end marker. This start and end marker is also flushed by the checkpoint (already covered in this doc) . Upon recover operation of the database, when needed/as per settings, the start and end marker is checked to ensure that all the pages have been flushed to disk and hence deduce that checkpoint was successful. If all pages have been flushed, the dependency on the bulk load file is removed and the system view is updated accordingly. However, if all the pages are not flushed or partially flushed, then the recovery module may assume that the checkpoint was not successfully. In such case recovery module removes the pages (starting from start marker) ; check the consistency (and other security aspect) of the csv file by using the redo data; update the dependency of the csv file, and redo the bulk load.
Referring now to figure 4 illustrates a flow chart during the bulk load operation, in accordance with an embodiment of the present subject matter. In one implementation, as shown in figure 4, the application responsible for generating csv file generates the csv file and stores the csv file in the local or distributed file system, for bulking the csv file into the database. It may be understood by the person skilled in the art that, the above mentioned steps are known in the prior-art.
In accordance with the present invention, when the csv file is loaded in the database, there is a possibility the checkpoint was not successful causing data loss during recovery. Thus, to avoid this when checkpoint fails, and where the failed checkpoint was responsible for flushing the pages of bulk load to disk, the present invention write a redo log, for the bulk load, that contains the details of the file name, file path and other information. It may be understood by the person skilled in that art that, the file properties that helps to identify the tampering of file are written, and accordingly updates the system views of the bulk file dependencies.
In the meanwhile, there is a possibility that checkpoint may have been triggered in the database due to reasons that may include but not limited to expiry of timeout interval, user triggered checkpoint, checkpoint triggered by another bulk load operation, and the like.
Hence, to protect and flush consistent data to disk, the present invention provides a mechanism to change the page allocation strategy and the time when the pages get added to the dirty list. In one implementation, there may be two ways by using which the pages could be used for “putting” the records in a page. One is by using use an existing page, which has sufficient space for the record and the other is to allocate a fresh page and use it exclusively for the records of the bulk load. The present invention may use the second one however, shall not be restricted to the usage of the same. A key advantage of the second method is that, the existing pages are kept intact and thus are safe from bulk load changes, and hence may be preferably used. Also, by using fresh pages we get more flexibility to treat the pages as one unit and can tune checkpoint system to handle these pages separately/specially. The present invention enables all the bulk loads to use a fresh data page for storing the tuples. The fresh pages used for bulk load will be added to the dirty list only at the end of the bulk load operation (in one shot) . By doing this, the present invention enable to achieve the exclusivity that the bulk loading and the checkpoint and all the bulk load changes are ready for checkpoint at the same time.
In one implementation, during checkpoint operation, to achieve an efficient bulk loading, all the pages (fresh pages) of a bulk load must be flushed to the disk as a single atomic unit. To achieve this, according to the present invention, a start and end markers for the set of pages (fresh pages) belonging to a bulk load be persisted are allocated as a part of checkpoint meta-data or any other data structure. In one implementation, the meta data or data structure may be referred to an infrastructure through which the recovery module understands the state of  the system when checkpoint happened. The data structure and marker helps to identify if all the pages of bulk load has been flushed, which will be evaluated during recovery.
Referring now to figure 5 illustrates a recovery flow, in accordance with an embodiment of the present subject matter. In one implementation, the file consistency and availability is checked if the data needs to be loaded from the csv file. According to the present invention, the start and end markers flushed by the checkpoint, is used to validate if the checkpoint is successful. If the start and end markers are present, the system view relevant to bulk load is updated, and the bulk data is flushed completely in the database. If the start and end markers are not present, the pages (related to bulk load) from start mark (to end marker) is deleted and the operation is redone.
In one implementation, the present invention avoids the data loss when loading data in a scenario where unexpected crash or urgent shutdown of the DBMS is necessary for the database once the bulk load is completed.
Apart from what is explained above, the present invention also include the below mentioned advantages:
√ The present invention, eliminate the chances of database data loss by removing the dependency on checkpoint.
√ The present invention enables to achieve no redundancy of data files needed.
√ The present invention enables to achieve no explicit redo written for tuple loading.
√ The present invention enables to achieve consistency and security check could be ensured to protect corrupt data load on restart.
√ The present invention enables the system view or similar interfaces/framework exposed from the server shows the mechanism dependency on the csv file.
√ The present invention enables to remove the files from the said location, or make it in-accessible to the server during the server restart.
√ The present invention enables to replace the file with corrupt data and check for data reload
A person of ordinary skill in the art may be aware that in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps may be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on the particular applications and design constraint conditions of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of the present invention.
It may be clearly understood by a person skilled in the art that for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, reference may be made to a corresponding process in the foregoing method embodiments, and details are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings  or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or a part of the technical solutions may be implemented in a form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, or a network device) to perform all or a part of the steps of the methods described in the embodiment of the present invention. The foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM) , a random access memory (Random Access Memory, RAM) , a magnetic disk, or an optical disc.
Although implementations for system, apparatus and method for avoiding data loss upon recovery of bulk load have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as examples of implementations of the system, apparatus and method for avoiding data loss upon recovery of bulk load.

Claims (13)

  1. An apparatus 102 for bulk loading the data into at least one database 114, the apparatus 102 comprising:
    a processor 104; and
    a memory 108 coupled to the processor 104 for executing a plurality of modules present in the memory 108, the plurality of modules comprising:
    a bulk load module 110 configured to:
    receive at least one file associated information, the file stored in the memory from which the data is to be fetched for bulk loading;
    generate at least one redo log based on the file, the redo log contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information;
    update system view of the file dependency based on the redo log generated;
    ensure persistence of the redo log flushed;
    initiate the bulk load into the database by allocating at least one fresh data page for storing the records from the file received for bulk loading, thereby
    add the fresh data page generated to a dirty queue list, wherein the fresh data page is allocated with at least a start marker and/or at least an end marker; and
    a recovery module 112 configured to check consistency by checking the start marker and the end marker to ensure that the fresh data page/s have been flushed into the database, thereby:
    remove, if all the fresh data page/s are flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or
    delete, if all the fresh data page/s are not flushed, the fresh data page/s starting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load.
  2. The apparatus as claimed in claim 1, wherein the file is at least one comma separated values (csv) file or a file containing the database table.
  3. The apparatus as claimed in claim 1, comprises at least one local file system is configured to operate as a temporary cache in case of the file for bulk load and the database recites in same machine.
  4. The apparatus as claimed in claim 1, comprises at least one distributed file system is configured to operate as a temporary cache in case of the file for bulk load and the database resides in different machines.
  5. The apparatus as claimed in claim 1, wherein bulk load module generates one or more fresh data pages to store the redo data from the redo log, and adds the fresh data pages generated to the dirty queue list at the end of the bulk load, thereby enables the apparatus to achieve an exclusivity in the bulk loading and the check pointing to ensure the bulk load changes are ready for checkpoint at a same instance.
  6. The apparatus as claimed in claim 1, wherein the start marker and the end marker for the page/s belonging to the bulk load are persisted as a part of checkpoint meta-data or at least one data structure, the start marker, the end marker, and the data structure enables to identify the flushing of the page/s during a recovery operation.
  7. A method for bulk loading the data into a database, comprising:
    receiving 302 at least one file associated information, the file stored in the memory from which the data is to be fetched for bulk loading;
    generating 304 at least one redo log based on the file received, the redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information;
    updating 306 system view of the file dependency based on the file received for bulk load;
    ensuring 308 persistence of the redo log flushed;
    initiating 310 the bulk load into the database by allocating at least one fresh data page to store the records from the file received for bulk loading, thereby
    adding 312 the fresh data page generated to a dirty queue list, wherein said page is allocated with at least a start marker and/or at least an end marker;
    checking 314 consistency by checking the start marker and the end marker to ensure that the fresh data page/s have been flushed into the database, thereby:
    removing 316, if all the fresh data page/s are flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or
    delete 318 if all the fresh data page/s are not flushed, the fresh data page/s starting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load.
  8. The method as claimed in claim 7, wherein the file is at least one comma separated values (csv) file or a file containing the database table.
  9. The method as claimed in claim 7, comprises using local file system as a temporary cache in case of the file for bulk load and the database recites in same machine.
  10. The method as claimed in claim 7, comprises using a distributed file system as a temporary cache in case of the file for bulk load and the database recites in different machines.
  11. The method as claimed in claim 7, comprises: generating one or more fresh data pages to store the redo data from the redo log, and adding the fresh data pages generated to the dirty queue list at the end of the bulk load, thereby achieving exclusivity in the bulk loading and the check pointing to ensure the bulk load changes are ready for checkpoint at a same instance.
  12. The method as claimed in claim 7, wherein the start marker and the end marker for the page/s belonging to the bulk load are persisted as a part of checkpoint meta-data or at least one data structure, the start marker, the end marker, and the data structure enables to identify the flushing of the page/s during a recovery operation.
  13. A system 202 for bulk loading the data into at least one database 114 recited in at least one apparatus 102, the system comprising:
    a processor 204; and
    a memory 208 coupled to the processor for executing a plurality of modules present in the memory 204, the plurality of modules comprising:
    a bulk load module 210 configured to:
    transmit at least one file for storage in the apparatus 102, and thereby trigger the bulk load of the data from the file received into the database 114;
    generate at least one redo log based on the file transmitted, and transmit the redo file contain the redo data associated with at least two or more preallocated files that store all changes made to the database as they occur, a group of change vectors, and the other information, to the apparatus;
    update system view of the file dependency based on the file received for bulk load;
    ensure persistence of the redo log flushed;
    initiate the bulk load into the database by allocating at least one fresh data page to store the records from the file received for bulk loading, thereby
    add the fresh data page generated to a dirty queue list, wherein said page is allocated with at least a start marker and/or at least an end marker; and
    a recovery module 212 configured to check consistency by checking the start marker and the end marker to ensure that the fresh data page/s have been flushed into the database, thereby:
    remove, if all the fresh data page/s are flushed, the dependency on the file received for bulk loading, and accordingly update the system view; or
    delete, if all the fresh data page/s are not flushed, the fresh data page/s starting from the start marker to the end marker by simultaneously checking the consistency of the file received for bulk loading using the redo log, and accordingly update the dependency of the file received for bulk loading; and thereafter redo the bulk load.
PCT/CN2016/111335 2015-12-28 2016-12-21 System, apparatus and method for avoiding data loss on recovery of bulk load WO2017114263A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN6989/CHE/2015 2015-12-28
IN6989CH2015 2015-12-28

Publications (1)

Publication Number Publication Date
WO2017114263A1 true WO2017114263A1 (en) 2017-07-06

Family

ID=59225592

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/111335 WO2017114263A1 (en) 2015-12-28 2016-12-21 System, apparatus and method for avoiding data loss on recovery of bulk load

Country Status (1)

Country Link
WO (1) WO2017114263A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111966533A (en) * 2020-07-23 2020-11-20 招联消费金融有限公司 Electronic file management method and device, computer equipment and storage medium
CN112631869A (en) * 2020-12-28 2021-04-09 深圳市彬讯科技有限公司 Page loading data monitoring method and device, computer equipment and storage medium
CN112650625A (en) * 2020-12-28 2021-04-13 武汉达梦数据技术有限公司 Streaming backup restoration method, storage medium and device for database

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080307011A1 (en) * 2007-06-07 2008-12-11 International Business Machines Corporation Failure recovery and error correction techniques for data loading in information warehouses
US20100011026A1 (en) * 2008-07-10 2010-01-14 International Business Machines Corporation Method and system for dynamically collecting data for checkpoint tuning and reduce recovery time
CN103412803A (en) * 2013-08-15 2013-11-27 华为技术有限公司 Data recovering method and device
CN105022676A (en) * 2014-04-22 2015-11-04 大唐软件技术股份有限公司 Recovery method and device of main memory database redo log files

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080307011A1 (en) * 2007-06-07 2008-12-11 International Business Machines Corporation Failure recovery and error correction techniques for data loading in information warehouses
US20100011026A1 (en) * 2008-07-10 2010-01-14 International Business Machines Corporation Method and system for dynamically collecting data for checkpoint tuning and reduce recovery time
CN103412803A (en) * 2013-08-15 2013-11-27 华为技术有限公司 Data recovering method and device
CN105022676A (en) * 2014-04-22 2015-11-04 大唐软件技术股份有限公司 Recovery method and device of main memory database redo log files

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111966533A (en) * 2020-07-23 2020-11-20 招联消费金融有限公司 Electronic file management method and device, computer equipment and storage medium
CN112631869A (en) * 2020-12-28 2021-04-09 深圳市彬讯科技有限公司 Page loading data monitoring method and device, computer equipment and storage medium
CN112650625A (en) * 2020-12-28 2021-04-13 武汉达梦数据技术有限公司 Streaming backup restoration method, storage medium and device for database

Similar Documents

Publication Publication Date Title
US11429641B2 (en) Copying data changes to a target database
Ganesan et al. Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to file-system faults
US10372559B2 (en) Managing a redundant computerized database using a replicated database cache
US8132043B2 (en) Multistage system recovery framework
Khetrapal et al. HBase and Hypertable for large scale distributed storage systems
US8732121B1 (en) Method and system for backup to a hidden backup storage
CN108664359B (en) Database recovery method, device, equipment and storage medium
US10146630B1 (en) Block changes framework for delta file incremental backup
US11269927B2 (en) Transactional replicator
KR20180021679A (en) Backup and restore from a distributed database using consistent database snapshots
US9542279B2 (en) Shadow paging based log segment directory
US20180046548A1 (en) Method and Apparatus for Tracking Objects in a First Memory
WO2017114263A1 (en) System, apparatus and method for avoiding data loss on recovery of bulk load
US11899540B2 (en) Regenerating a chain of backups
US11334445B2 (en) Using non-volatile memory to improve the availability of an in-memory database
US10877868B2 (en) Applying a log to storage segments
Krogh et al. Pro MySQL NDB Cluster
US11934275B2 (en) Backup copy validation as an embedded object
US20230315881A1 (en) Universal platform for data protection
KR102005727B1 (en) Multiple snapshot method based on change calculation hooking technique of file system
US11899538B2 (en) Storage integrated differential block based backup
US11880283B2 (en) Backup copy validation as a workflow
Zhou et al. FoundationDB: A Distributed Key Value Store
Manolache et al. Double Standard Method for Designing Adaptive Backup Systems
Sun Detecting and understanding crash-consistency bugs across the parallel I/O stack

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16881052

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE