EP3128445B1 - Data archive vault in big data platform - Google Patents

Data archive vault in big data platform

Info

Publication number
EP3128445B1
Authority
EP
European Patent Office
Prior art keywords
data
information
vault
engine
data structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
EP16001738.0A
Other languages
German (de)
French (fr)
Other versions
EP3128445A1 (en)
Inventor
Axel Herbst
Veit Bolik
Mathias Roeher
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SAP SE
Original Assignee
SAP SE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/219 Managing data history or versioning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2272 Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database

Definitions

  • Embodiments relate to handling large data volumes, and in particular, to a vault archive implemented in a big data platform.
  • big data can include unstructured postings and shared documents available from social media.
  • other types of structured data can also be stored, including rapidly increasing volumes of financial data for processing by business management systems.
  • Inexpensive long-term storage of historical data calls for the ability to use those data assets - for example to maintain the information that the data represents, and allow for flexible data analysis (reporting). This data storage ability is desired across even the classical silos.
  • MILENA IVANOVA ET AL ("Data Vaults: Database Technology for Scientific File Repositories", COMPUTING IN SCIENCE AND ENGINEERING, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 15, no. 3, 1 May 2013, pages 32-42, XP011518825, ISSN: 1521-9615, DOI: 10.1109/MCSE.2013.17) describe a data vault approach to let researchers effectively and efficiently explore and analyse information.
  • Embodiments relate to data archiving utilizing an existing big data platform (e.g., HADOOP) as a cost-effective target infrastructure for storage.
  • Particular embodiments construct a logical structure (hereafter, "vault") in the big data platform so that a source, type, and context of the data is maintained, and metadata can be added to aid searching for snapshots according to a given time, version, and other considerations.
  • a vaulting process transforms relationally stored data into an object view to allow for object-based retrieval or object-wise operations (such as destruction due to legal data privacy reasons), and provides references to also store unstructured data (e.g., sensor data, documents, streams) as attachments.
  • a legacy archive extractor provides extraction services for existing archives, so that extracted information is stored in the same vault. This allows for cross queries over legacy data and data from other sources, facilitating the application of new analysis techniques by data scientists.
  • An embodiment of a computer-implemented method comprises an engine of a big data platform receiving, from an application layer, a first input comprising a plurality of fields organized in a first data structure.
  • the engine receives, from the application layer, context information relevant to the first data structure.
  • the engine stores, in a vault of the big data platform, values of the plurality of fields and the context information organized as a second data structure different from the first data structure.
  • a non-transitory computer readable storage medium embodies a computer program for performing a method comprising an engine of a big data platform receiving a first input comprising a plurality of fields organized in a first data structure.
  • the engine receives context information relevant to the first data structure.
  • the engine stores in a vault of the big data platform, values of the plurality of fields and the context information organized as a second data structure different from the first data structure.
  • a computer system comprises one or more processors and a software program, executable on said computer system.
  • the software program is configured to cause an engine of a big data platform to receive a first input comprising a plurality of fields organized in a first data structure.
  • the software program is further configured to cause the engine to receive context information relevant to the first data structure, and to store in a cluster of the big data platform, values of the plurality of fields in a plurality of storage nodes, and store the context information in a vault catalog, organized as a second data structure different from the first data structure.
  • the vault comprises a cluster of storage nodes and calculation nodes.
  • the context information is stored in a catalog of the vault, and values of the plurality of fields are stored in a subset of the storage nodes.
  • the values are denormalized.
  • Certain embodiments further comprise the engine handling the second data structure without processing the context information.
  • the context information comprises time information, version information, or structure information.
  • Various embodiments further comprise the engine processing the context information to handle the second data structure.
  • the context information comprises compliance information.
  • Some embodiments further comprise the engine receiving the first data structure from a database, and the engine aging the first data structure within the big data platform.
  • Described herein are methods and apparatuses configured to perform data archiving in a big data platform.
  • numerous examples and specific details are set forth in order to provide a thorough understanding of the present invention.
  • the subject-matter of the present invention is achieved by the appended claims.
  • Figure 1 presents a simplified view of a system 100 according to an embodiment.
  • a user 102 interacts with an application layer 104 to perform handling and analysis of data.
  • the data being manipulated by the application layer is typically organized into a larger data object 106 comprising a plurality of data fields 108.
  • the data object comprises the individual data fields A", B', G, and L, and its general structure is indicated by a rectangular shape.
  • the data object is typically stored and retrieved in an underlying database 110.
  • Data is typically stored in the database for access over relatively short periods, on the order of months to a few years.
  • the database and/or the data object may evolve.
  • the database may be updated to exhibit added features/functionality.
  • the data object may also change over time to reflect modified/different fields.
  • the data object 106 (A"B'GL) has evolved from earlier structures.
  • One earlier data structure included previous incarnations of the A and B fields, and a different field Z.
  • a still earlier data structure included only fields A and B in their original form.
  • Figure 1 further shows the application layer as in communication with a big data platform 120.
  • This big data platform comprises an engine 122, and further comprises a vault 124.
  • the engine is configured to receive a data object and context information 125 relevant thereto, and to store both together within the archive for retrieval over long periods.
  • the vault is configured to store data (e.g., as previously present in various data objects), together with relevant reference data Rn providing various pieces of contextual information regarding that data.
  • Figure 1 shows the vault as comprising a cluster 140 comprising a first plurality of nodes 142 configured to store data object content, and a second plurality of nodes 144 (e.g., a catalog) configured to store associated reference information.
  • Figure 1 shows a copy 126 of the data A"B'GL of the current data object, archived together with reference data R1 indicating a time (date) of archiving of that data.
  • the data object 126 (A"B'GLR1)
  • the large capacity volume of the big data platform allows many such snapshots of the same data object to be reliably stored in the vault, over long periods.
  • embodiments can archive data efficiently, conserving storage resources by allowing state of the art techniques (e.g., for compression, delta storage, or deduplication) to be applied in order to store identical/common/unchanged parts of the snapshots only once.
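As a rough illustration of storing the unchanged parts of several snapshots only once, the sketch below uses content-addressed deduplication. This is one well-known technique chosen for illustration; the text does not say which deduplication method the embodiments use. The names `SnapshotStore`, `put_snapshot`, and `get_snapshot`, as well as the snapshot identifiers and field values, are hypothetical.

```python
import hashlib
import json


class SnapshotStore:
    """Toy content-addressed store: identical field values are kept once."""

    def __init__(self):
        self.blobs = {}      # digest -> serialized field value
        self.snapshots = {}  # snapshot id -> {field name: digest}

    def put_snapshot(self, snapshot_id, fields):
        manifest = {}
        for name, value in fields.items():
            payload = json.dumps(value, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(payload).hexdigest()
            # setdefault keeps the first copy, so a repeated value
            # adds only a manifest entry and no new blob.
            self.blobs.setdefault(digest, payload)
            manifest[name] = digest
        self.snapshots[snapshot_id] = manifest

    def get_snapshot(self, snapshot_id):
        manifest = self.snapshots[snapshot_id]
        return {name: json.loads(self.blobs[d]) for name, d in manifest.items()}


store = SnapshotStore()
# Two snapshots of the same object; only field G changes between them.
store.put_snapshot("snap-1", {"A": "a2", "B": "b1", "G": 7, "L": [1, 2]})
store.put_snapshot("snap-2", {"A": "a2", "B": "b1", "G": 8, "L": [1, 2]})
print(len(store.blobs), "blobs stored for 8 field values")
```

Each snapshot stays fully reconstructable from its manifest, while the values shared between snapshots occupy storage only once.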
  • the vault can also be utilized to archive snapshots of different versions of the data object.
  • a snapshot of an earlier-version data object 128 (ABR1) is also stored in the vault.
  • the reference information stored in the vault associated with the data is not limited to the time information just discussed.
  • the archived data object 130 may additionally include another type of reference information, for example identifying the specific version of the application software for which data object 130 was designed to be compatible. Such reference information can be valuable for recognizing the function of the archived data object many years after the fact.
  • a wide variety of reference information can be stored in the vault with the archived data.
  • data objects P and Q from a different source 132 entirely may be relevant to the application and thus also sought to be archived.
  • Figure 1 also shows the vault as storing data objects PR3 and QR1R3 that include reference information specifically identifying those different data object organizational schemas for future recognition.
  • reference information associated with the archived data is not limited to the time, software version, or data object organizational schema, and may include any type of information.
  • Such archived information may be active or passive. Active archived information is "understood" and processed/enforced by the engine. Examples of active archived information can include but are not limited to:
  • a compliance property "legal hold active" may serve to prevent a client from modifying archived data.
  • a retention property "expiry" that includes a future date protects the archived data from being deleted.
  • access policies bound to the data in the vault serve to control visibility of the data to users.
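The three kinds of active information listed above (legal hold, expiry, and access policy) can be pictured with a minimal sketch of an engine that checks them before acting. Everything here is an assumption made for illustration: the class `ArchivedObject`, the functions `modify`, `delete`, and `read`, the exception `PolicyError`, and the sample values are hypothetical and are not taken from the text.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


class PolicyError(Exception):
    """Raised when an active property forbids the requested operation."""


@dataclass
class ArchivedObject:
    payload: dict
    legal_hold: bool = False                    # compliance property "legal hold active"
    expiry: Optional[date] = None               # retention property "expiry"
    readers: set = field(default_factory=set)   # access policy: who may see the data


def modify(obj, changes):
    # A legal hold prevents a client from modifying archived data.
    if obj.legal_hold:
        raise PolicyError("legal hold active: modification refused")
    obj.payload.update(changes)


def delete(vault, key, today):
    # An expiry date that still lies in the future protects against deletion.
    obj = vault[key]
    if obj.expiry is not None and today < obj.expiry:
        raise PolicyError("retention period has not expired")
    del vault[key]


def read(obj, user):
    # The access policy bound to the data controls its visibility.
    if user not in obj.readers:
        raise PolicyError("user may not see this object")
    return dict(obj.payload)


vault = {
    "doc-1": ArchivedObject({"amount": 100}, legal_hold=True,
                            expiry=date(2030, 1, 1), readers={"auditor"}),
}
print(read(vault["doc-1"], "auditor"))
```

Passive information, by contrast, would simply be stored alongside the payload without any such checks.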
  • Passive information stored in the archive may include properties deemed useful for later understanding of the archived data, but which are not necessarily processed by an engine. Examples of passive information can include but are not limited to:
  • the processing engine of the big data platform is leveraged to create and populate the vault with archived data objects.
  • An example of this vaulting process is described in connection with the example below.
  • Figure 1 also shows the database as being in communication with the big data platform.
  • This communication may be in the form of a data aging process.
  • a data aging process refers to the transfer of data from the database to the big data platform based upon considerations that may include an age of the data.
  • This data aging avoids accumulating old/stale/infrequently accessed data within the relatively limited/expensive confines of the database designed to accommodate detailed querying.
  • such data aging may differ from data archiving in a number of respects.
  • One such respect is the form of storage of the data within the big data platform.
  • the structure of the aged data object may be preserved in the big data platform.
  • Such tight coupling between the aged data and external data can facilitate its rapid recall to the database if necessary.
  • tight coupling may not afford certain benefits accrued from looser coupling between archived data and external data as is further described below.
  • Data aging executed by the database in conjunction with the big data platform may also differ from data archiving with respect to the location of the aged data. That is, the aged data may be located in the big data platform outside of the vault.
  • the data aging process may not necessarily produce the associated reference data that is useful for archiving purposes. Rather, the aged data in its existing form (lacking context information) may simply be shifted to the big data platform, which serves as a more cost-effective vessel for containing information.
  • FIG. 2 is a simplified flow diagram showing a method 200 according to an embodiment.
  • an engine of a big data platform receives as first input, a data object comprising various fields organized in a first structure.
  • the engine receives a second input in the form of contextual information relevant to the data object.
  • contextual information can include but is not limited to, time information, version information, source information, type information, and/or data object structure information.
  • the engine causes the values of the fields to be stored in a vault of the big data platform, associated with the contextual information as a data object having a second structure.
  • the vault may comprise a cluster with storage and compute nodes.
  • a subset of nodes may be established as a vault catalog storing metadata that attributes the objects.
  • Another node subset may be established as content nodes for storing the objects themselves.
  • the contextual information may be processed by the engine in subsequent data handling of the second structure (e.g., read and/or write access, expiry, compliance, etc.)
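The flow of method 200 can be sketched in a few lines, assuming plain in-memory dictionaries stand in for the catalog nodes and content nodes of the cluster. The names `Vault`, `Engine`, `archive`, and `find`, the key layout, and the sample context values are hypothetical; the sketch only mirrors the steps described above and is not the patented implementation.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Vault:
    catalog: dict = field(default_factory=dict)  # stands in for the catalog nodes
    content: dict = field(default_factory=dict)  # stands in for the content nodes


class Engine:
    def __init__(self, vault):
        self.vault = vault

    def archive(self, object_id, fields, context):
        # First input: field values in their original (first) structure.
        # Second input: context information relevant to that structure.
        key = f"{object_id}@{context['time']}"
        # The values go to content storage in a serialized form, and the
        # context goes to the catalog: together, the second structure.
        self.vault.content[key] = json.dumps(fields, sort_keys=True)
        self.vault.catalog[key] = dict(context, object_id=object_id)
        return key

    def find(self, **criteria):
        # Search snapshots through the catalog metadata only.
        return sorted(
            key for key, meta in self.vault.catalog.items()
            if all(meta.get(name) == value for name, value in criteria.items())
        )


engine = Engine(Vault())
k1 = engine.archive("order-1", {"A": 1, "B": 2},
                    {"time": "2014-06-30", "version": "v1"})
k2 = engine.archive("order-1", {"A": 1, "B": 3, "G": 4, "L": 5},
                    {"time": "2015-06-30", "version": "v3"})
hits = engine.find(object_id="order-1", version="v3")
print(hits)
```

Keeping the context in a separate catalog means a search by time or version never has to open the archived content itself.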
  • Figure 3 shows a simplified view of the system according to this example.
  • HADOOP is utilized as a cost-efficient target infrastructure for storage.
  • a Universal Object Vault (logical structure) is built in HADOOP so that the source, type, and context of the data is maintained, and metadata can be added that aids in searching for snapshots at a given time, versions etc.
  • Figure 3 shows data archived in the vault of the big data platform, accessible to a variety of potential users, for a variety of potential purposes. For example, documents may be attached and entered into the vault for archiving directly.
  • Figure 3 also shows the input of data to the vault via a Legacy Archive extractor process, which is described further below.
  • Figure 3 further shows access to data archived in the vault, by a native client.
  • An example of such access could be by a data scientist who is seeking to perform analysis of the archived data.
  • the specific system shown in Figure 3 further includes storage of data from overlying application(s) (e.g., Suite, S/4 available from SAP SE of Walldorf, Germany) utilizing the HANA in-memory database, which is also available from SAP SE.
  • a VELOCITY engine is also employed as part of data handling including scale-out extension of the HANA in-memory database.
  • Figure 3 further shows the HANA in-memory database utilizing the HADOOP platform for purposes of data aging.
  • data aging may or may not ultimately involve migration of aged data within the big data platform into the Universal Object Vault region itself.
  • Figure 4A illustrates an example orchestration of a simplified HADOOP cluster according to an embodiment, including a vault 400.
  • a subset of nodes is established as a vault catalog (a structure for metadata that attributes the objects). Another subset is established as content nodes (storing the objects).
  • the catalog is made known to an external processing engine (e.g., here the HANA SQL optimizer).
  • Connectivity of clients/adaptors (e.g. HANA SQL query executor) to content is established using standard HADOOP APIs.
  • Figure 4B shows a simplified view illustrating definition of the logical structure of the vault according to one embodiment.
  • This logical structure comprises different areas for content, and a collection of metadata in the vault catalog.
  • some metadata types may be processed by the engine to actively control archive behavior during later data handling activities.
  • Metadata can include but are not limited to:
  • Relationally stored data is transformed into an object view to allow for object-based retrieval or object-wise operations (such as expiry/destruction in compliance with legal data privacy considerations), and to allow for references to unstructured data that is also stored, such as attachments, sensor data, documents, and streams.
  • This vaulting process allows conserving "informational relationship knowledge" over long periods of time (e.g., 5-30 years or even longer). By looking only at relational database tables, especially normalized ones, one cannot reconstruct the "natural" business object. This is because foreign keys - the common practice for expressing relationships - are not typically part of a database schema.
  • Figure 4C shows a sample of table data.
  • the table data of Figure 4C may be denormalized, such as by serializing into the natural object structure (object-wise clustering by executing joins and materializing the result set, using an open format).
  • An example of denormalizing the table data of Figure 4C is shown in Figure 4D.
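The denormalization step described above (executing joins, clustering the result object-wise, and materializing it in an open format) can be sketched with the Python standard library. The tables `orders` and `items`, their columns, and the sample rows are hypothetical and do not reproduce Figures 4C and 4D; JSON is used here only as a convenient stand-in for an open format.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items  (order_id INTEGER, product TEXT, qty INTEGER);
    INSERT INTO orders VALUES (1, 'ACME'), (2, 'Globex');
    INSERT INTO items  VALUES (1, 'bolt', 10), (1, 'nut', 20), (2, 'gear', 5);
""")


def denormalize(conn):
    """Join the normalized tables and serialize one document per order."""
    rows = conn.execute("""
        SELECT o.order_id, o.customer, i.product, i.qty
        FROM orders AS o
        JOIN items AS i ON i.order_id = o.order_id
        ORDER BY o.order_id, i.product
    """)
    objects = {}
    for order_id, customer, product, qty in rows:
        # Object-wise clustering: all rows of one order end up in one object.
        obj = objects.setdefault(
            order_id, {"order_id": order_id, "customer": customer, "items": []}
        )
        obj["items"].append({"product": product, "qty": qty})
    # Materialize each natural object as a self-contained JSON document.
    return [json.dumps(obj, sort_keys=True) for obj in objects.values()]


documents = denormalize(conn)
for doc in documents:
    print(doc)
```

Each resulting document carries its own relationships, so the business object can still be understood without the original schema or its foreign keys.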
  • Embodiments of the vaulting process add context and metadata to preserve the history and interpretability of the primary data over decade-long time spans. Embodiments may also aid in eliminating redundant data stores. Possible examples include duplicate files of the SAP Data Retention Tool (DaRT) for tax and other audit purposes.
  • Embodiments may also provide extraction services that allow archiving of data from other sources.
  • This extraction process may also be referred to as a Legacy Archive extractor process.
  • the extracted information is stored in the same vault, allowing for cross queries over legacy data and data from other sources. This permits the application of new analysis techniques currently employed by, e.g., data scientists.
  • SAP Archive Development Kit (ADK)
  • HADOOP Distributed File System (HDFS)
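To make the idea of cross queries concrete, the following sketch maps records from a legacy archive onto the same shape as objects already in the vault, then runs one query over both. The legacy record layout (`ID`, `CUST`, `AMT`), the function names `extract_legacy` and `total_by_customer`, and all sample values are hypothetical; real legacy archive formats are not described in the text.

```python
def extract_legacy(record):
    """Map a hypothetical legacy archive record onto the vault's common shape."""
    return {
        "object_id": record["ID"],
        "source": "legacy",
        "customer": record["CUST"],
        "amount": float(record["AMT"]),
    }


# Objects already archived from a current application.
vault = [
    {"object_id": "n-1", "source": "app", "customer": "ACME", "amount": 120.0},
]

# Records held in an older archive, in their own layout.
legacy_archive = [
    {"ID": "l-7", "CUST": "ACME", "AMT": "80.50"},
    {"ID": "l-8", "CUST": "Globex", "AMT": "15.00"},
]

# The extractor stores the converted records in the same vault.
vault.extend(extract_legacy(record) for record in legacy_archive)


def total_by_customer(vault):
    """A cross query: it does not care which source an object came from."""
    totals = {}
    for obj in vault:
        totals[obj["customer"]] = totals.get(obj["customer"], 0.0) + obj["amount"]
    return totals


totals = total_by_customer(vault)
print(totals)
```

Because both sources share one structure after extraction, the query needs no source-specific branches.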
  • utilization of a vaulting process with a vault structure constructed within a big data platform may offer one or more benefits.
  • One possible advantage is relatively loose coupling of information.
  • Loose coupling between data internal/external to the archive vault also facilitates possible desirable separation of backup and/or recovery functions. That is, the data in the application can be backed up and/or recovered, independent of the state of the data within the vault. Thus while under some circumstances the vault may be useful for such backup/recovery processes, under other circumstances separate mechanism(s) dedicated to performing backup/recovery may be better suited to those roles.
  • Loose coupling between data in the vault and data external thereto may also promote data access. That is, such a configuration allows n:m server usage for client systems, permitting data archiving services to be offered to different clients.
  • data archive vault is self-contained. That is, relational knowledge (even over decade-long periods) is conserved. Reference to object views of data within the vault, offers explicit recognition and appreciation of associated reference data (e.g., data versioning info, dates, other relevant metadata) providing context enrichment.
  • a conventional database schema may not preserve this type of context information (e.g., foreign keys) at all, or may not store it in a consistent format amenable to preservation/access over (decade) long time periods.
  • data archive vaulting provides flexibility, in that the data is not required to come from one particular source (e.g., the suite of related business applications available from SAP). Rather, the archive can accommodate data from a heterogeneous mixture of sources, particularly considering implementation of the Legacy Archive Extractor process.
  • an archiving approach utilizing a vault within a big data platform can readily be tailored to meet various compliance requirements (e.g., regulatory, contractual) arising within the data storage environment. Examples can include mandated data expiry/deletion of personal data, and restrictions on modifying data after its initial storage.
  • a data archive vault implemented in a big data platform can provide standard format/access (e.g., PARQUET, SPARK) for open processing by newly developed applications.
  • a data archive vault may facilitate easier transition based on aging checks. That is, the power and flexibility associated with data archiving according to embodiments, may promote the performance of separate data aging processes in a more efficient and/or effective manner.
  • Figure 5 illustrates hardware of a special purpose computing machine configured to implement data archiving according to an embodiment.
  • computer system 501 comprises a processor 502 that is in electronic communication with a non-transitory computer-readable storage medium 503.
  • This computer-readable storage medium has stored thereon code 505 corresponding to a data archive vault.
  • Code 504 corresponds to an engine.
  • Code may be configured to reference data stored in a database of a non-transitory computer-readable storage medium, for example as may be present locally or in a remote database server.
  • Software servers together may form a cluster or logical network of computer systems programmed with software programs that communicate with each other and work together in order to process requests.
  • the engine is shown as being part of a database.
  • Such an embodiment can correspond to applications performing processing by a powerful engine available as part of an in-memory database (e.g., the HANA in-memory database available from SAP SE of Walldorf, Germany).
  • the engine may be implemented in other ways, for example as part of an overlying application layer.
  • Computer system 610 includes a bus 605 or other communication mechanism for communicating information, and a processor 601 coupled with bus 605 for processing information.
  • Computer system 610 also includes a memory 602 coupled to bus 605 for storing information and instructions to be executed by processor 601, including information and instructions for performing the techniques described above, for example.
  • This memory may also be used for storing variables or other intermediate information during execution of instructions to be executed by processor 601. Possible implementations of this memory may be, but are not limited to, random access memory (RAM), read only memory (ROM), or both.
  • a storage device 603 is also provided for storing information and instructions.
  • Storage devices include, for example, a hard drive, a magnetic disk, an optical disk, a CD-ROM, a DVD, a flash memory, a USB memory card, or any other medium from which a computer can read.
  • Storage device 603 may include source code, binary code, or software files for performing the techniques above, for example.
  • Storage device and memory are both examples of computer readable mediums.
  • Computer system 610 may be coupled via bus 605 to a display 612, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user.
  • An input device 611 such as a keyboard and/or mouse is coupled to bus 605 for communicating information and command selections from the user to processor 601. The combination of these components allows the user to communicate with the system.
  • bus 605 may be divided into multiple specialized buses.
  • Computer system 610 also includes a network interface 604 coupled with bus 605.
  • Network interface 604 may provide two-way data communication between computer system 610 and the local network 620.
  • the network interface 604 may be a digital subscriber line (DSL) or a modem to provide data communication connection over a telephone line, for example.
  • Another example of the network interface is a local area network (LAN) card to provide a data communication connection to a compatible LAN.
  • Wireless links are another example.
  • network interface 604 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.
  • Computer system 610 can send and receive information, including messages or other interface actions, through the network interface 604 across a local network 620, an Intranet, or the Internet 630.
  • computer system 610 may communicate with a plurality of other computer machines, such as server 615.
  • server 615 may form a cloud computing network, which may be programmed with processes described herein.
  • software components or services may reside on multiple different computer systems 610 or servers 631-635 across the network.
  • the processes described above may be implemented on one or more servers, for example.
  • a server 631 may transmit actions or messages from one component, through Internet 630, local network 620, and network interface 604 to a component on computer system 610.
  • the software components and processes described above may be implemented on any computer system and send and/or receive information across a network, for example.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Description

    BACKGROUND
  • Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
  • Embodiments relate to handling large data volumes, and in particular, to a vault archive implemented in a big data platform.
  • With the evolution in sophistication and complexity of databases, stored data has become available for visualization and analysis in increasingly large volumes. Such "big data" may comprise millions or even billions of different records.
  • Examples of big data can include unstructured postings and shared documents available from social media. However, other types of structured data can also be stored, including rapidly increasing volumes of financial data for processing by business management systems.
  • Even though data of many kinds (e.g., unstructured and structured) is growing exponentially, it may be desired to retain that data for many years. This desire to archive data may be attributable to business value considerations and/or legal reasons.
  • Inexpensive long-term storage of historical data calls for the ability to use those data assets - for example to maintain the information that the data represents, and allow for flexible data analysis (reporting). This data storage ability is desired across even the classical silos.
  • In one example, it may be necessary to store a communication history together with the closing of a deal. In another example, it may be necessary to relate sensor data to a maintenance request.
  • Conventionally, storing such large volumes of data can be expensive. With such large data volumes at issue, difficulties can arise in preserving the data in a manner that allows cross-querying, where the data is stored unrelatedly in different silos. It can also be a challenge to keep track of the historical state of the data, given changes in the system environment over time, and also evolution in the data structures themselves.
  • MILENA IVANOVA ET AL ("Data Vaults: Database Technology for Scientific File Repositories", COMPUTING IN SCIENCE AND ENGINEERING, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 15, no. 3, 1 May 2013, pages 32-42, XP011518825, ISSN: 1521-9615, DOI: 10.1109/MCSE.2013.17) describe a data vault approach to let researchers effectively and efficiently explore and analyse information.
  • RON DUPLAIN ET AL ("Data Vaults: providing simple web access to NRAO data archives", OPTICAL SENSING II, vol. 7019, 8 August 2008 (2008-08-08), page 70191A, XP055325724, 1000 20th St. Bellingham WA 98225-6705 USA, ISSN: 0277-786X, DOI: 10.1117/12.789402, ISBN: 978-1-62841-971-9) describe a data vault project with new approaches to searching and browsing contents of all data from telescopes of the National Radio Astronomy Observatory.
  • SUMMARY
  • Embodiments relate to data archiving utilizing an existing big data platform (e.g., HADOOP) as a cost-effective target infrastructure for storage. Particular embodiments construct a logical structure (hereafter, "vault") in the big data platform so that a source, type, and context of the data is maintained, and metadata can be added to aid searching for snapshots according to a given time, version, and other considerations. A vaulting process transforms relationally stored data in an object view to allow for object-based retrieval or object-wise operations (such as destruction due to legal data privacy reasons), and provide references to also store unstructured data (e.g., sensor data, documents, streams) as attachments. A legacy archive extractor provides extraction services for existing archives, so that extracted information is stored in the same vault. This allows for cross queries over legacy data and data from other sources, facilitating the application of new analysis techniques by data scientists.
  • An embodiment of a computer-implemented method comprises, an engine of a big data platform receiving from an application layer, a first input comprising a plurality of fields organized in a first data structure. The engine receives from the application layer, context information relevant to the first data structure. The engine stores in a vault of the big data platform, values of the plurality of fields and the context information organized as a second data structure different from the first data structure.
  • A non-transitory computer readable storage medium embodies a computer program for performing a method comprising an engine of a big data platform receiving a first input comprising a plurality of fields organized in a first data structure. The engine receives context information relevant to the first data structure. The engine stores in a vault of the big data platform, values of the plurality of fields and the context information organized as a second data structure different from the first data structure.
  • A computer system according to an embodiment comprises one or more processors and a software program, executable on said computer system. The software program is configured to cause an engine of a big data platform to receive a first input comprising a plurality of fields organized in a first data structure. The software program is further configured to cause the engine to receive context information relevant to the first data structure, and to store in a cluster of the big data platform, values of the plurality of fields in a plurality of storage nodes, and store the context information in a vault catalog, organized as a second data structure different from the first data structure.
  • In some embodiments the vault comprises a cluster of storage nodes and calculation nodes.
  • In particular embodiments the context information is stored in a catalog of the vault, and values of the plurality of fields are stored in a subset of the storage nodes.
  • According to various embodiments the values are denormalized.
  • Certain embodiments further comprise the engine handling the second data structure without processing the context information.
  • In some embodiments the context information comprises time information, version information, or structure information.
  • Various embodiments further comprise the engine processing the context information to handle the second data structure.
  • According to particular embodiments the context information comprises compliance information.
  • Some embodiments further comprise the engine receiving the first data structure from a database, and the engine aging the first data structure within the big data platform.
  • The following detailed description and accompanying drawings provide a better understanding of the nature and advantages of embodiments.
  • BRIEF DESCRIPTION OF THE DRAWINGS
    • Figure 1 shows a simplified view of a system according to an embodiment.
    • Figure 2 shows a simplified process flow according to an embodiment.
    • Figure 3 shows a simplified view of an example of a system.
    • Figure 4A illustrates an example orchestration of a simplified HADOOP cluster.
    • Figure 4B shows a simplified view illustrating definition of the logical structure according to one embodiment.
    • Figure 4C shows an example of table data.
    • Figure 4D shows denormalization of the exemplary table data of Figure 4C.
    • Figure 5 illustrates hardware of a special purpose computing machine configured to perform archiving according to an embodiment.
    • Figure 6 illustrates an example computer system.
  • DETAILED DESCRIPTION
  • Described herein are methods and apparatuses configured to perform data archiving in a big data platform. In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present invention. The subject-matter of the present invention is achieved by the appended claims.
  • Embodiments relate to data archiving utilizing an existing big data platform (e.g., HADOOP) as a cost-effective target infrastructure for storage. Particular embodiments construct a logical structure (hereafter, "vault") in the big data platform so that a source, type, and context of the data is maintained, and metadata can be added to aid searching for snapshots according to a given time, version, and other considerations. A vaulting process transforms relationally stored data in an object view to allow for object-based retrieval or object-wise operations (such as destruction due to legal data privacy reasons), and provide references to also store unstructured data (e.g., sensor data, documents, streams) as attachments. A legacy archive extractor provides extraction services for existing archives, so that extracted information is stored in the same vault. This allows for cross queries over legacy data and data from other sources, facilitating the application of new analysis techniques by data scientists.
  • Figure 1 presents a simplified view of a system 100 according to an embodiment. In particular, a user 102 interacts with an application layer 104 to perform handling and analysis of data.
  • The data being manipulated by the application layer, is typically organized into a larger data object 106 comprising a plurality of data fields 108. Here the data object comprises the individual data fields A", B', G, and L, and its general structure is indicated by a rectangular shape.
  • As part of its manipulation and analysis in the application layer, the data object is typically stored and retrieved in an underlying database 110. Data is typically stored in the database for access over relatively short periods, on the order of months to a few years.
  • However, over longer periods (e.g., 5+ years) the database and/or the data object may evolve. For example, the database may be updated to exhibit added features/ functionality.
  • The data object may also change over time to reflect modified/different fields. For example, here the data object 106 (●A"B'GL) has evolved from earlier structures.
  • One earlier data structure (●A'BZ) included previous incarnations of the A and B fields, and a different field Z. A still earlier data structure (●AB) included only fields A and B in their original form.
  • As described herein, it may be desirable to archive these earlier data structures for reference at some future date, even over decades-long periods of time. For various reasons, however, (e.g., database evolution, lack of foreign key storage, cost), the database itself may not be a cost-effective vehicle for archiving data over such long periods.
  • Accordingly, Figure 1 further shows the application layer as in communication with a big data platform 120. This big data platform comprises an engine 122, and further comprises a vault 124. The engine is configured to receive a data object and context information 125 relevant thereto, and to store same together within the archive for retrieval over long periods.
  • Specifically, the vault is configured to store data (e.g., as previously present in various data objects), together with relevant reference data Rn providing various pieces of contextual information regarding that data. Figure 1 shows the vault as comprising a cluster 140 comprising a first plurality of nodes 142 configured to store data object content, and a second plurality of nodes 144 (e.g., a catalog) configured to store associated reference information.
  • For example, Figure 1 shows a copy 126 of the data A"B'GL of the current data object, archived together with reference data R1 indicating a time (date) of archiving of that data. Taken together, the data object 126 (●A"B'GLR1) can be referred to as a "snapshot". The large capacity volume of the big data platform allows many such snapshots of the same data object to be reliably stored in the vault, over long periods.
  • Even considering the large capacity of the big data platform, embodiments can archive data efficiently, conserving storage resources by allowing state of the art techniques (e.g., for compression, delta storage, or deduplication) to be applied in order to store identical/common/unchanged parts of the snapshots only once.
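  • By way of non-limiting illustration only, the storing of identical parts of snapshots only once can be sketched with content-addressed storage. The sketch below is one possible deduplication technique chosen for illustration; it is not stated to be the technique used by any embodiment. The names DedupStore, put_snapshot, and get_snapshot are hypothetical.

```python
import hashlib
import json


class DedupStore:
    """Hypothetical content-addressed store: each distinct field value is kept once."""

    def __init__(self):
        self.blocks = {}     # digest -> serialized field value (stored once)
        self.snapshots = []  # one manifest per snapshot: field name -> digest

    def put_snapshot(self, fields):
        """Store a snapshot and return its index; unchanged values reuse existing blocks."""
        manifest = {}
        for name, value in fields.items():
            payload = json.dumps(value, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(payload).hexdigest()
            self.blocks.setdefault(digest, payload)  # no second copy if already present
            manifest[name] = digest
        self.snapshots.append(manifest)
        return len(self.snapshots) - 1

    def get_snapshot(self, index):
        """Rebuild the field values of a snapshot from its manifest."""
        manifest = self.snapshots[index]
        return {name: json.loads(self.blocks[d]) for name, d in manifest.items()}
```

  • In this sketch, two snapshots that share an unchanged field value consume storage for that value only once, while each snapshot remains individually retrievable.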
  • Moreover, the vault can also be utilized to archive snapshots of different versions of the data object. Here, a snapshot of an earlier-version data object 128 (●ABR1) is also stored in the vault.
  • The reference information stored in the vault associated with the data, is not limited to the time information just discussed. For example, the archived data object 130 (●A'BZR1R2) may additionally include another type of reference information, for example identifying the specific version of the application software for which data object 130 was designed to be compatible. Such reference information can be valuable for recognizing the function of the archived data object many years after the fact.
  • A wide variety of reference information can be stored in the vault with the archived data. For example, data objects ●P and ●Q from a different source 132 entirely, may be relevant to the application and thus also sought to be archived.
  • Those data objects ●P and ●Q, however, may be organized according to a schema that is fundamentally different from that utilized by the current application (as indicated by their respective triangular and circular shapes). Accordingly, Figure 1 also shows the vault as storing data objects ●PR3 and ●QR1R3 that include reference information specifically identifying those different data object organizational schemas for future recognition.
  • Of course the nature of the reference information associated with the archived data is not limited to the time, software version, or data object organizational schema, and may include any type of information.
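  • As a minimal sketch under stated assumptions, the pairing of archived field values with reference data (Rn) into snapshots can be modeled as follows. The class names Snapshot and Vault, the method names, and the reference keys (archived_on, app_version) are hypothetical and are introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Dict, List


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical archived copy of a data object plus its reference data (Rn)."""
    object_id: str
    fields: Dict[str, Any]     # values of the data fields, e.g. A, B, Z
    reference: Dict[str, Any]  # context such as archiving date or software version


class Vault:
    """Hypothetical in-memory stand-in for the vault of the big data platform."""

    def __init__(self) -> None:
        self._snapshots: List[Snapshot] = []

    def archive(self, object_id: str, fields: Dict[str, Any], **reference: Any) -> Snapshot:
        # Copy the inputs so that later changes in the application do not alter the archive.
        snap = Snapshot(object_id, dict(fields), dict(reference))
        self._snapshots.append(snap)
        return snap

    def snapshots_of(self, object_id: str) -> List[Snapshot]:
        # Snapshots of different versions of the same object coexist in the vault.
        return [s for s in self._snapshots if s.object_id == object_id]


vault = Vault()
vault.archive("X", {"A": 1, "B": 2}, archived_on=date(2005, 1, 1))
vault.archive("X", {"A": 10, "B": 2, "Z": 3}, archived_on=date(2010, 1, 1), app_version="2.0")
```

  • Because the reference data is an open-ended mapping, further kinds of context information can be attached to a snapshot without changing the structure of snapshots archived earlier.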
  • Such archived information may be active or passive. Active archived information is "understood" and processed/enforced by the engine. Examples of active archived information can include but are not limited to:
    • lifetime conditions (e.g., expiry);
    • access policies;
    • compliance policies; and
    • others.
  • Thus a compliance property "legal hold active" may serve to prevent a client from modifying archived data. In another example, a retention property "expiry" that includes a future date, protects the archived data from being deleted. Also, access policies bound to the data in the vault serve to control visibility of the data to users.
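  • The enforcement of active archived information may be sketched as follows, purely for illustration. The names ActiveProperties, PolicyError, check_modify, check_delete, and can_read are hypothetical, and the representation of an access policy as a set of permitted users is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


class PolicyError(Exception):
    """Raised when an operation would violate an active property."""


@dataclass
class ActiveProperties:
    """Hypothetical active information that the engine understands and enforces."""
    legal_hold: bool = False
    expiry: Optional[date] = None                    # retention end date, if any
    readers: Set[str] = field(default_factory=set)   # assumed access policy


def check_modify(props: ActiveProperties) -> None:
    # "legal hold active" prevents a client from modifying archived data.
    if props.legal_hold:
        raise PolicyError("legal hold active: modification refused")


def check_delete(props: ActiveProperties, today: date) -> None:
    if props.legal_hold:
        raise PolicyError("legal hold active: deletion refused")
    # An expiry date lying in the future protects the data from deletion.
    if props.expiry is not None and today < props.expiry:
        raise PolicyError("retention period not yet expired: deletion refused")


def can_read(props: ActiveProperties, user: str) -> bool:
    # Access policies bound to the data control its visibility to users.
    return user in props.readers
```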
  • Passive information stored in the archive may include properties deemed useful for later understanding of the archived data, but which are not necessarily processed by an engine. Examples of passive information can include but are not limited to:
    • version number of the data object;
    • definition of the field(s);
    • time of data vaulting;
    • time of initial data creation;
    • status as original (master) data or data copy;
    • source of the data; and
    • many others.
  • According to embodiments, the processing engine of the big data platform is leveraged to create and populate the vault with archived data objects. An example of this vaulting process is described in connection with the example below.
  • Figure 1 also shows the database as being in communication with the big data platform. This communication may be in the form of a data aging process. As described herein, such a data aging process refers to the transfer of data from the database to the big data platform based upon considerations that may include an age of the data. This data aging avoids accumulating old/stale/infrequently accessed data within the relatively limited/ expensive confines of the database designed to accommodate detailed querying.
  • As shown in Figure 1, such data aging may differ from data archiving in a number of respects. One such respect is the form of storage of the data within the big data platform.
  • Specifically, in certain embodiments the structure of an aged data object may be preserved in the big data platform. Such tight coupling between the aged data and external data can facilitate its rapid recall to the database if necessary. However, tight coupling may not afford certain benefits accrued from looser coupling between archived data and external data as is further described below.
  • Data aging executed by the database in conjunction with the big data platform, may also differ from data archiving with respect to the location of the aged data. That is, the aged data may be located in the big data platform outside of the vault.
  • Finally, the data aging process may not necessarily produce the associated reference data that is useful for archiving purposes. Rather, the location of the aged data in its existing form (lacking context information), may simply be shifted to the big data platform to serve as a most cost effective vessel for containing information.
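  • For contrast with archiving, a data aging step may be sketched as below. This is an illustrative assumption only: the function name age_out, the last_accessed key, and the age threshold are hypothetical. Aged rows keep their existing structure, receive no context information, and are placed in an area outside of the vault.

```python
from datetime import date, timedelta
from typing import Any, Dict, List


def age_out(database_rows: List[Dict[str, Any]],
            aged_area: List[Dict[str, Any]],
            today: date,
            max_age_days: int) -> List[Dict[str, Any]]:
    """Move stale rows, unchanged, to an aged-data area outside the vault.

    Returns the rows that remain in the database.
    """
    cutoff = today - timedelta(days=max_age_days)
    remaining = []
    for row in database_rows:
        if row["last_accessed"] < cutoff:
            aged_area.append(row)  # same structure, no reference data added
        else:
            remaining.append(row)
    return remaining
```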
  • Figure 2 is a simplified flow diagram showing a method 200 according to an embodiment. In a first step 202, an engine of a big data platform receives as first input, a data object comprising various fields organized in a first structure.
  • In a second step 204, the engine receives a second input in the form of contextual information relevant to the data object. Such contextual information can include but is not limited to, time information, version information, source information, type information, and/or data object structure information.
  • In a third step 206, the engine causes the values of the fields to be stored in a vault of the big data platform, associated with the contextual information as a data object having a second structure. In certain embodiments the vault may comprise a cluster with storage and compute nodes. A subset of nodes may be established as a vault catalog storing metadata that attributes the objects. Another node subset may be established as content nodes for storing the objects themselves.
  • In an optional fourth step 208, the contextual information may be processed by the engine in subsequent data handling of the second structure (e.g., read and/or write access, expiry, compliance, etc.)
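  • The steps of method 200 can be sketched as follows, as a simplified illustration and not as a definitive implementation. The class VaultEngine, its method names, the use of in-memory dictionaries in place of cluster nodes, and the CRC-based choice of a content node are all assumptions made for the purpose of the sketch.

```python
import json
import zlib
from typing import Any, Dict, List, Tuple


class VaultEngine:
    """Hypothetical engine with a catalog and a set of content nodes."""

    def __init__(self, content_nodes: int = 3) -> None:
        self.catalog: Dict[str, Dict[str, Any]] = {}  # metadata attributing the objects
        self.content: List[Dict[str, str]] = [{} for _ in range(content_nodes)]

    def _node_for(self, key: str) -> int:
        # Assumed placement rule: a stable hash of the key selects a content node.
        return zlib.crc32(key.encode("utf-8")) % len(self.content)

    def archive(self, key: str, first_input: Dict[str, Any], context: Dict[str, Any]) -> None:
        # Steps 202 and 204: the data object and its contextual information are received.
        node = self._node_for(key)
        # Step 206: field values go to a content node, context goes to the catalog.
        self.content[node][key] = json.dumps(first_input, sort_keys=True)
        self.catalog[key] = {**context, "node": node}

    def retrieve(self, key: str) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        # Step 208 (optional) would evaluate the context here before granting access.
        entry = self.catalog[key]
        fields = json.loads(self.content[entry["node"]][key])
        context = {k: v for k, v in entry.items() if k != "node"}
        return fields, context
```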
  • Further details regarding implementation of archiving utilizing a big data platform according to embodiments, are now provided in connection with the following example.
  • Example
  • One example implementing archiving according to embodiments, is now described in connection with the HADOOP big data platform, available from the APACHE SOFTWARE FOUNDATION. Figure 3 shows a simplified view of the system according to this example.
  • Here, HADOOP is utilized as a most cost efficient target infrastructure for storage. A Universal Object Vault (logical structure) is built in HADOOP so that the source, type, and context of the data is maintained, and metadata can be added that aids in searching for snapshots at a given time, versions etc.
  • Figure 3 shows data archived in the vault of the big data platform, accessible to a variety of potential users, for a variety of potential purposes. For example, documents may be attached and entered into the vault for archiving directly. Figure 3 also shows the input of data to the vault via a Legacy Archive extractor process, which is described further below.
  • Figure 3 further shows access to data archived in the vault, by a native client. An example of such access could be by a data scientist who is seeking to perform analysis of the archived data.
  • The specific system shown in Figure 3 further includes storage of data from overlying application(s) (e.g., Suite, S/4 available from SAP SE of Walldorf, Germany) utilizing the HANA in-memory database, which is also available from SAP SE. A VELOCITY engine is also employed as part of data handling including scale-out extension of the HANA in-memory database.
  • Figure 3 further shows the HANA in-memory database utilizing the HADOOP platform for purposes of data aging. As shown by the short downward arrow, such data aging may or may not ultimately involve migration of aged data within the big data platform into the Universal Object Vault region itself.
  • Specific implementation of the example of Figure 3 is now described in connection with Figures 4A-4D. First, a HADOOP cluster is set up with storage and compute nodes. Figure 4A illustrates an example orchestration of a simplified HADOOP cluster according to an embodiment, including a vault 400.
  • A subset of nodes is established as a vault catalog (a structure for metadata that attributes the objects). Another subset is established as content nodes (storing the objects).
  • Finally, the catalog is made known to an external processing engine (e.g., here the HANA SQL optimizer). Connectivity of clients/adaptors (e.g., HANA SQL query executor) to content is established using standard HADOOP APIs.
  • Figure 4B shows a simplified view illustrating definition of the logical structure of the vault according to one embodiment. This logical structure comprises different areas for content, and a collection of metadata in the vault catalog. As mentioned above, in certain embodiments some metadata types may be processed by the engine to actively control archive behavior during later data handling activities.
  • Examples of metadata can include but are not limited to:
    • source (system) of the objects;
    • type and subtype(s) of objects (e.g., structured purchase order, scanned invoice, Internet of Things - IoT sensor data stream of type xyz, attachment to ..., etc.);
    • time of vaulting, time of creation;
    • indication of the data as a copy (snapshot), or original data being moved;
    • access policies;
    • lifecycle information (e.g., at least how long to keep the data, when to destroy at the latest, involved in a mitigation hold);
    • intra-object structure (properties, field length, data types).
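  • As an illustrative sketch only, several of the metadata types listed above can be gathered into a catalog entry and used to search for the snapshot valid at a given time. The names CatalogEntry and snapshot_as_of, and the selection rule (latest entry vaulted at or before the requested time), are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class CatalogEntry:
    """Hypothetical vault catalog entry carrying some of the listed metadata."""
    source: str                # source (system) of the object
    object_type: str           # e.g. "purchase_order"
    vaulted_at: datetime       # time of vaulting
    created_at: datetime       # time of creation
    is_copy: bool              # snapshot copy versus moved original
    structure: Dict[str, str]  # intra-object structure: field name -> data type


def snapshot_as_of(entries: List[CatalogEntry], source: str, object_type: str,
                   when: datetime) -> Optional[CatalogEntry]:
    """Return the most recent matching entry vaulted at or before `when`, if any."""
    candidates = [
        e for e in entries
        if e.source == source and e.object_type == object_type and e.vaulted_at <= when
    ]
    return max(candidates, key=lambda e: e.vaulted_at, default=None)
```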
  • A process for archiving data in the vault is now described. Relationally stored data is transformed into an object view to allow for object-based retrieval or object-wise operations (such as expiry/destruction in compliance with legal data privacy considerations), and to allow for references to unstructured data also stored such as attachments, sensor data, documents, streams.
  • This vaulting process allows conserving "informational relationship knowledge" over long periods of time (e.g., 5-30 years or even longer). By looking only at relational database tables, especially normalized ones, one cannot reconstruct the "natural" business object. This is because foreign keys, the common practice for expressing relationships, are not typically part of a database schema.
  • In addition, there are joins present in the application coding to reconstruct objects, and also additional dependencies may be "hidden" in the applications. But, such applications change (and even disappear entirely) over time, such that the object structure may eventually be lost. This limits the interpretability and usability of the data over the long term.
  • By contrast, upon performance of the vaulting process according to embodiments, objects may be materialized. For example, Figure 4C shows a sample of table data.
  • This example instance makes use of NoSQL structures (no strict relational representation as long-term data model). Accordingly, the table data of Figure 4C may be denormalized, such as by serializing into the natural object structure (object-wise clustering by executing joins and materializing the result set, using an open format). An example of denormalizing the table data of Figure 4C is shown in Figure 4D.
  • While the particular example of Figure 4D shows denormalization, other methods/formats are possible. Examples include but are not limited to, document store-like attribute/value pairs, XML serialization, JSON conversion, and Hive serialization/ deserialization (SerDes).
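  • Denormalization by executing a join and materializing the result in an open format may be sketched as follows. The tables used here are hypothetical and do not reproduce the table data of Figure 4C; the function name denormalize and the choice of JSON as the open format are assumptions for illustration.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List


def denormalize(headers: List[Dict[str, Any]], items: List[Dict[str, Any]],
                key: str) -> List[Dict[str, Any]]:
    """Join item rows to their header row and nest them into one object per header."""
    items_by_key = defaultdict(list)
    for item in items:
        # The join key is implied by nesting, so it is dropped from the nested rows.
        row = {name: value for name, value in item.items() if name != key}
        items_by_key[item[key]].append(row)
    return [{**header, "items": items_by_key.get(header[key], [])} for header in headers]


# Hypothetical normalized tables related only through the shared "order_id" column.
order_headers = [{"order_id": 1, "customer": "ACME"}]
order_items = [
    {"order_id": 1, "pos": 10, "material": "bolt"},
    {"order_id": 1, "pos": 20, "material": "nut"},
]

objects = denormalize(order_headers, order_items, "order_id")
serialized = json.dumps(objects[0], sort_keys=True)  # one materialized object
```

  • Each materialized object carries its own items, so that it can be retrieved or destroyed as a unit without reference to application coding that formerly performed the join.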
  • Embodiments of the vaulting process add context and metadata to preserve the history and interpretability of the primary data over decade-long time spans. Embodiments may also aid in eliminating redundant data stores. Possible examples include duplicate files of the SAP Data Retention Tool (DaRT) for tax and other audit purposes.
  • Embodiments may also provide extraction services that allow archiving of data from other sources. In connection with this example, such an extraction process may also be referred to as a Legacy Archive extractor process.
  • According to such extraction services for existing archives, the extracted information is stored in the same vault, allowing for cross queries over legacy data and data from other sources. This permits the application of new analysis techniques currently employed by, e.g. Data Scientists.
  • This example shows how SAP Installed Base customers may universally vault other archived data. SAP Archive Development Kit (ADK) archive files are fed into the vault structure by the following algorithm:
 GET dictionary information from archive file
 POPULATE vault catalog with this information for the assigned container
 WHILE data objects remain in archive file DO
 (
   GET next object
   WHILE records remain in data object DO
   (
     GET record structure
     GET record values
   )
   SERIALIZE object
   ADD metadata (time, archiving run, file, ...)
   WRITE object into vault
 )
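  • The algorithm above can be sketched in executable form as follows. The sketch does not read the actual ADK file format; instead, the archive file is assumed to be already parsed into a hypothetical in-memory mapping with "dictionary" and "objects" keys. The function name extract_archive, the layout of the vault mapping, and the metadata key names are likewise assumptions.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def extract_archive(archive: Dict[str, Any], vault: Dict[str, Any], container: str,
                    run_id: str, file_name: str, now: Optional[datetime] = None) -> int:
    """Feed the objects of one (already parsed) archive file into the vault.

    Returns the number of objects written.
    """
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    # GET dictionary information and POPULATE the vault catalog for the container.
    vault["catalog"][container] = archive["dictionary"]
    written = 0
    for data_object in archive["objects"]:  # WHILE data objects in archive file
        records = []
        for record in data_object:          # WHILE records in data object
            records.append({"structure": record["structure"],
                            "values": record["values"]})
        serialized = json.dumps(records, sort_keys=True)  # SERIALIZE object
        vault["content"].append({                         # WRITE into vault
            "container": container,
            "object": serialized,
            # ADD metadata (time, archiving run, file)
            "meta": {"vaulted_at": timestamp, "archiving_run": run_id, "file": file_name},
        })
        written += 1
    return written
```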
  • The result is uniform storage in the Universal Vault that allows querying across vault containers independent of the original system. Note that this is not a simple relational reload: archives typically contain copies of data, with duplicate keys or different versions (of, e.g., master data) at a given point in time.
  • This historical context (with relaxed integrity constraints) is preserved by adding the suggested meta data. And, query processing on top of HADOOP Distributed File System (HDFS) is more flexible and scales well as compared with conventional archive indexing for static (prepared) searches over archived data.
  • According to embodiments, utilization of a vaulting process with a vault structure constructed within a big data platform may offer one or more benefits. One possible advantage is relatively loose coupling of information.
  • In particular, by not storing the archived data according to exactly the same structure as employed within the application layer and/or database, flexibility is imparted to the archiving system. For example, conversion of data outside the archive (e.g., to accommodate application and/or database upgrade) may be accomplished without necessarily requiring conversion of archived data within the vault at the same time.
  • Loose coupling between data internal/external to the archive vault, also facilitates possible desirable separation of backup and/or recovery functions. That is, the data in the application can be backed up and/or recovered, independent of the state of the data within the vault. Thus while under some circumstances the vault may be useful for such backup/recovery processes, under other circumstances separate mechanism(s) dedicated to performing backup/recovery may be better suited to those roles.
  • Loose coupling between data in the vault and data external thereto, may also promote data access. That is, such a configuration allows n:m server usage for client systems, permitting data archiving services to be offered to different clients.
  • It is further noted that the data archive vault is self-contained. That is, relational knowledge (even over decade-long periods) is conserved. Reference to object views of data within the vault, offers explicit recognition and appreciation of associated reference data (e.g., data versioning info, dates, other relevant metadata) providing context enrichment.
  • Such an approach may contrast with the state of data conventionally stored in a database. Specifically, a conventional database schema may not preserve this type of context information (e.g., foreign keys) at all, or may not store it in a consistent format amenable to preservation/access over (decade) long time periods.
  • It is also emphasized that data archive vaulting according to embodiments provides flexibility, in that the data is not required to come from one particular source (e.g., the suite of related business applications available from SAP). Rather, the archive can accommodate data from a heterogeneous mixture of sources, particularly considering implementation of the Legacy Archive Extractor process.
  • It is further noted that an archiving approach utilizing a vault within a big data platform according to embodiments, can readily be tailored to meet various compliance requirements (e.g., regulatory, contractual) arising within the data storage environment. Examples can include mandated expiry/deletion of personal data, and restrictions on the modification of data subsequent to its initial storage.
  • It is further noted that a data archive vault may be implemented in a big data platform, and may provide standard format/access (e.g., PARQUET, SPARK) for open processing by newly-developed applications.
  • Finally, a data archive vault may facilitate easier transition based on aging checks. That is, the power and flexibility associated with data archiving according to embodiments, may promote the performance of separate data aging processes in a more efficient and/or effective manner.
  • Figure 5 illustrates hardware of a special purpose computing machine configured to implement data archiving according to an embodiment. In particular, computer system 501 comprises a processor 502 that is in electronic communication with a non-transitory computer-readable storage medium 503. This computer-readable storage medium has stored thereon code 505 corresponding to a data archive vault. Code 504 corresponds to an engine. Code may be configured to reference data stored in a database of a non-transitory computer-readable storage medium, for example as may be present locally or in a remote database server. Software servers together may form a cluster or logical network of computer systems programmed with software programs that communicate with each other and work together in order to process requests.
  • It is noted that in the specific embodiment of Figure 5, the engine is shown as being part of a database. Such an embodiment can correspond to applications performing processing by a powerful engine available as part of an in-memory database (e.g., the HANA in-memory database available from SAP SE of Walldorf, Germany). However this is not required and in certain embodiments the engine may be implemented in other ways, for example as part of an overlying application layer.
  • An example computer system 600 is illustrated in Figure 6. Computer system 610 includes a bus 605 or other communication mechanism for communicating information, and a processor 601 coupled with bus 605 for processing information. Computer system 610 also includes a memory 602 coupled to bus 605 for storing information and instructions to be executed by processor 601, including information and instructions for performing the techniques described above, for example. This memory may also be used for storing variables or other intermediate information during execution of instructions to be executed by processor 601. Possible implementations of this memory may be, but are not limited to, random access memory (RAM), read only memory (ROM), or both. A storage device 603 is also provided for storing information and instructions. Common forms of storage devices include, for example, a hard drive, a magnetic disk, an optical disk, a CD-ROM, a DVD, a flash memory, a USB memory card, or any other medium from which a computer can read. Storage device 603 may include source code, binary code, or software files for performing the techniques above, for example. Storage device and memory are both examples of computer readable mediums.
  • Computer system 610 may be coupled via bus 605 to a display 612, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. An input device 611 such as a keyboard and/or mouse is coupled to bus 605 for communicating information and command selections from the user to processor 601. The combination of these components allows the user to communicate with the system. In some systems, bus 605 may be divided into multiple specialized buses.
  • Computer system 610 also includes a network interface 604 coupled with bus 605. Network interface 604 may provide two-way data communication between computer system 610 and the local network 620. The network interface 604 may be a digital subscriber line (DSL) or a modem to provide data communication connection over a telephone line, for example. Another example of the network interface is a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links are another example. In any such implementation, network interface 604 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.
  • Computer system 610 can send and receive information, including messages or other interface actions, through the network interface 604 across a local network 620, an Intranet, or the Internet 630. For a local network, computer system 610 may communicate with a plurality of other computer machines, such as server 615. Accordingly, computer system 610 and server computer systems represented by server 615 may form a cloud computing network, which may be programmed with processes described herein. In the Internet example, software components or services may reside on multiple different computer systems 610 or servers 631-635 across the network. The processes described above may be implemented on one or more servers, for example. A server 631 may transmit actions or messages from one component, through Internet 630, local network 620, and network interface 604 to a component on computer system 610. The software components and processes described above may be implemented on any computer system and send and/or receive information across a network, for example.
  • The above description illustrates various exemplary implementations of the present invention.
  • Claims (8)

    1. A computer-implemented method for archiving data in a vault (124, 400) of a big data platform (120), comprising:
      receiving, by an engine (122) of the big data platform, from an application layer (104), a first input of relationally stored data comprising a plurality of fields (108) organized in a first data structure;
      receiving, by the engine (122), from the application layer (104), context information (125) relevant to the first data structure, wherein the context information comprises time information, version information, source information, type information, and/or data object structure information;
      archiving, by the engine (122), in the vault (124, 400) of the big data platform (120), values of the plurality of fields (108) and the context information (125) organized as a second data structure different from the first data structure,
      the second data structure including a data object (X) comprising the values of the plurality of fields and reference data (Rn) wherein the reference data are providing various pieces of the context information (125) regarding the data object (X); and
      wherein the vault is configured and utilized to archive snapshots (126, 130, 128) of different versions of the data object (X) and reference data (Rn) regarding the data object (X); and
      wherein the vault allows object-based retrieval or object-wise operations using the archived data objects (X) and associated reference data (Rn).
    2. A method as in claim 1
      wherein the values are denormalized.
    3. A method as in claims 1 or 2 further comprising the step of handling the second data structure without processing the context information (125).
    4. A method as in claims 1 or 2 further comprising the steps of:
      processing, by the engine (122), the context information (125) to handle the second data structure; and / or
      receiving, by the engine (122), the first data structure from a database (110); and
      aging, by the engine (122), the first data structure within the big data platform (120).
    5. A method as in claim 4 wherein the context information (125) comprises compliance information.
    6. A non-transitory computer readable storage medium embodying a computer program for performing all method steps of the method of claims 1 to 5 when said program is run on a computer.
    7. A computer system (501, 610) for archiving data in a vault (124, 400) comprising:
      one or more processors (502, 601);
      a software program, executable on said computer system (501, 610), the software program configured to cause an engine (122) of a big data platform (120) to:
      receive a first input of relationally stored data comprising a plurality of fields (108) organized in a first data structure;
      receive context information (125) relevant to the first data structure, wherein the context information comprises time information, version information, source information, type information, and/or data object structure information;
      archive in the vault (124, 400) of the big data platform (120), values of the plurality of fields (108) and the context information (125), organized as a second data structure different from the first data structure, the second data structure including a data object (X) comprising the values of the plurality of fields and reference data (Rn) wherein the reference data are providing various pieces of the context information (125) regarding the data object (X); and
      wherein the vault is configured and utilized to archive snapshots (126, 130, 128) of different versions of the data object (X) and reference data (Rn) regarding the data object (X); and
      wherein the vault allows object-based retrieval or object-wise operations using the archived data objects (X) and associated reference data (Rn).
    8. A computer system (501, 610) as in claim 7
      wherein the software program further includes code to cause the engine (122) to process the context information (125) in handling the second data structure; and / or
      wherein the engine (122) is configured to denormalize the values; and / or
      wherein the software program is further configured to cause the engine (122) to age the first data structure received from a database (110).
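
    Illustrative sketch (not part of the claims). The method of claims 1 and 2 can be pictured with a small in-memory model: relational rows are denormalized into one data object X, the context information is kept alongside it as reference data Rn, each version is stored as a snapshot, and retrieval is by object rather than by table. The names `Vault`, `Snapshot`, `archive`, and `retrieve`, the dotted `table.column` key layout, and the sample order data are all assumptions made for this example; the patent does not prescribe any of them.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Snapshot:
    """One archived version of a data object X with its reference data Rn."""
    object_id: str
    values: Dict[str, Any]      # denormalized field values (data object X)
    reference: Dict[str, Any]   # context information (reference data Rn)


class Vault:
    """Hypothetical in-memory stand-in for the vault, for illustration only."""

    def __init__(self) -> None:
        # object_id -> {version -> Snapshot}
        self._store: Dict[str, Dict[int, Snapshot]] = {}

    def archive(self, object_id: str,
                rows: Dict[str, Dict[str, Any]],
                context: Dict[str, Any]) -> Snapshot:
        """Flatten relational rows (first data structure) into one object
        and store it with its context as a versioned snapshot."""
        values = {
            f"{table}.{column}": value
            for table, row in rows.items()
            for column, value in row.items()
        }
        snapshot = Snapshot(object_id, values, dict(context))
        # Assumption: the context carries an integer "version" entry.
        self._store.setdefault(object_id, {})[context["version"]] = snapshot
        return snapshot

    def retrieve(self, object_id: str,
                 version: Optional[int] = None) -> Snapshot:
        """Object-based retrieval: latest snapshot unless a version is given."""
        versions = self._store[object_id]
        if version is None:
            version = max(versions)
        return versions[version]


vault = Vault()
vault.archive(
    "order-1",
    {"header": {"id": 1, "customer": "ACME"}, "item": {"sku": "A1", "qty": 2}},
    {"version": 1, "source": "ERP", "time": "2015-08-05"},
)
vault.archive(
    "order-1",
    {"header": {"id": 1, "customer": "ACME"}, "item": {"sku": "A1", "qty": 3}},
    {"version": 2, "source": "ERP", "time": "2016-08-04"},
)
latest = vault.retrieve("order-1")
first = vault.retrieve("order-1", version=1)
```

    In this sketch, both versions of the object remain retrievable side by side, which mirrors the snapshot requirement of claim 1; a real vault on a big data platform would persist the snapshots rather than hold them in a dictionary.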
    EP16001738.0A 2015-08-05 2016-08-04 Data archive vault in big data platform Active EP3128445B1 (en)

    Applications Claiming Priority (1)

    Application Number Priority Date Filing Date Title
    US14/818,992 US10095717B2 (en) 2015-08-05 2015-08-05 Data archive vault in big data platform

    Publications (2)

    Publication Number Publication Date
    EP3128445A1 (en) 2017-02-08
    EP3128445B1 (en) 2019-10-02

    Family

    ID=56609651

    Family Applications (1)

    Application Number Title Priority Date Filing Date
    EP16001738.0A Active EP3128445B1 (en) 2015-08-05 2016-08-04 Data archive vault in big data platform

    Country Status (2)

    Country Link
    US (1) US10095717B2 (en)
    EP (1) EP3128445B1 (en)

    Cited By (1)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    US11847092B2 (en) 2021-02-02 2023-12-19 Business Mobile Ag Extracting SAP archive data on a non-original system

    Families Citing this family (18)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    US10453076B2 (en) * 2016-06-02 2019-10-22 Facebook, Inc. Cold storage for legal hold data
    US10956467B1 (en) * 2016-08-22 2021-03-23 Jpmorgan Chase Bank, N.A. Method and system for implementing a query tool for unstructured data files
    US10430167B2 (en) * 2017-03-22 2019-10-01 Sap Se Redistribution of data processing tasks
    US10795869B2 (en) * 2017-10-05 2020-10-06 Sap Se Automatic enforcement of data retention policy for archived data
    US11392393B2 (en) 2018-02-08 2022-07-19 Sap Se Application runtime configuration using design time artifacts
    US11426318B2 (en) * 2020-05-20 2022-08-30 Augustine Biomedical + Design, LLC Medical module including automated dose-response record system
    US11432982B2 (en) 2018-03-26 2022-09-06 Augustine Biomedical + Design, LLC Relocation module and methods for surgical equipment
    CN108573048A (en) * 2018-04-19 2018-09-25 中译语通科技股份有限公司 A kind of multidimensional data cut-in method and system, big data access system
    CN110807094A (en) * 2018-07-20 2020-02-18 林威伶 Big data analysis, prediction and data visualization system and device for legal document
    US11893026B2 (en) * 2019-04-02 2024-02-06 Sap Se Advanced multiprovider optimization
    CN111159192B (en) * 2019-12-30 2023-09-05 北京因特睿软件有限公司 Big data based data warehousing method and device, storage medium and processor
    CN111475490B (en) * 2020-04-28 2023-04-25 国网河南省电力公司信息通信公司 Data management system and method of data directory system
    US11645247B2 (en) * 2020-08-21 2023-05-09 Sap Se Ingestion of master data from multiple applications
    US11726846B2 (en) 2020-08-21 2023-08-15 Sap Se Interface for processing sensor data with hyperscale services
    US11803563B2 (en) * 2020-08-27 2023-10-31 Shopify Inc. Methods and systems for processing and storing streamed event data
    US11080264B1 (en) * 2020-10-02 2021-08-03 ActionIQ, Inc. Mutable data ingestion and storage
    CN112286882A (en) * 2020-10-30 2021-01-29 山东黄金矿业(莱州)有限公司三山岛金矿 Method for acquiring remote unstructured data to Hadoop platform in industrial production field
    EP4315020A1 (en) * 2021-03-30 2024-02-07 Jio Platforms Limited System and method of data ingestion and processing framework

    Family Cites Families (9)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    US6490620B1 (en) * 1997-09-26 2002-12-03 Worldcom, Inc. Integrated proxy interface for web based broadband telecommunications management
    US6799184B2 (en) 2001-06-21 2004-09-28 Sybase, Inc. Relational database system providing XML query support
    US7627620B2 (en) * 2004-12-16 2009-12-01 Oracle International Corporation Data-centric automatic data mining
    US8396893B2 (en) 2008-12-11 2013-03-12 Sap Ag Unified configuration of multiple applications
    US9135133B2 (en) * 2009-09-28 2015-09-15 Softlayer Technologies, Inc. Metric object tracking system
    US20120143912A1 (en) 2010-12-05 2012-06-07 Unisys Corp. Extending legacy database engines with object-based functionality
    US8949175B2 (en) 2012-04-17 2015-02-03 Turn Inc. Meta-data driven data ingestion using MapReduce framework
    US9031932B2 (en) 2012-09-06 2015-05-12 Oracle International Corporation Automatic denormalization for analytic query processing in large-scale clusters
    US9311187B2 (en) 2013-01-04 2016-04-12 Cleversafe, Inc. Achieving storage compliance in a dispersed storage network

    Non-Patent Citations (1)

    * Cited by examiner, † Cited by third party
    Title
    None *

    Also Published As

    Publication number Publication date
    US10095717B2 (en) 2018-10-09
    EP3128445A1 (en) 2017-02-08
    US20170039227A1 (en) 2017-02-09

    Similar Documents

    Publication Publication Date Title
    EP3128445B1 (en) Data archive vault in big data platform
    CN111164585B (en) Performing in-memory rank analysis queries on externally resident data
    EP3026578B1 (en) N-bit compressed versioned column data array for in-memory columnar stores
    US8051045B2 (en) Archive indexing engine
    US9639542B2 (en) Dynamic mapping of extensible datasets to relational database schemas
    US8914414B2 (en) Integrated repository of structured and unstructured data
    Silberschatz et al. Database system concepts
    US8442982B2 (en) Extended database search
    US8311974B2 (en) Modularized extraction, transformation, and loading for a database
    US20160063030A1 (en) Query integration across databases and file systems
    US8429117B2 (en) Data loading method for a data warehouse
    JP2018136939A (en) Method for updating database based on spreadsheet for generating update data-categorized optimal query sentence
    Das et al. A study on big data integration with data warehouse
    Nurmamatovich et al. The SQL server language and its structure
    Batra SQL primer
    Brahmia et al. Versioning schemas of JSON-based conventional and temporal big data through high-level operations in the τJSchema framework
    Sreemathy et al. Data validation in ETL using TALEND
    Dutta Distributed computing technologies in big data analytics
    Sachdeva et al. Comparison of data processing tools in hadoop
    Nicola et al. DB2 pureXML cookbook: master the power of the IBM hybrid data server
    Sharma et al. MAchine readable cataloging to MAchine understandable data with distributed big data management
    Titirisca ETL as a Necessity for Business Architectures.
    Tahiri Alaoui An approach to automatically update the Spanish DBpedia using DBpedia Databus
    Alam Data migration: relational RDBMS to non-relational NoSQL
    Thirifays et al. E‐ARK Dissemination Information Package (DIP) Final Specification

    Legal Events

    Date Code Title Description
    PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

    Free format text: ORIGINAL CODE: 0009012

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

    AK Designated contracting states

    Kind code of ref document: A1

    Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

    AX Request for extension of the european patent

    Extension state: BA ME

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

    17P Request for examination filed

    Effective date: 20170727

    RBV Designated contracting states (corrected)

    Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: EXAMINATION IS IN PROGRESS

    17Q First examination report despatched

    Effective date: 20180322

    REG Reference to a national code

    Ref country code: DE

    Ref legal event code: R079

    Ref document number: 602016021484

    Country of ref document: DE

    Free format text: PREVIOUS MAIN CLASS: G06F0017300000

    Ipc: G06F0016210000

    GRAP Despatch of communication of intention to grant a patent

    Free format text: ORIGINAL CODE: EPIDOSNIGR1

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: GRANT OF PATENT IS INTENDED

    RIC1 Information provided on ipc code assigned before grant

    Ipc: G06F 16/22 20190101ALI20190228BHEP

    Ipc: G06F 16/21 20190101AFI20190228BHEP

    Ipc: G06F 16/25 20190101ALI20190228BHEP

    INTG Intention to grant announced

    Effective date: 20190403

    GRAS Grant fee paid

    Free format text: ORIGINAL CODE: EPIDOSNIGR3

    GRAA (expected) grant

    Free format text: ORIGINAL CODE: 0009210

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: THE PATENT HAS BEEN GRANTED

    AK Designated contracting states

    Kind code of ref document: B1

    Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

    REG Reference to a national code

    Ref country code: GB

    Ref legal event code: FG4D

    REG Reference to a national code

    Ref country code: CH

    Ref legal event code: EP

    Ref country code: AT

    Ref legal event code: REF

    Ref document number: 1186964

    Country of ref document: AT

    Kind code of ref document: T

    Effective date: 20191015

    REG Reference to a national code

    Ref country code: DE

    Ref legal event code: R096

    Ref document number: 602016021484

    Country of ref document: DE

    REG Reference to a national code

    Ref country code: IE

    Ref legal event code: FG4D

    REG Reference to a national code

    Ref country code: NL

    Ref legal event code: MP

    Effective date: 20191002

    REG Reference to a national code

    Ref country code: LT

    Ref legal event code: MG4D

    REG Reference to a national code

    Ref country code: AT

    Ref legal event code: MK05

    Ref document number: 1186964

    Country of ref document: AT

    Kind code of ref document: T

    Effective date: 20191002

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: ES

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: GR

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20200103

    Ref country code: NO

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20200102

    Ref country code: LT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: AT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: PL

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: NL

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: LV

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: SE

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: BG

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20200102

    Ref country code: FI

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: PT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20200203

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: CZ

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: IS

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20200224

    Ref country code: RS

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: HR

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: AL

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    REG Reference to a national code

    Ref country code: DE

    Ref legal event code: R097

    Ref document number: 602016021484

    Country of ref document: DE

    PG2D Information on lapse in contracting state deleted

    Ref country code: IS

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: DK

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: EE

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: RO

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: IS

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20200202

    PLBE No opposition filed within time limit

    Free format text: ORIGINAL CODE: 0009261

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: SM

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: SK

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: IT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    26N No opposition filed

    Effective date: 20200703

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: SI

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: MC

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    REG Reference to a national code

    Ref country code: CH

    Ref legal event code: PL

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: CH

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20200831

    Ref country code: LI

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20200831

    Ref country code: LU

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20200804

    REG Reference to a national code

    Ref country code: BE

    Ref legal event code: MM

    Effective date: 20200831

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: IE

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20200804

    Ref country code: BE

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20200831

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: TR

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: MT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    Ref country code: CY

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: MK

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

    Effective date: 20191002

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: GB

    Payment date: 20230822

    Year of fee payment: 8

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: FR

    Payment date: 20230825

    Year of fee payment: 8

    Ref country code: DE

    Payment date: 20230821

    Year of fee payment: 8