US20160292255A1 - Hybrid data management system and method for managing large, varying datasets - Google Patents
Hybrid data management system and method for managing large, varying datasets Download PDFInfo
- Publication number
- US20160292255A1 (application US15/182,498)
- Authority
- US
- United States
- Prior art keywords
- data
- data store
- store
- metadata
- preferred
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/282—Hierarchical databases, e.g. IMS, LDAP data stores or Lotus Notes
-
- G06F17/30589—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/907—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G06F17/30557—
-
- G06F17/30997—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/1737—Details of further file system functions for reducing power consumption or coping with limited storage space, e.g. in mobile devices
Definitions
- FIGS. 1A-1B illustrate two exemplary hybrid database management systems for managing large and varying datasets, in accordance with the principles of the invention.
- FIG. 2 is a process for implementing a hybrid database management system for managing large and varying datasets, in accordance with the principles of the invention.
- FIG. 3 is one example of the process of FIG. 2 above for implementing a hybrid database management system that more efficiently stores and manages both small and large datasets.
- the present disclosure relates generally to a hybrid data management/storage system which is comprised of two or more integrated data management systems.
- Metadata is used to link the data in a first data management system (e.g., small data store, such as NoSQL database) with the data in at least one additional connected data management system (e.g., large data store, such as an OS file system).
- the above metadata linkage may allow the first data management system to share all the same algorithms for data management in a distributed system, e.g., partitioning, replication, migration (in the case of scale-out), repair (in the case of recovery from a failure), backup, etc., with the one or more connected additional data management systems, and therefore leverage the benefits of each since different types of data management system may tend to operate more efficiently on certain types/sizes of data than on other types/sizes, but without the complexities of using different data management algorithms in each type of system.
- another aspect of the invention is to determine one or more characteristics of an incoming data object and, based on the presence of such characteristics, direct the underlying raw data of the incoming object to the one of the connected database management systems that is most suited or configured for the storage and management of such data.
- some data stores may contain additional indexing/searching functionality that is specific to a particular data type and, therefore, could store and manage such data in a more efficient manner than other data stores.
- raw data of an incoming object may be simultaneously stored in more than one data store.
- an email object might be stored in one store optimized for mail store and retrieval, and in another store simultaneously that is optimized for indexing and searching.
- the metadata corresponding to such data may then reflect the linkage to both such data stores.
- such characteristics may include the data object's size and/or data type (media data versus text data). Additional characteristics may include access pattern information corresponding to the access pattern or quality of service for the data object, such as the fact that the data object will be rarely accessed or modified, versus frequently modified, or that the incoming data object is from a user who is on a different price plan, so that it can be stored in a slower/cheaper data store. Additional object characteristics may further include strong or weak consistency (the write returns after all replicas are updated, or after only some replicas are updated), as well as remote versus local (the application specifies that this data object must be replicated to at least one remote data center).
- some databases may be better optimized for the storage of documents having a particular structure (e.g., JavaScript Object Notation).
- Such a database may tend to perform less efficiently when storing data that is not of the preferred structure.
- the present invention would allow such a database to receive the data object, recognize that it is of an undesirable structure, and pass it to a separate, linked database that is more suited for the type of received data object.
- Another aspect of the invention is to enable an external source, such as an application or user, to interface with only a single data store.
- This initial or first data store may use the same metadata structure for all data objects, while selectively storing the underlying raw data forming the objects in one of the plurality of connected data management systems.
- the management functions from this first data store (where all metadata is stored since it is of a data size that is most suitable for storage in the first data store) can be used to manage a plurality of additional connected data stores, rather than having each of the separate data stores rely on their own management functions. Since the plurality of connected data stores are able to piggyback off of the management functions of the first data store, the only data store to which the user/application interfaces, the overall complexity of the system can be greatly reduced.
- the terms “a” or “an” shall mean one or more than one.
- the term “plurality” shall mean two or more than two.
- the term “another” is defined as a second or more.
- the terms “including” and/or “having” are open ended (e.g., comprising).
- Reference throughout this document to “one embodiment”, “certain embodiments”, “an embodiment” or similar term means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment.
- the particular features, structures, or characteristics may be combined in any suitable manner on one or more embodiments without limitation.
- the elements of the invention are essentially the code segments to perform the necessary tasks.
- the code segments can be stored in a processor readable medium or transmitted by a computer data signal.
- the “processor readable medium” may include any medium that can store information. Examples of the processor readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory or other non-volatile memory, a floppy diskette, a CD-ROM, an optical disk, a hard disk, etc.
- server means a functionally-related group of electrical components (e.g., processor, memory, network interface, etc.) such as a computer system in a networked environment which may include both hardware and software components, or alternatively only the software components that, when executed, carry out certain functions.
- the “server” may be further integrated with one or more database management systems which comprise one or more associated databases.
- database management system means one or more computer programs that control the creation, maintenance, and use of an integrated collection of data records, files, and other data objects which are stored on processor readable media.
- the database management system is usable by external sources (e.g., applications/users) to access and manipulate the stored data, as well as enforce data integrity, security, manage concurrent accesses, and recover/restore data.
- FIG. 1A depicts one embodiment of a hybrid data management system 100 configured to implement one or more aspects of the invention.
- the system 100 is comprised of one or more servers that are accessible by an external source in the form of application/user 110 which is configured to interface with a first database management system—the first data store 120 , which may be a NoSQL-based database system.
- the first data store 120 may be similarly configured to communicate with a second database management system—the second data store 130 , which may comprise a file system that is optimized for storage of large data blocks (e.g., Linux ext3, Linux ext4, Hadoop Distributed File System, etc.).
- While the hybrid data management system 100 is comprised of only two data stores, it should be appreciated that N additional data stores may be similarly connected to the first data store 120, such as is the case in the exemplary system described below with reference to FIG. 1B.
- file systems may be any local file system having basic data management features for distribution, replication, etc., as well as simple block data storage systems such as CD-ROM, DVD, magnetic tape system, etc.
- the first data store 120 may be configured to provide built-in functionality for data partitioning, automatic replication, incremental backups, node expansion, quorum calculation, etc. It should be appreciated that the first data store 120 may be implemented as a NoSQL-type database such that, for smaller data sizes, it is able to provide higher performance due, for example, to the efficient write and read paths using a write-ahead log, in-memory cache, and other features.
- One aspect of the invention is to recognize that data objects having certain characteristics, such as the size or type of the data objects, may be more efficiently stored and managed by a different type of data store, such as the second large data store 130 .
- the invention provides a hybrid database solution which improves overall performance by storing certain data objects (e.g., small data objects) in the first data store 120 (e.g., a NoSQL-type database), while moving the raw data of other types of data objects (e.g., large data objects) into the second data store 130 (e.g., file system).
- the threshold of what comprises small data objects versus large data objects may depend, at least in part, on the particular system hardware and workload.
- the threshold may be a user-definable property of the system 100 .
- typical databases for storing and managing social networking data tend to be optimized for data objects in the 1K to 10K range.
- a data object that is greater than 100K may be considered large.
- databases optimized for email data tend to store data in the 1K to 100K as a normal case.
- a data object of greater than 1M may be considered “large”, such as those emails with large file attachments.
- these ranges are likely to evolve as well. However, the problems of attempting to store and manage disparate data on a particular database will likely persist.
- the first data store 120 may further be configured to determine how particular incoming data (e.g., from application/user 110 ) should be stored in the envisioned distributed manner.
- a routing layer 140 may first determine, based on a comparison of the size of an incoming data object to a threshold value, that the incoming data object should be stored in the first data store 120 , or alternatively in a file system that is comprised of the second data store 130 .
- While the routing layer 140 may be implemented as a proxy layer, it may equally be implemented in numerous other forms of decision logic, either in the form of software, hardware or a combination thereof.
- Metadata for the local data objects 150 may similarly be stored as separate metadata 160 , as shown in FIG. 1A . While in certain embodiments the metadata for the local data objects 150 may be stored separately as metadata 160 , in other embodiments such metadata may be stored together with the underlying raw data, as local data objects 150 . Such metadata may be referred to as local-object metadata since the information corresponds to an object which has been stored locally, i.e., in the first data store.
- If the routing layer 140 determines that the particular incoming data object has one or more particular characteristics for which the second data store 130 is better configured (e.g., larger than the predefined threshold value), then the raw data of the incoming object would be passed to the second data store 130 and stored as raw data 170.
- Metadata for each such incoming data object whose raw data is stored in the second data store 130 may nonetheless be stored by the first data store 120 as metadata 160 .
- metadata may be referred to as remote-object metadata since the information corresponds to an object which has been stored remotely, i.e., in the second data store.
- metadata 160 may comprise descriptive information for such large data objects, and may further include associative information that links a particular metadata entry with the corresponding raw data 170 to which it pertains.
- Each such metadata entry may include, for example, content-type, access control list, etc.
- the size of the metadata per object should preferably be small, such as on the order of a few hundred bytes.
- the above-referenced remote-object metadata may be described as a placeholder object such that, when management and/or access operations are performed on it in the first data store 120 , the data management system 100 automatically undertakes a corresponding operation on the associated raw data in whichever data store it is stored. In this fashion, only the data management functions of the first data store 120 need be used. However, it may be the case that utilizing the data management functions of the first data store 120 may result in some unintended negative impact on performance. For example, in the event that the first data store 120 writes all updates to data objects in a RAM and disk cache, the RAM and disk cache may fill up quickly when many big data objects are to be written. In such cases, it is a further aspect of the invention to selectively bypass the data management function of the first data store 120 , and instead directly utilize the corresponding functions (i.e., read/write) of the particular data store at issue.
- FIG. 1B depicts another example of the hybrid data management system 100 configured with a plurality of data stores 130 1-n, in addition to the first data store 120 with which the application/user 110 interfaces directly. All of the description set forth above with respect to FIG. 1A is hereby incorporated and applicable to the example of FIG. 1B.
- the first data store 120 may be configured to provide built-in functionality for data partitioning, automatic replication, incremental backups, node expansion, quorum calculation, and may be implemented as a NoSQL-type database such that, for smaller data sizes, it is able to provide higher performance.
- the first data store 120 may further be configured to determine how particular incoming data (e.g., from application/user 110 ) should be stored based on the incoming data object having one or more recognized characteristics.
- the routing layer 140 may be configured to first determine certain characteristics of the incoming data object. The routing layer may determine which of the available data stores should be used to store the raw data for the incoming object. As previously mentioned, this determination may be based on identifying which of the available data stores is best configured to store and manage data having the one or more determined characteristics.
- Metadata corresponding to the incoming data object may then be stored preferably in the first data store 120 and without regard to which of the plurality of data stores (e.g., first data store 120 and plurality of additional data stores 130 1-n ) was used to store the corresponding raw data.
- While FIGS. 1A-1B set forth two exemplary system configurations for implementing hybrid data management systems in accordance with the principles of the invention, it should further be appreciated that other known or obvious design variations are equally envisioned and within the scope of the disclosure.
- process 200 begins at block 210 when a hybrid data management system (e.g., system 100 of FIGS. 1A-1B) receives a write request from an external source (e.g., application/user 110 of FIGS. 1A-1B), such as in the form of a ‘PUT’ object operation.
- object operations may be in any protocol, such as S3 or HTTP.
- the incoming object may have one or more predetermined characteristics which may be detected/determined before the object is written/stored by the system. To that end, at block 220 of process 200 a determination may be made as to which of N possible predefined characteristics the incoming data object may have. In certain embodiments, one of the predefined characteristics may be a particular size range, data type, frequency or pattern of accesses/modifications, quality of service, etc.
- process 200 may continue to block 230 where the process may then identify which of a plurality of connected data stores (e.g., first data store 120 , plurality of data stores 130 1-n , etc.) would be preferable for storage of the incoming data object's raw data. In certain embodiments, this determination may be based on identifying which of the available data stores is more optimally configured (optimized) to store data exhibiting the determined characteristic(s) from block 220 .
- Process 200 may then continue to block 240 where the incoming data object's raw data may then be routed to the identified preferred available data store.
- metadata associated with the incoming data object may be stored in the first data store.
- metadata may be stored in the first data store without regard to whether the incoming data object's raw data was stored in the first data store or in any of the other available data stores.
- metadata may comprise associative information linking a particular metadata entry with the corresponding data object's raw data (e.g., location information in the form of a URL, path name, ID, etc.).
- the reference metadata may further include information about the type of data in the corresponding data object, size, name, owner, last modified time, access control rules, access statistics, etc.
- process 200 may then continue to block 260 where the process operates (read, move, delete, modify etc.) on raw data stored in the Nth data store in response to detecting an attempted operation on the associated metadata that was previously stored in the first data store.
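The flow of process 200 described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `DataStore` and `HybridSystem` classes, the `prefers` predicate, and the in-memory dictionaries are hypothetical stand-ins chosen only to show the order of blocks 220 through 260.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataStore:
    """Hypothetical in-memory stand-in for one connected data store."""
    name: str
    prefers: Callable[[dict], bool]  # True if this store suits the given characteristics
    raw: Dict[str, bytes] = field(default_factory=dict)


class HybridSystem:
    """Sketch of the routing flow; metadata always stays with the first data store."""

    def __init__(self, first_store: DataStore, others: List[DataStore]):
        self.first_store = first_store
        self.stores = {s.name: s for s in [first_store, *others]}
        self.metadata: Dict[str, dict] = {}

    def put(self, key: str, raw: bytes, content_type: str) -> str:
        # Block 220: determine characteristics of the incoming object.
        traits = {"size": len(raw), "content_type": content_type}
        # Block 230: identify the preferred store (fall back to the first store).
        target = next(
            (s for s in self.stores.values() if s.prefers(traits)),
            self.first_store,
        )
        # Block 240: route the raw data to the identified store.
        target.raw[key] = raw
        # Metadata goes to the first store regardless of where the raw data went.
        self.metadata[key] = {"store": target.name, **traits}
        return target.name

    def delete(self, key: str) -> None:
        # Block 260: an operation on the metadata drives the matching
        # operation on the raw data in whichever store holds it.
        entry = self.metadata.pop(key)
        del self.stores[entry["store"]].raw[key]
```

In this sketch the caller only ever talks to `HybridSystem`, which mirrors the idea that the external source interfaces with a single data store while the raw data may live elsewhere.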
- process 300 of FIG. 3 represents a more specific example of the process 200 of FIG. 2 in which the predefined characteristic is a size threshold value that functionally distinguishes small data objects from large data objects.
- the first characteristic could be represented as an object size range of 0 up to the threshold value
- the second characteristic could be represented as an object size range that begins at the threshold value up to a system-imposed maximum object size.
- the concept of what constitutes large data versus what constitutes small data may be system- or application-specific. As such, the invention is not predicated on particular data sizes. Rather, all that is required is a user- or system-definable characteristic, such as a threshold value, be used to differentiate when data will be treated as large data and when it will be treated as small data.
- process 300 begins at block 310 when a hybrid data management system (e.g., system 100 of FIGS. 1A-1B) receives a write request from an external source (e.g., application/user 110 of FIGS. 1A-1B), such as in the form of a ‘PUT’ object operation.
- the incoming object may have an associated content-length header that includes the overall size of the incoming object that is to be written to the database.
- a routing layer e.g., routing layer 140 of FIGS. 1A-1B
- the content-length may be compared to the predetermined, user-definable threshold value (block 320 ). If the content-length is less than (or even equal to) the threshold value, process 300 may continue to block 330 where the incoming data object is stored in the small data store.
- metadata for the incoming object may similarly be stored in the small data store.
- Metadata for the incoming larger object may be stored in the small data store at block 340 , while the underlying raw data of the object is passed to and stored in a large data store (block 350 ). It should additionally be appreciated that the metadata for the large object stored at block 340 may further include associative information indicating that the underlying raw data is in fact being stored in the large data store.
- a function call to the large data store may be used by the routing layer to determine which node(s) in the large data store should be written to.
- the raw data for the incoming object may be written to the large data store as a single file in a configured directory.
- each data part may be stored as a separate file. Changes to the threshold value may only affect newly-incoming data object requests, and not otherwise affect the location of already-stored objects.
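The size-based write path of process 300 can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary-backed stores, and the default threshold of 1,000,000 bytes are assumptions, and `len(raw)` stands in for the content-length header.

```python
THRESHOLD_BYTES = 1_000_000  # hypothetical, user-definable threshold


def route_by_size(content_length: int, threshold: int = THRESHOLD_BYTES) -> str:
    """Block 320: sizes up to and including the threshold go to the small store."""
    return "small" if content_length <= threshold else "large"


def put_object(key, raw, small_store, large_store, metadata,
               threshold=THRESHOLD_BYTES):
    """Store raw data by size; metadata is always recorded with the small store."""
    content_length = len(raw)  # stand-in for the Content-Length header
    location = route_by_size(content_length, threshold)
    if location == "small":
        small_store[key] = raw   # block 330
    else:
        large_store[key] = raw   # block 350
    # Block 340: the metadata records where the raw data actually lives.
    metadata[key] = {"location": location, "size": content_length}
    return location
```

Because the location is recorded in the metadata at write time, a later change to the threshold affects only new writes; previously stored objects are still found where their metadata says they are.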
- the large data store may be configured with a top-level directory which contains one or more group-level subdirectories for each of a number of defined groups. Within each group-level directory there may be one or more additional user-level subdirectories for each registered user. In each user-level subdirectory, an indirection layer may be used to designate or otherwise identify each stored object. By way of example, a 2-letter prefix of the hash value of the object in question may be used
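One way to picture the directory layout described above is the sketch below. The text does not say which hash function is used, what exactly is hashed, or what the final file name looks like, so SHA-1 over the object name and the use of the full digest as the file name are assumptions made purely for illustration; `object_path` is a hypothetical helper.

```python
import hashlib
from pathlib import PurePosixPath


def object_path(top: str, group: str, user: str, object_name: str) -> PurePosixPath:
    """Build top/group/user/<2-letter hash prefix>/<digest> for one object.

    Assumption: the hash is SHA-1 of the object name, and the full hex
    digest is used as the file name. Neither detail comes from the source.
    """
    digest = hashlib.sha1(object_name.encode("utf-8")).hexdigest()
    prefix = digest[:2]  # 2-letter indirection layer
    return PurePosixPath(top) / group / user / prefix / digest
```

Spreading objects across prefix subdirectories keeps any single user-level directory from accumulating an unmanageable number of entries.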
- the proxy layer may be configured to first retrieve the requested object's metadata stored in the small data store.
- the retrieved metadata will indicate if the requested object is in fact stored in the small data store, or has been stored in the large data store due to its size.
- the requested object may then be read from the identified location.
- the proxy layer may further be configured to first retrieve object's metadata stored in the small data store. Again, the retrieved metadata will indicate if the object is stored in the small data store or in the large data store. The identified object may then be deleted from the identified location.
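The read and delete paths just described both start with a metadata lookup in the small data store. A minimal sketch, assuming hypothetical dictionary-backed stores and a metadata entry with a `location` field (both names are illustrative, not from the source):

```python
def get_object(key, metadata, stores):
    """Read: consult the metadata first, then fetch from the indicated store."""
    entry = metadata[key]                  # metadata lives in the small data store
    return stores[entry["location"]][key]  # "small" or "large"


def delete_object(key, metadata, stores):
    """Delete: consult the metadata first, then remove from the indicated store."""
    entry = metadata.pop(key)
    del stores[entry["location"]][key]
```

The caller never needs to know which store holds the raw data; the metadata entry is the single source of that information.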
- when a node is added or removed, or its key range is changed, the raw data must be moved from one node to another.
- the data to be moved may first be identified by the associated key range, then the raw data may be streamed from the source node(s) to the destination node(s).
- the large data store e.g., large data store 130
- the associated metadata being stored by the small data store e.g., small data store 120
- the small data store may be similarly read and updated to reflect the moved data's new location.
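The migration steps above (identify the data by key range, stream it to the destination, then update the metadata) can be sketched as follows. The function name, the half-open string key range, and the `node` metadata field are assumptions for illustration; a real system would stream data between nodes rather than move dictionary entries.

```python
def migrate_key_range(lo, hi, source, dest, metadata, dest_name):
    """Move raw data with lo <= key < hi from source to dest, then fix metadata."""
    # Identify the data to be moved by its key range.
    moved = [k for k in sorted(source) if lo <= k < hi]
    for k in moved:
        # "Stream" the raw data from the source node to the destination node.
        dest[k] = source.pop(k)
        # Update the metadata in the small data store with the new location.
        metadata[k]["node"] = dest_name
    return moved
```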
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
- This application is a continuation of prior U.S. application Ser. No. 13/156,502, filed on Jun. 9, 2011, the disclosures of which are hereby incorporated by reference herein.
- The present invention relates generally to data management systems, and more particularly to an integrated hybrid data management system for more efficient managing of large and varying datasets.
- While the rise of the Internet has solved some data management problems, at the same time it has created some new ones as well. For example, many Internet applications, such as e-commerce, e-mail, and social media applications, have created a so-called ‘big data’ problem. The ‘big data’ problem results from the vast volumes of data, much of which is generated at very high velocities and with widely varying formats and lengths. In general, the term ‘big data’ refers to datasets that have grown so large that they are beyond the ability of commonly-used database management tools to capture, manage and process within a tolerable period of time. Such datasets can range from a few dozen terabytes to many petabytes of data, all within a single data set. Thus, ‘big data’ comprises billions of potentially non-uniform data objects that are generated daily, must be accessible at an instant, and yet must be stored reliably and cheaply for potentially long periods of time.
- A new class of distributed storage systems, called NoSQL or ‘big data’ databases, has recently emerged. Examples of such database management systems include HBase, Cassandra, MongoDB, Hibari®, etc. While such databases do not provide the richness of traditional SQL databases, they are very efficient in storing and retrieving large volumes of data in a relatively cheap and reliable manner. Such NoSQL-based systems are also readily scalable in that heterogeneous servers can be added at any time to networked server clusters, followed by the data being automatically rebalanced and distributed without disruption to service.
- However, in order to achieve such high performance and scalability, these NoSQL-based systems must be optimized for specific data types. For example, Cassandra is optimized to handle very fast writes of many small data items, but conversely performs relatively poorly when many large data items are written to the database. No prior art solution is optimal for vastly different data types.
- One potential solution would be to deploy different solutions for different data types; for example, store large data in a file system but keep small data objects in a NoSQL database. However, this approach is unsatisfactory since it multiplies the number of systems and software that must be maintained. Moreover, synchronizing usage across different databases is likely to be difficult, and enforcing a usage policy (say some bytes/second limit) for a user who happens to have both large and small data would require synchronizing two different systems in real time. It is also questionable if this approach would even function in a large scale ‘big data’ environment. This approach also does not readily scale to N systems since the management and synchronization overhead increases as N increases.
- Accordingly, there is a need for an integrated hybrid data management system which is capable of efficiently handling varying types of ‘big data.’
- Disclosed and claimed herein is a hybrid data storage management system for storing an incoming data object including metadata having first preferred predefined characteristics and raw data having second preferred predefined characteristics, the system comprising: a plurality of data stores including at least a first data store and a second data store different from the first data store, wherein each of the plurality of data stores is associated with a preferred data store type corresponding to a type of data store whose storage method permits the associated data store to operate more efficiently on data having preferred predefined characteristics associated with the data store type than on data not having the preferred predefined characteristics, and wherein the first data store is a first preferred data store having a first preferred data store type corresponding to first preferred predefined characteristics, and the second data store is a second preferred data store having a second preferred data store type corresponding to second preferred predefined characteristics; and a routing layer coupled to the plurality of data stores, wherein the routing layer is configured to: receive, from an external source a write request for the incoming data object; determine that the metadata and the raw data of the incoming data object have the first and second preferred predefined characteristics, respectively; identify the first and second preferred data store types corresponding to the metadata and the raw data, respectively, based on the first and second preferred predefined characteristics; route the raw data to the second data store for storage therein based on the raw data having been identified as corresponding to the second data store type; and store the metadata in the first data store based on the metadata having been identified as corresponding to the first data store type, and without regard to which of the plurality of data stores is the second preferred data store corresponding to the raw data of the incoming data object, the metadata including associative information linking the metadata with the corresponding raw data in the second data store, wherein the first data store and the second data store utilize different storage methods, such that, by virtue of a first data store storage method, the first data store operates more efficiently on the metadata than the second data store would, and, by virtue of a second data store storage method, the second data store operates more efficiently on the raw data than the first data store would.
- Other aspects, features, and techniques of the invention will be apparent to one skilled in the relevant art in view of the following description of the exemplary embodiments of the invention.
- The features, objects, and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings in which like reference characters identify correspondingly throughout and wherein:
- FIGS. 1A-1B illustrate two exemplary hybrid database management systems for managing large and varying datasets, in accordance with the principles of the invention;
- FIG. 2 is a process for implementing a hybrid database management system for managing large and varying datasets, in accordance with the principles of the invention; and
- FIG. 3 is one example of the process of FIG. 2 above for implementing a hybrid database management system that more efficiently stores and manages both small and large datasets.
- The present disclosure relates generally to a hybrid data management/storage system which is comprised of two or more integrated data management systems. Metadata is used to link the data in a first data management system (e.g., small data store, such as NoSQL database) with the data in at least one additional connected data management system (e.g., large data store, such as an OS file system).
- The above metadata linkage may allow the first data management system to share all the same algorithms for data management in a distributed system, e.g., partitioning, replication, migration (in the case of scale-out), repair (in the case of recovery from a failure), backup, etc., with the one or more connected additional data management systems. The hybrid system may therefore leverage the benefits of each, since different types of data management systems may tend to operate more efficiently on certain types/sizes of data than on other types/sizes, but without the complexities of using different data management algorithms in each type of system. To that end, another aspect of the invention is to determine one or more characteristics of an incoming data object and, based on the presence of such characteristics, direct the underlying raw data of the incoming object to the one of the connected database management systems that is most suited or configured for the storage and management of such data. For example, some data stores may contain additional indexing/searching functionality that is specific to a particular data type and, therefore, could store and manage such data in a more efficient manner than other data stores.
- It should further be appreciated that raw data of an incoming object may be simultaneously stored in more than one data store. For example, an email object might be stored in one store optimized for mail store and retrieval, and in another store simultaneously that is optimized for indexing and searching. The metadata corresponding to such data may then reflect the linkage to both such data stores.
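By way of a non-limiting illustration, metadata that links one object to raw data held in more than one data store might be sketched as follows. The class names, field names, store names, and reference strings are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Location:
    """One place where a copy of the raw data lives (hypothetical structure)."""
    store: str  # name of the data store holding this copy
    ref: str    # store-specific reference, e.g. a path name, URL, or ID


@dataclass
class ObjectMetadata:
    """Metadata entry kept in the first data store for one object."""
    key: str
    content_type: str
    locations: List[Location] = field(default_factory=list)

    def link(self, store: str, ref: str) -> None:
        # Record that a copy of the raw data is held in the named store.
        self.locations.append(Location(store, ref))


# An email object stored in two stores at once: one assumed to be tuned for
# mail storage and retrieval, the other for indexing and searching.
meta = ObjectMetadata(key="msg-001", content_type="message/rfc822")
meta.link("mail_store", "mail/msg-001")
meta.link("search_store", "index/msg-001")
```

Keeping a list of locations, rather than a single one, is one way to let a single metadata entry reflect the linkage to both stores.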
- With respect to the data characteristics that may be considered, such characteristics may include the data object's size and/or data type (media data versus text data). Additional characteristics may include access pattern information corresponding to the access pattern or quality of service for the data object, such as the fact that the data object will be rarely accessed or modified, versus frequently modified, or that the incoming data object is from a user that is on a different price plan so it can be stored in a slower/cheaper data store. Additional object characteristics may further include strong or weak consistency (write to return after all replicas are updated, or after some replicas are updated), as well as remote versus local (application specifies that this data object must be replicated to at least one remote data center).
- By way of a non-limiting example, some databases (e.g., MongoDB) may be better optimized for the storage of documents having a particular structure (e.g., JavaScript Object Notation). Such a database may tend to perform less efficiently when storing data that is not of the preferred structure. In such cases, the present invention would allow such a database to receive the data object, recognize that it is of an undesirable structure, and pass it to a separate, linked database that is more suited for the type of received data object.
- Another aspect of the invention is to enable an external source, such as an application or user, to interface with only a single data store. This initial or first data store may use the same metadata structure for all data objects, while selectively storing the underlying raw data forming the objects in one of the plurality of connected data management systems. In this fashion, the management functions from this first data store (where all metadata is stored since it is of a data size that is most suitable for storage in the first data store) can be used to manage a plurality of additional connected data stores, rather than having each of the separate data stores rely on their own management functions. Since the plurality of connected data stores are able to piggyback off of the management functions of the first data store, the only data store to which the user/application interfaces, the overall complexity of the system can be greatly reduced.
- As used herein, the terms “a” or “an” shall mean one or more than one. The term “plurality” shall mean two or more than two. The term “another” is defined as a second or more. The terms “including” and/or “having” are open ended (e.g., comprising). Reference throughout this document to “one embodiment”, “certain embodiments”, “an embodiment” or similar term means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner on one or more embodiments without limitation. The term “or” as used herein is to be interpreted as inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C”. An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.
- In accordance with the practices of persons skilled in the art of computer programming, the invention is described below with reference to operations that are performed by a computer system or a like electronic system. Such operations are sometimes referred to as being computer-executed. It will be appreciated that operations that are symbolically represented include the manipulation by a processor, such as a central processing unit, of electrical signals representing data bits and the maintenance of data bits at memory locations, such as in system memory, as well as other processing of signals. The memory locations where data bits are maintained are physical locations that have particular electrical, magnetic, optical, or organic properties corresponding to the data bits.
- When implemented in software, the elements of the invention are essentially the code segments to perform the necessary tasks. The code segments can be stored in a processor readable medium or transmitted by a computer data signal. The “processor readable medium” may include any medium that can store information. Examples of the processor readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory or other non-volatile memory, a floppy diskette, a CD-ROM, an optical disk, a hard disk, etc.
- The term “server” means a functionally-related group of electrical components (e.g., processor, memory, network interface, etc.) such as a computer system in a networked environment which may include both hardware and software components, or alternatively only the software components that, when executed, carry out certain functions. The “server” may be further integrated with one or more database management systems which comprise one or more associated databases.
- The term “database management system” means one or more computer programs that control the creation, maintenance, and use of an integrated collection of data records, files, and other data objects which are stored on processor readable media. The database management system is usable by external sources (e.g., applications/users) to access and manipulate the stored data, as well as to enforce data integrity and security, manage concurrent accesses, and recover/restore data.
- FIG. 1A depicts one embodiment of a hybrid data management system 100 configured to implement one or more aspects of the invention. In the example of FIG. 1A, the system 100 is comprised of one or more servers that are accessible by an external source in the form of application/user 110, which is configured to interface with a first database management system, the first data store 120, which may be a NoSQL-based database system. Moreover, the first data store 120 may be similarly configured to communicate with a second database management system, the second data store 130, which may comprise a file system that is optimized for storage of large data blocks (e.g., Linux ext3, Linux ext4, Hadoop Distributed File System, etc.). While the hybrid data management system 100 is comprised of only two data stores, it should be appreciated that N additional data stores may be similarly connected to the first data store 120, such as is the case in the exemplary system described below with reference to FIG. 1B. In any event, such file systems may be any local file system having basic data management features for distribution, replication, etc., as well as simple block data storage systems such as CD-ROM, DVD, magnetic tape system, etc.
- The first data store 120 may be configured to provide built-in functionality for data partitioning, automatic replication, incremental backups, node expansion, quorum calculation, etc. It should be appreciated that the first data store 120 may be implemented as a NoSQL-type database such that, for smaller data sizes, it is able to provide higher performance due, for example, to the efficient write and read paths using a write-ahead log, in-memory cache, and other features.
- One aspect of the invention is to recognize that data objects having certain characteristics, such as the size or type of the data objects, may be more efficiently stored and managed by a different type of data store, such as the second (large) data store 130. In this fashion, the invention provides a hybrid database solution which improves overall performance by storing certain data objects (e.g., small data objects) in the first data store 120 (e.g., a NoSQL-type database), while moving the raw data of other types of data objects (e.g., large data objects) into the second data store 130 (e.g., file system).
- When the data object characteristic under consideration is the object's size, it should be appreciated that the threshold of what comprises small data objects versus large data objects may depend, at least in part, on the particular system hardware and workload. Thus, the threshold may be a user-definable property of the system 100. For example, typical databases for storing and managing social networking data tend to be optimized for data objects in the 1K to 10K range. For such systems, a data object that is greater than 100K may be considered large. Similarly, databases optimized for email data tend to store data in the 1K to 100K range as a normal case. For such databases, a data object of greater than 1M may be considered “large”, such as those emails with large file attachments. As technology evolves, these ranges are likely to evolve as well. However, the problems of attempting to store and manage disparate data on a particular database will likely persist.
- In certain embodiments, the first data store 120 may further be configured to determine how particular incoming data (e.g., from application/user 110) should be stored in the envisioned distributed manner. By way of a non-limiting example, a routing layer 140 may first determine, based on a comparison of the size of an incoming data object to a threshold value, that the incoming data object should be stored in the first data store 120, or alternatively in a file system that comprises the second data store 130. Although the routing layer 140 may be implemented as a proxy layer, it may equally be implemented in numerous other forms of decision logic, either in the form of software, hardware or a combination thereof.
- If the routing layer 140 determines that a particular incoming data object has one or more particular characteristics for which the first data store 120 is preferably configured (e.g., smaller than a predefined threshold value), then the incoming object would be stored with the local objects 150. Metadata for the local data objects 150 may similarly be stored as separate metadata 160, as shown in FIG. 1A. While in certain embodiments the metadata for the local data objects 150 may be stored separately as metadata 160, in other embodiments such metadata may be stored together with the underlying raw data, as local data objects 150. Such metadata may be referred to as local-object metadata since the information corresponds to an object which has been stored locally, i.e., in the first data store.
- If, however, the routing layer 140 determines that the particular incoming data object has one or more particular characteristics for which the second data store 130 is better configured (e.g., larger than the predefined threshold value), then the raw data of the incoming object would be passed to the second data store 130 and stored as raw data 170.
- Additionally, metadata for each such incoming data object whose raw data is stored in the second data store 130 (as raw data 170) may nonetheless be stored by the first data store 120 as metadata 160. Such metadata may be referred to as remote-object metadata since the information corresponds to an object which has been stored remotely, i.e., in the second data store. Moreover, such metadata 160 may comprise descriptive information for such large data objects, and may further include associative information that links a particular metadata entry with the corresponding raw data 170 to which it pertains. Each such metadata entry may include, for example, content-type, access control list, etc. The size of the metadata per object should preferably be small, such as on the order of a few hundred bytes.
- The above-referenced remote-object metadata may be described as a placeholder object such that, when management and/or access operations are performed on it in the first data store 120, the data management system 100 automatically undertakes a corresponding operation on the associated raw data in whichever data store it is stored. In this fashion, only the data management functions of the first data store 120 need be used. However, it may be the case that utilizing the data management functions of the first data store 120 may result in some unintended negative impact on performance. For example, in the event that the first data store 120 writes all updates to data objects in a RAM and disk cache, the RAM and disk cache may fill up quickly when many big data objects are to be written. In such cases, it is a further aspect of the invention to selectively bypass the data management function of the first data store 120, and instead directly utilize the corresponding functions (i.e., read/write) of the particular data store at issue.
- With reference now to FIG. 1B, depicted is another example of the hybrid data management system 100 configured with a plurality of data stores 130₁₋ₙ, in addition to the first data store 120 with which the application/user 110 interfaces directly. All of the description set forth above with respect to FIG. 1A is hereby incorporated and applicable to the example of FIG. 1B.
- As with the example of FIG. 1A above, the first data store 120 may be configured to provide built-in functionality for data partitioning, automatic replication, incremental backups, node expansion, quorum calculation, and may be implemented as a NoSQL-type database such that, for smaller data sizes, it is able to provide higher performance.
- As described above, the first data store 120 may further be configured to determine how particular incoming data (e.g., from application/user 110) should be stored based on the incoming data object having one or more recognized characteristics. As such, the routing layer 140 may be configured to first determine certain characteristics of the incoming data object. The routing layer may then determine which of the available data stores should be used to store the raw data for the incoming object. As previously mentioned, this determination may be based on identifying which of the available data stores is best configured to store and manage data having the one or more determined characteristics.
- Additionally, metadata corresponding to the incoming data object may then be stored preferably in the first data store 120 and without regard to which of the plurality of data stores (e.g., first data store 120 and plurality of additional data stores 130₁₋ₙ) was used to store the corresponding raw data.
- While FIGS. 1A-1B set forth two exemplary system configurations for implementing hybrid data management systems in accordance with the principles of the invention, it should further be appreciated that other known or obvious design variations are equally envisioned and within the scope of the disclosure.
- Referring now to FIG. 2, depicted is one embodiment of a process for managing large and varying datasets, in accordance with the principles of the invention. In particular, process 200 begins at block 210 when a hybrid data management system (e.g., system 100 of FIGS. 1A-1B) receives a write request from an external source (e.g., application/user 110 of FIGS. 1A-1B), such as in the form of a ‘PUT’ object operation. It should be appreciated that such object operations may be in any protocol, such as S3 or HTTP.
- The incoming object may have one or more predetermined characteristics which may be detected/determined before the object is written/stored by the system. To that end, at block 220 of process 200 a determination may be made as to which of N possible predefined characteristics the incoming data object may have. In certain embodiments, one of the predefined characteristics may be a particular size range, data type, frequency or pattern of accesses/modifications, quality of service, etc.
- Once it is determined which of the predetermined N characteristics are present in the incoming data, process 200 may continue to block 230 where the process may then identify which of a plurality of connected data stores (e.g., first data store 120, plurality of data stores 130₁₋ₙ, etc.) would be preferable for storage of the incoming data object's raw data. In certain embodiments, this determination may be based on identifying which of the available data stores is more optimally configured (optimized) to store data exhibiting the determined characteristic(s) from block 220.
- Process 200 may then continue to block 240 where the incoming data object's raw data may then be routed to the identified preferred available data store. Then, at block 250, metadata associated with the incoming data object may be stored in the first data store. In certain embodiments, such metadata may be stored in the first data store without regard to whether the incoming data object's raw data was stored in the first data store or in any of the other available data stores. Such metadata may comprise associative information linking a particular metadata entry with the corresponding data object's raw data (e.g., location information in the form of a URL, path name, ID, etc.). Additionally, the reference metadata may further include information about the type of data in the corresponding data object, size, name, owner, last modified time, access control rules, access statistics, etc.
- Continuing to refer to FIG. 2, process 200 may then continue to block 260 where the process operates (read, move, delete, modify, etc.) on raw data stored in the Nth data store in response to detecting an attempted operation on the associated metadata that was previously stored in the first data store. In this fashion, ‘big data’ of varying types can be more efficiently stored and managed.
- Referring now to FIG. 3, depicted is a particular embodiment of a process for managing large and varying datasets, in accordance with the principles of the invention. Specifically, process 300 of FIG. 3 represents a more specific example of the process 200 of FIG. 2 in which the predefined characteristic is a size threshold value that functionally distinguishes small data objects from large data objects. For example, the first characteristic could be represented as an object size range of 0 up to the threshold value, while the second characteristic could be represented as an object size range that begins at the threshold value up to a system-imposed maximum object size.
- It should be appreciated that the concept of what constitutes large data versus what constitutes small data may be system- or application-specific. As such, the invention is not predicated on particular data sizes. Rather, all that is required is that a user- or system-definable characteristic, such as a threshold value, be used to differentiate when data will be treated as large data and when it will be treated as small data.
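By way of a non-limiting illustration, the determination of which predefined characteristic an incoming object has, and the resulting choice of data store, might be sketched as follows. The names `IncomingObject`, `make_size_rules`, `choose_store`, `small_store`, and `large_store` are hypothetical. The sketch assumes that an object whose size exactly equals the threshold is treated as small, which is one of the options the description of FIG. 3 allows.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class IncomingObject:
    key: str
    size: int          # bytes, e.g. taken from a content-length header
    content_type: str


# A rule pairs a predicate (one "predefined characteristic") with the name
# of the data store assumed to be preferred for objects that satisfy it.
Rule = Tuple[Callable[[IncomingObject], bool], str]


def make_size_rules(threshold: int, max_size: int) -> List[Rule]:
    """Build two size-range rules: [0, threshold] and (threshold, max_size]."""
    return [
        (lambda o: 0 <= o.size <= threshold, "small_store"),
        (lambda o: threshold < o.size <= max_size, "large_store"),
    ]


def choose_store(obj: IncomingObject, rules: List[Rule]) -> str:
    """Return the store name of the first rule whose predicate matches."""
    for has_characteristic, store_name in rules:
        if has_characteristic(obj):
            return store_name
    raise ValueError(f"no data store configured for object {obj.key!r}")


# Example: a 100K threshold with an assumed 5M system-imposed maximum.
rules = make_size_rules(threshold=100_000, max_size=5_000_000)
```

Because each characteristic is an ordinary predicate, rules for data type or access pattern could be added to the same list without changing `choose_store`.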
- Similar to process 200 described above, process 300 begins at block 310 when a hybrid data management system (e.g., system 100 of FIGS. 1A-1B) receives a write request from an external source (e.g., application/user 110 of FIGS. 1A-1B), such as in the form of a ‘PUT’ object operation. The incoming object may have an associated content-length header that includes the overall size of the incoming object that is to be written to the database. At a routing layer (e.g., routing layer 140 of FIGS. 1A-1B), for example, the content-length may be compared to the predetermined, user-definable threshold value (block 320). If the content-length is less than (or even equal to) the threshold value, process 300 may continue to block 330 where the incoming data object is stored in the small data store. In certain embodiments, metadata for the incoming object may similarly be stored in the small data store.
- If, on the other hand, it is determined at block 320 that the content-length in fact exceeds the threshold, then metadata for the incoming larger object may be stored in the small data store at block 340, while the underlying raw data of the object is passed to and stored in a large data store (block 350). It should additionally be appreciated that the metadata for the large object stored at block 340 may further include associative information indicating that the underlying raw data is in fact being stored in the large data store.
- A function call to the large data store may be used by the routing layer to determine which node(s) in the large data store should be written to. In certain embodiments, the raw data for the incoming object may be written to the large data store as a single file in a configured directory. However, in the case of multi-part data uploads to the hybrid database system, it should be appreciated that each data part may be stored as a separate file. Changes to the threshold value may only affect newly-incoming data object requests, and not otherwise affect the location of already-stored objects.
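By way of a non-limiting illustration, the write path of process 300 might be sketched as follows. The sketch assumes an in-memory dictionary standing in for the small data store and a local directory standing in for the large data store; the class name `HybridStore`, its fields, and the hex-encoded file naming are hypothetical.

```python
import os
import tempfile
from typing import Dict


class HybridStore:
    """Toy model: a dict plays the small data store, a directory the large one."""

    def __init__(self, large_dir: str, threshold: int) -> None:
        self.large_dir = large_dir
        self.threshold = threshold               # user-definable size threshold
        self.small_objects: Dict[str, bytes] = {}
        self.metadata: Dict[str, dict] = {}      # always kept in the small store

    def put(self, key: str, data: bytes, content_type: str = "application/octet-stream") -> dict:
        size = len(data)                         # stands in for the content-length header
        meta = {"size": size, "content_type": content_type}
        if size <= self.threshold:
            # Small object: raw data stays in the small store.
            self.small_objects[key] = data
            meta["location"] = {"store": "small"}
        else:
            # Large object: write the raw data as a single file and keep only
            # a placeholder entry (with its location) in the small store.
            # Hex-encoding the key avoids path separators in file names.
            path = os.path.join(self.large_dir, key.encode("utf-8").hex())
            with open(path, "wb") as f:
                f.write(data)
            meta["location"] = {"store": "large", "path": path}
        self.metadata[key] = meta
        return meta


# Usage: with a 10-byte threshold, an 11-byte object goes to the large store.
with tempfile.TemporaryDirectory() as large_dir:
    store = HybridStore(large_dir, threshold=10)
    small_meta = store.put("note", b"12345")
    large_meta = store.put("video", b"x" * 11)
```

Since the location is recorded per object at write time, a later change to `threshold` affects only new writes, in line with the behavior described above.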
- With respect to the large data store, it may be preferable to avoid writing the large data objects to a single directory since the number of files may be relatively large, depending of course on how low the threshold value has been set. In order to maximize performance, the large data store may be configured with a top-level directory which contains one or more group-level subdirectories for each of a number of defined groups. Within each group-level directory there may be one or more additional user-level subdirectories for each registered user. In each user-level subdirectory, an indirection layer may be used to designate or otherwise identify each stored object. By way of example, a 2-letter prefix of the hash value of the object in question may be used.
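By way of a non-limiting illustration, such a directory layout might be computed as follows. The disclosure does not name a hash function or say what is hashed; the sketch assumes SHA-256 over the object key, and the function name `large_object_path` is hypothetical.

```python
import hashlib
import os


def large_object_path(top_dir: str, group: str, user: str, object_key: str) -> str:
    """Build top_dir/group/user/<2-letter hash prefix>/<full hash>."""
    # Assumption: the hash is SHA-256 of the object key, written in hex.
    digest = hashlib.sha256(object_key.encode("utf-8")).hexdigest()
    prefix = digest[:2]  # two hex characters give at most 256 subdirectories
    return os.path.join(top_dir, group, user, prefix, digest)


path = large_object_path("/data/large", "group-a", "alice", "photos/cat.jpg")
```

With a hexadecimal digest, the 2-letter prefix spreads each user's objects over at most 256 subdirectories, which keeps any single directory from growing too large.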
- In the event that a ‘GET’ object request is received by the hybrid database system, the proxy layer may be configured to first retrieve the requested object's metadata stored in the small data store. The retrieved metadata will indicate if the requested object is in fact stored in the small data store, or has been stored in the large data store due to its size. The requested object may then be read from the identified location.
- In the event that a ‘DELETE’ object request is received by the hybrid database system, the proxy layer may further be configured to first retrieve the object's metadata stored in the small data store. Again, the retrieved metadata will indicate if the object is stored in the small data store or in the large data store. The identified object may then be deleted from the identified location.
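By way of a non-limiting illustration, the ‘GET’ and ‘DELETE’ handling described above might be sketched as follows. The sketch assumes that metadata entries are dictionaries carrying a `location` field, that small objects live in an in-memory dictionary, and that large objects are files; the function names and the metadata layout are hypothetical.

```python
import os
from typing import Dict


def get_object(key: str, metadata: Dict[str, dict], small_objects: Dict[str, bytes]) -> bytes:
    """Look up the metadata first, then read from wherever it points."""
    location = metadata[key]["location"]   # raises KeyError for unknown keys
    if location["store"] == "small":
        return small_objects[key]
    with open(location["path"], "rb") as f:
        return f.read()


def delete_object(key: str, metadata: Dict[str, dict], small_objects: Dict[str, bytes]) -> None:
    """Remove the metadata entry and the raw data it refers to."""
    location = metadata.pop(key)["location"]
    if location["store"] == "small":
        del small_objects[key]
    else:
        os.remove(location["path"])
```

In both cases the caller deals only with the metadata held in the small data store; the location of the raw data is an internal detail resolved from that entry.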
- It should further be appreciated that when a node is added, removed or its key range changed, the raw data must be moved from one node to another. In the case of at least some NoSQL-type database systems, the data to be moved may first be identified by the associated key range, then the raw data may be streamed from the source node(s) to the destination node(s). In the event that the raw data to be moved is being stored by the large data store (e.g., large data store 130), the associated metadata being stored by the small data store (e.g., small data store 120) may be similarly read and updated to reflect the moved data's new location.
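By way of a non-limiting illustration, moving raw data held by the large data store and updating the associated metadata might be sketched as follows. The sketch models each node of the large data store as a local directory, which is an assumption made only for illustration; the function name `move_large_object` and the metadata layout are hypothetical.

```python
import os
import shutil
from typing import Dict


def move_large_object(key: str, metadata: Dict[str, dict], dest_node_dir: str) -> str:
    """Move an object's raw data to another node and record its new location."""
    location = metadata[key]["location"]
    if location["store"] != "large":
        raise ValueError(f"object {key!r} is not held by the large data store")
    os.makedirs(dest_node_dir, exist_ok=True)
    new_path = os.path.join(dest_node_dir, os.path.basename(location["path"]))
    shutil.move(location["path"], new_path)   # transfer the raw data
    location["path"] = new_path               # update the metadata entry in place
    return new_path
```

Updating the metadata only after the move has succeeded means a failed transfer leaves the entry pointing at the original, still valid location.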
- While the invention has been described in connection with various embodiments, it should be understood that the invention is capable of further modifications. This application is intended to cover any variations, uses or adaptation of the invention following, in general, the principles of the invention, and including such departures from the present disclosure as come within the known and customary practice within the art to which the invention pertains.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/182,498 US9672267B2 (en) | 2011-06-09 | 2016-06-14 | Hybrid data management system and method for managing large, varying datasets |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/156,502 US9396290B2 (en) | 2011-06-09 | 2011-06-09 | Hybrid data management system and method for managing large, varying datasets |
US15/182,498 US9672267B2 (en) | 2011-06-09 | 2016-06-14 | Hybrid data management system and method for managing large, varying datasets |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/156,502 Continuation US9396290B2 (en) | 2011-06-09 | 2011-06-09 | Hybrid data management system and method for managing large, varying datasets |
Publications (2)
Publication Number | Publication Date |
---|---|
US20160292255A1 (en) | 2016-10-06 |
US9672267B2 (en) | 2017-06-06 |
Family
ID=47294053
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/156,502 Active 2032-06-11 US9396290B2 (en) | 2011-06-09 | 2011-06-09 | Hybrid data management system and method for managing large, varying datasets |
US15/182,498 Active US9672267B2 (en) | 2011-06-09 | 2016-06-14 | Hybrid data management system and method for managing large, varying datasets |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/156,502 Active 2032-06-11 US9396290B2 (en) | 2011-06-09 | 2011-06-09 | Hybrid data management system and method for managing large, varying datasets |
Country Status (4)
Country | Link |
---|---|
US (2) | US9396290B2 (en) |
EP (1) | EP2718858A4 (en) |
JP (1) | JP2012256324A (en) |
WO (1) | WO2013106079A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109600440A (en) * | 2018-12-13 | 2019-04-09 | 国网河北省电力有限公司石家庄供电分公司 | A kind of electric power sale big data processing method |
CN109885577A (en) * | 2019-03-11 | 2019-06-14 | Oppo广东移动通信有限公司 | Data processing method, device, terminal and storage medium |
US11513704B1 (en) | 2021-08-16 | 2022-11-29 | International Business Machines Corporation | Selectively evicting data from internal memory during record processing |
US11675513B2 (en) * | 2021-08-16 | 2023-06-13 | International Business Machines Corporation | Selectively shearing data when manipulating data during record processing |
Families Citing this family (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9087154B1 (en) | 2011-12-12 | 2015-07-21 | Crashlytics, Inc. | System and method for providing additional functionality to developer side application in an integrated development environment |
US9703680B1 (en) | 2011-12-12 | 2017-07-11 | Google Inc. | System and method for automatic software development kit configuration and distribution |
US9262250B2 (en) | 2011-12-12 | 2016-02-16 | Crashlytics, Inc. | System and method for data collection and analysis of information relating to mobile applications |
US9747128B1 (en) * | 2011-12-21 | 2017-08-29 | EMC IP Holding Company LLC | Worldwide distributed file system model |
US9286303B1 (en) * | 2011-12-22 | 2016-03-15 | Emc Corporation | Unified catalog service |
US9489233B1 (en) * | 2012-03-30 | 2016-11-08 | EMC IP Holding Company, LLC | Parallel modeling and execution framework for distributed computation and file system access |
US9053117B2 (en) * | 2012-04-11 | 2015-06-09 | 4Clicks Solutions, LLC | Storing application data with a unique ID |
US10044522B1 (en) * | 2012-08-21 | 2018-08-07 | Amazon Technologies Inc. | Tree-oriented configuration management service |
WO2014031618A2 (en) | 2012-08-22 | 2014-02-27 | Bitvore Corp. | Data relationships storage platform |
US9547682B2 (en) * | 2012-08-22 | 2017-01-17 | Bitvore Corp. | Enterprise data processing |
US9323767B2 (en) * | 2012-10-01 | 2016-04-26 | Longsand Limited | Performance and scalability in an intelligent data operating layer system |
WO2014133494A1 (en) * | 2013-02-27 | 2014-09-04 | Hitachi Data Systems Corporation | Multiple collections of user-defined metadata for self-describing objects |
US10078683B2 (en) | 2013-07-02 | 2018-09-18 | Jpmorgan Chase Bank, N.A. | Big data centralized intelligence system |
US10019483B2 (en) | 2013-07-30 | 2018-07-10 | Hitachi, Ltd. | Search system and search method |
US9355118B2 (en) | 2013-11-15 | 2016-05-31 | International Business Machines Corporation | System and method for intelligently categorizing data to delete specified amounts of data based on selected data characteristics |
GB2524074A (en) | 2014-03-14 | 2015-09-16 | Ibm | Processing data sets in a big data repository |
CN105205082A (en) * | 2014-06-27 | 2015-12-30 | 国际商业机器公司 | Method and system for processing file storage in HDFS |
US9767119B2 (en) | 2014-12-31 | 2017-09-19 | Netapp, Inc. | System and method for monitoring hosts and storage devices in a storage system |
US10127293B2 (en) | 2015-03-30 | 2018-11-13 | International Business Machines Corporation | Collaborative data intelligence between data warehouse models and big data stores |
US10318491B1 (en) | 2015-03-31 | 2019-06-11 | EMC IP Holding Company LLC | Object metadata query with distributed processing systems |
US11016946B1 (en) * | 2015-03-31 | 2021-05-25 | EMC IP Holding Company LLC | Method and apparatus for processing object metadata |
US9787772B2 (en) * | 2015-05-19 | 2017-10-10 | Netapp, Inc. | Policy based alerts for networked storage systems |
US10133759B1 (en) * | 2015-09-15 | 2018-11-20 | Amazon Technologies, Inc. | System for determining storage or output of data objects |
US10762069B2 (en) * | 2015-09-30 | 2020-09-01 | Pure Storage, Inc. | Mechanism for a system where data and metadata are located closely together |
US10423586B2 (en) | 2016-03-17 | 2019-09-24 | Wipro Limited | Method and system for synchronization of relational database management system to non-structured query language database |
US10671636B2 (en) * | 2016-05-18 | 2020-06-02 | Korea Electronics Technology Institute | In-memory DB connection support type scheduling method and system for real-time big data analysis in distributed computing environment |
US10572506B2 (en) * | 2017-03-07 | 2020-02-25 | Salesforce.Com, Inc. | Synchronizing data stores for different size data objects |
US10817203B1 (en) | 2017-08-29 | 2020-10-27 | Amazon Technologies, Inc. | Client-configurable data tiering service |
US11151081B1 (en) | 2018-01-03 | 2021-10-19 | Amazon Technologies, Inc. | Data tiering service with cold tier indexing |
US10579597B1 (en) | 2018-01-09 | 2020-03-03 | Amazon Technologies, Inc. | Data-tiering service with multiple cold tier quality of service levels |
US10592139B2 (en) * | 2018-05-30 | 2020-03-17 | EMC IP Holding Company LLC | Embedded object data storage determined by object size information |
US11269688B2 (en) * | 2018-12-18 | 2022-03-08 | EMC IP Holding Company LLC | Scaling distributed computing system resources based on load and trend |
US11221782B1 (en) | 2019-03-27 | 2022-01-11 | Amazon Technologies, Inc. | Customizable progressive data-tiering service |
US11494611B2 (en) | 2019-07-31 | 2022-11-08 | International Business Machines Corporation | Metadata-based scientific data characterization driven by a knowledge database at scale |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6324581B1 (en) | 1999-03-03 | 2001-11-27 | Emc Corporation | File server system using file system storage, data movers, and an exchange of meta data among data movers for file locking and direct access to shared file systems |
CA2458908A1 (en) | 2001-08-31 | 2003-03-13 | Arkivio, Inc. | Techniques for storing data based upon storage policies |
US7177883B2 (en) * | 2004-07-15 | 2007-02-13 | Hitachi, Ltd. | Method and apparatus for hierarchical storage management based on data value and user interest |
US8600948B2 (en) * | 2005-09-15 | 2013-12-03 | Emc Corporation | Avoiding duplicative storage of managed content |
US7716180B2 (en) * | 2005-12-29 | 2010-05-11 | Amazon Technologies, Inc. | Distributed storage system with web services client interface |
US7743023B2 (en) | 2006-02-01 | 2010-06-22 | Microsoft Corporation | Scalable file replication and web-based access |
JP5584910B2 (en) | 2006-05-23 | 2014-09-10 | ノーリャン・ホールディング・コーポレイション | Distributed storage |
US20080021865A1 (en) | 2006-07-20 | 2008-01-24 | International Business Machines Corporation | Method, system, and computer program product for dynamically determining data placement |
US8701010B2 (en) * | 2007-03-12 | 2014-04-15 | Citrix Systems, Inc. | Systems and methods of using the refresh button to determine freshness policy |
US20100313044A1 (en) * | 2009-06-03 | 2010-12-09 | Microsoft Corporation | Storage array power management through i/o redirection |
US20100333116A1 (en) | 2009-06-30 | 2010-12-30 | Anand Prahlad | Cloud gateway system for managing data storage to cloud storage sites |
US20110072489A1 (en) | 2009-09-23 | 2011-03-24 | Gilad Parann-Nissany | Methods, devices, and media for securely utilizing a non-secured, distributed, virtualized network resource with applications to cloud-computing security and management |
- 2011-06-09: US application US13/156,502, published as US9396290B2 (en), status: Active
- 2012-06-06: JP application JP2012128578A, published as JP2012256324A (en), status: Pending
- 2012-06-08: EP application EP12865079.3A, published as EP2718858A4 (en), status: Withdrawn
- 2012-06-08: WO application PCT/US2012/041518, published as WO2013106079A1 (en), status: Application Filing
- 2016-06-14: US application US15/182,498, published as US9672267B2 (en), status: Active
Also Published As
Publication number | Publication date |
---|---|
US20120317155A1 (en) | 2012-12-13 |
EP2718858A4 (en) | 2015-08-05 |
US9672267B2 (en) | 2017-06-06 |
US9396290B2 (en) | 2016-07-19 |
JP2012256324A (en) | 2012-12-27 |
WO2013106079A1 (en) | 2013-07-18 |
EP2718858A1 (en) | 2014-04-16 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US9672267B2 (en) | Hybrid data management system and method for managing large, varying datasets | |
US12061623B2 (en) | Selective synchronization of content items in a content management system | |
US10430398B2 (en) | Data storage system having mutable objects incorporating time | |
US11789976B2 (en) | Data model and data service for content management system | |
US20220377112A1 (en) | Data loss prevention (dlp) for cloud resources via metadata analysis | |
US8510499B1 (en) | Solid state drive caching using memory structures to determine a storage space replacement candidate | |
KR102119258B1 (en) | Technique for implementing change data capture in database management system | |
WO2011108021A1 (en) | File level hierarchical storage management system, method, and apparatus | |
CA2910211A1 (en) | Object storage using multiple dimensions of object information | |
KR20200056357A (en) | Technique for implementing change data capture in database management system | |
US9910968B2 (en) | Automatic notifications for inadvertent file events | |
US8090925B2 (en) | Storing data streams in memory based on upper and lower stream size thresholds | |
CN111796767B (en) | Distributed file system and data management method | |
US20240330259A1 (en) | Data model and data service for content management system | |
US11803652B2 (en) | Determining access changes | |
CN116561358A (en) | Unified 3D scene data file storage and retrieval method based on hbase | |
US11799958B2 (en) | Evaluating access based on group membership | |
AU2021409880B2 (en) | Data model and data service for content management system | |
US12001574B2 (en) | Evaluating an access control list from permission statements | |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: CLOUDIAN HOLDINGS, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: OGASAWARA, GARY HAYATO; TSO, MICHAEL M.; REEL/FRAME: 039945/0673. Effective date: 2016-07-25 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
AS | Assignment | Owner name: HERCULES CAPITAL, INC., AS AGENT, CALIFORNIA. Free format text: SECURITY INTEREST; ASSIGNOR: CLOUDIAN HOLDINGS INC.; REEL/FRAME: 047426/0441. Effective date: 2018-11-06 |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2551); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY. Year of fee payment: 4 |
AS | Assignment | Owner name: CLOUDIAN HOLDINGS INC., CALIFORNIA. Free format text: RELEASE BY SECURED PARTY; ASSIGNOR: HERCULES CAPITAL, INC., AS AGENT; REEL/FRAME: 054881/0403. Effective date: 2021-01-11 |
AS | Assignment | Owner name: SILICON VALLEY BANK, CALIFORNIA. Free format text: SECURITY INTEREST; ASSIGNOR: CLOUDIAN HOLDINGS INC.; REEL/FRAME: 064252/0451. Effective date: 2021-01-11 |
AS | Assignment | Owner name: FIRST-CITIZENS BANK & TRUST COMPANY, AS AGENT, CALIFORNIA. Free format text: SECURITY INTEREST; ASSIGNOR: CLOUDIAN HOLDINGS INC.; REEL/FRAME: 064671/0010. Effective date: 2023-08-17 |
AS | Assignment | Owner name: CLOUDIAN HOLDINGS INC., CALIFORNIA. Free format text: RELEASE BY SECURED PARTY; ASSIGNOR: SILICON VALLEY BANK; REEL/FRAME: 068254/0994. Effective date: 2024-08-12 |
AS | Assignment | Owner name: AVIDBANK, CALIFORNIA. Free format text: SECURITY INTEREST; ASSIGNOR: CLOUDIAN HOLDINGS INC.; REEL/FRAME: 068332/0854. Effective date: 2024-08-16 |