WO2012121316A1 - Distributed storage system and method - Google Patents
Distributed storage system and method
- Publication number
- WO2012121316A1 (PCT/JP2012/055917)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- information
- unit
- node
- access
- Prior art date
Classifications
- G06F3/061 — Improving I/O performance
- G06F3/0617 — Improving the reliability of storage systems in relation to availability
- G06F3/065 — Replication mechanisms
- G06F3/067 — Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
- G06F16/10 — File systems; File servers
- G06F16/182 — Distributed file systems
- G06F16/20 — Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/2308 — Concurrency control
- G06F16/2336 — Pessimistic concurrency control approaches, e.g. locking or multiple versions without time stamps
- G06F16/27 — Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- H04L67/1095 — Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
Definitions
- the present invention relates to distributed storage, and more particularly to a distributed storage system, method and apparatus capable of controlling a data structure.
- <Distributed storage system> To realize a system in which a plurality of computers (data nodes, or simply "nodes") are connected to a network and data is stored and used in a data storage unit (HDD (Hard Disk Drive) or memory) of each computer, a distributed storage system (Distributed Storage System) is used.
- In a distributed storage system, file management generally uses a method of separately storing the file body and the file's metadata (file storage location, file size, owner, etc.).
- A meta server method is known as one technique by which a client learns which node holds data: a meta server, configured from one or a plurality of (but a small number of) computers, manages the data location information.
- <Distributed KVS> As another technique for learning the position of the node holding data, there is a technique that obtains the position of the data using a distribution function (for example, a hash function). This type of technique is called, for example, distributed KVS (Key-Value Store).
- In distributed KVS, all clients share the distribution function and a list of the nodes participating in the system (the node list).
- the stored data is divided into fixed-length or arbitrary-length data fragments (Value).
- a uniquely identifiable identifier (Key) is assigned to each data fragment (Value) and stored as a pair of (Key, Value).
- For example, data can be distributed across a plurality of nodes by changing the storage destination node (server) according to the Key value.
- To find data, the Key is used as the input value of the distribution function, and the position of the node storing the data is obtained arithmetically from the output value of the distribution function and the node list.
- The distribution function basically does not change over time (it is time-invariant).
- The contents of the node list, however, change as nodes fail or are added, so the client must be able to obtain the current list by some means.
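- As a minimal illustration of the distributed KVS lookup described above, a client can derive the storage node arithmetically from the shared distribution function and node list. The sketch below is not taken from the patent itself; the function names and the choice of SHA-1 are illustrative assumptions.

```python
import hashlib

# Illustrative sketch of a distributed-KVS lookup (names assumed).
# Shared by all clients: the node list changes as nodes fail or join,
# but the distribution function itself is time-invariant.
NODE_LIST = ["node1:9000", "node2:9000", "node3:9000", "node4:9000"]

def distribution_function(key: str) -> int:
    """Time-invariant hash of the Key (SHA-1 here, an arbitrary choice)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

def node_for_key(key: str, node_list: list) -> str:
    """Derive the storage node from the hash output and the node list."""
    return node_list[distribution_function(key) % len(node_list)]

# A (Key, Value) pair is stored on the node computed from the Key alone,
# so any client sharing NODE_LIST can locate it without a meta server.
print(node_for_key("Stocks:record:42", NODE_LIST))
```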
- <Replication> In a distributed storage system, in order to ensure availability (the ability of the system to operate continuously), replicas of data are generally held on multiple nodes, and the replicas are also used for load balancing.
- Patent Document 1 discloses a technique for realizing load distribution using created copies of data.
- Patent Document 2, found in the prior-document search performed for this application, discloses a configuration in which a server defines an information structure definition body in an information structure definition section, a registration client constructs a database from the information structure definition body and generates a database access tool, and information is registered in the database using this tool.
- Patent Document 3 discloses a distributed storage system in which storage nodes store copies of objects accessible through unique locator values, and a key map stores a key map entry for each object; each key map entry includes, for a given object, the corresponding key value and the locator of each copy of the object.
- JP 2006-12005 A (Patent No. 4528039)
- JP-A-11-195044 (Patent No. 3911810)
- In conventional distributed storage, duplicate data is held on a plurality of nodes with the same physical structure, which realizes access response performance and guarantees availability.
- Because the replicated data is held in the same physical structure, an application with different data usage characteristics requires conversion to another data structure and separate storage for holding that other data structure.
- An object of the present invention is to provide a distributed storage system and method that ensure availability in data replication in distributed storage while achieving at least one of avoiding a decrease in storage use efficiency and avoiding a decrease in response performance.
- the configuration is roughly as follows, although not particularly limited.
- The present invention provides a distributed storage system including a plurality of data nodes, each having a data storage unit and connected to the network, in which at least two of the data-replication-destination data nodes hold data structures that are logically the same among the data nodes but physically different in their respective data storage units.
- A data node device constituting a distributed storage system is network-coupled to other data nodes; when update target data is replicated to a plurality of data nodes, the device holds in its data storage unit a data structure that is logically the same as that of at least one other data node with respect to the data, but physically different.
- A distributed storage method is provided for a system including a plurality of network-connected data nodes each having a data storage unit, in which at least two of the plurality of data nodes hold, in their respective data storage units, a plurality of types of data structures that are logically the same among the data nodes but physically different.
- the plurality of data nodes may perform conversion to a target data structure asynchronously with a data update request.
- the received data is held in an intermediate data holding structure, a response to the update request is returned, and the data structure held in the intermediate data holding structure is changed to a target data structure.
- the data arrangement destination, the data structure of the arrangement destination, and the data division are variably controlled in a predetermined table unit.
- At the time of data replication in distributed storage, at least one of ensuring availability, avoiding a decrease in storage use efficiency, and avoiding a decrease in response performance becomes possible.
- FIG. 6B is a diagram (2) illustrating the operation sequence of Write processing in the first exemplary embodiment of the present invention.
- FIG. 10 is a diagram illustrating a column-based consistent hashing split arrangement in the fourth exemplary embodiment of the present invention.
- The present invention keeps replicas in a plurality of types of data structures that are logically the same among the data arrangement nodes (referred to as "data nodes") but physically different.
- the application timing of data structure conversion performed asynchronously with the write (update) request can be controlled.
- A structure giving priority to Write response characteristics is provided as an intermediate structure (intermediate data holding structure), and the data held in the intermediate structure is asynchronously converted into the target data structure.
- An interface for changing control parameters is provided, and the control parameters are changed according to the access load. For example, when the processing load increases, control such as reducing the partitioning granularity is performed.
- a key value store that can have a plurality of types of data structures.
- Logically the same contents are replicated in physically different data structures (replicas).
- This makes it possible to handle different types of access loads at high speed, and the multiple copies (replicas) kept for availability can be used for other purposes, enabling efficient use of storage capacity.
- For data structure conversion, the data node that receives data from the transmission source may hold the received data in an intermediate structure format instead of converting it to the target structure immediately and in synchronization with replication, and may perform the conversion to the target structure asynchronously.
- By using an intermediate structure that prioritizes response characteristics for access requests (for example, holding data in a buffer and immediately returning a response to a write request) and converting the data held in the intermediate structure to the target structure asynchronously, the required high availability can be maintained while the access-performance bottleneck caused by data structure conversion is avoided. Updating and converting to multiple types of data structures simultaneously on multiple data nodes of a distributed storage system tends to become a performance bottleneck.
- Specifically, a write-specific structure (an intermediate data holding structure that prioritizes write response performance) is prepared; when replication is executed to guarantee availability, the replication is performed into the intermediate structure synchronously (Sync), and the data held in the intermediate structure is then converted into the formal target structure asynchronously (Async).
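- A minimal sketch of this write path, assuming the intermediate structure is a simple in-memory queue (the class and method names below are illustrative, not taken from the patent):

```python
import queue
import threading

class DataNode:
    """Illustrative node holding a write-optimized intermediate queue
    plus a target structure (e.g. a row store or column store)."""

    def __init__(self, target_structure: dict):
        self.intermediate = queue.Queue()   # write-priority intermediate structure
        self.target = target_structure

    def write(self, key, value):
        # Synchronous part: appending to the queue is cheap, so the node
        # can acknowledge the update immediately.
        self.intermediate.put((key, value))
        return "ACK"

    def convert_async(self):
        # Asynchronous part: drain the queue into the target structure.
        while not self.intermediate.empty():
            key, value = self.intermediate.get()
            self.target[key] = value

node = DataNode(target_structure={})
assert node.write("k1", "v1") == "ACK"               # fast acknowledgment
threading.Thread(target=node.convert_async).start()  # conversion runs later
```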
- partitioning can be controlled in units of tables.
- Row store: write-once type (records are appended to the data storage area) or update type. Column store: with or without compression.
- A combination is selected over items such as: a write log (for example, a structure that appends update information, prioritizing write performance); the presence of an index (index data for search); whether the data storage order is sorted; the presence of partitioning and the number of partitions; and the partitioning unit and algorithm.
- Which data node the data is placed on, and which data structure it is held in, are control targets.
- Examples of data structures include a write log and a write-once row table (row-table).
- For example, a row-table is selected for workloads that mix Read and Write.
- Alternatively, a column store (column-oriented database) is selected; the column store method makes storage read access for a query more efficient.
- Furthermore, the granularity of partitioning may be made relatively fine for distributed processing, while for centralized processing the partitions may be made coarser or partitioning may be stopped altogether.
- The trigger for asynchronously converting data held in the intermediate structure to the target structure may also be set as a control target.
- the timing of data conversion may be adjusted according to the data freshness (a measure of data freshness) required by the analysis application.
- In a distributed storage system having a plurality of data placement nodes (data nodes) coupled to the network, each with a data storage unit (12 in FIG. 1), a data update request from a client causes replication data to be stored in the data storage units (12 in FIG. 1) in one or more types of data structures different from the data structure of the replica that received the update request.
- At that time, the data node temporarily holds the data in the intermediate structure, returns a response to the update request to the client, and stores the data in the target data structure asynchronously.
- An apparatus (9 in FIG. 1) holds and manages data structure information (for example, data structure management information and data arrangement specifying information), which is referenced by the data access means (611 in FIG. 1).
- The means for accessing the data node (112 in FIG. 4) determines the data structure (physical structure) for the replication target data using the data structure information. The replicated data can therefore be held in a different data structure on each distributed storage node.
- A replication destination data node has an intermediate structure (also called the intermediate data holding structure or intermediate buffer structure) that prioritizes update processing performance for update requests from clients. The data is temporarily held there, a response to the update request is returned, and conversion to the data structure specified by the data structure management information is executed asynchronously. The response performance of update processing can therefore be maintained while a plurality of types of data structures are maintained.
- Because a plurality of types of data structures exist, the client side can distribute processing to an appropriate data structure according to the access contents (routing each access to a data node holding a suitable data structure), which improves access processing performance.
- For example, conventional techniques cannot variably control the storage format of duplicated data, such as the data location, the data arrangement (internal) structure, and whether data is stored in a distributed or centralized manner.
- In data migration, the migration source storage/database and the migration destination storage/database represent the same data in different data structures.
- In such approaches, the storage capacity required for replication is (data capacity × number of replicas × number of types of data structures). Preparing and using a large amount of hardware such as computers and disks therefore increases operation costs such as purchase cost and power consumption, and a large amount of data copying and data structure conversion processing is required.
- In contrast, according to the present invention, holding duplicate data in a plurality of types of data structures (physical structures) maintains the required high availability and performance, such as high-speed response, while the data structure conversion bottleneck is eliminated and storage usage efficiency is improved.
- FIG. 1 is a diagram illustrating an example of a system configuration according to the first embodiment of this invention.
- The system includes data nodes 1 to 4, a network 5, a client node 6, and structure information management means (structure information management apparatus) 9.
- Data nodes 1 to 4 are the data storage nodes constituting the distributed storage; an arbitrary number of one or more may be provided.
- the network 5 realizes communication between network nodes including the data nodes 1 to 4.
- the client node 6 is a computer node that accesses the distributed storage.
- The client node 6 does not necessarily exist independently. An example in which the data nodes 1 to 4 also serve as client computers will be described later with reference to FIG. 2.
- the data nodes 1 to 4 include data management / processing means (data management / processing units) 11, 21, 31, 41, and data storage units 12, 22, 32, 42, respectively.
- the client node 6 includes client function realization means (client function realization unit) 61.
- the client function realization means 61 accesses the distributed storage composed of the data nodes 1 to 4.
- the client function realizing unit 61 includes a data access unit (data access unit) 611.
- the data access means (data access unit) 611 acquires structure information (data structure management information and data arrangement specifying information) from the structure information management means 9, and uses the structure information to specify an access destination data node.
- The structure information stored in the structure information holding unit 92 of the structure information management means 9 may instead be held in each of the data nodes 1 to 4 themselves, or in an arbitrary device (switch, intermediate node) in the network 5.
- Access to the structure information stored in the structure information holding unit 92 may also be made to a cache provided in the device itself or at a predetermined location.
- Well-known distributed system techniques can be applied to synchronize the cached structure information, so the details are omitted here. As is well known, access performance can be increased by using a cache.
- the structure information management means (structure information management apparatus) 9 includes a structure information change means 91 for changing structure information, and a structure information holding unit 92 for holding structure information.
- the structure information holding unit 92 includes data structure management information 921 (see FIG. 4) and data arrangement specifying information 922 (see FIG. 4).
- As described later with reference to FIG. 5, the data structure management information 921 holds, for each table identifier, one entry per data replica: a replica identifier that identifies the replica, data structure information that identifies the type of data structure corresponding to the replica identifier, and an update trigger, which is the time allowed until the data is stored in that data structure.
- As described later with reference to FIG. 8, the data arrangement specifying information 922 associates with each table identifier the replica identifiers and, for each, one or more data nodes serving as the data arrangement destinations.
- The client node 6 need not be provided independently (separately) from the data nodes 1 to 4. That is, as described below as a modification, the client function realizing unit 61 may be provided in any one or more of the data nodes 1 to 4.
- FIG. 2 is a diagram showing a configuration of a modification of the first embodiment of the present invention. As shown in FIG. 2, a client function realization means 61 is disposed in each of the data nodes 1, 2, 3, 4.
- the client function realization means 61 arranged in the data nodes 1, 2, 3, 4 includes a structure information cache holding unit 612 in addition to the data access means 611 in FIG.
- the structure information cache holding unit 612 is a cache memory that stores part or all of the structure information stored in the structure information holding unit 92.
- Structure information synchronization means (structure information synchronization device) 93 controls synchronization of the structure information caches: it acquires data from the structure information holding unit 92 and updates the information in the structure information cache holding unit 612 of the client function realizing unit 61 of each data node.
- Structure information synchronization means 93 may be provided in any number of devices constituting the system. For example, it may be operated as software on a computer that realizes at least one of the data nodes 1 to 4.
- FIG. 3 shows an example in which the data nodes 1 to 4 of FIG. 1 are realized as individual computers.
- The system includes an arbitrary number (one or more) of data node computers 101 to 104 and a network 105.
- the data node computers 101 to 104 include a CPU 101a, a data storage device 101b, and a data transfer device 101c, respectively.
- the CPU 101a realizes all or part of the functions of the data management / processing unit 21 and the client function realizing unit 61.
- The data storage device 101b is, for example, a hard disk drive, a flash memory, a DRAM (Dynamic Random Access Memory), an MRAM (Magnetoresistive Random Access Memory), or a FeRAM (Ferroelectric Random Access Memory).
- It may also be a physical medium capable of recording data, such as magnetic tape, or a control device that records data on a medium installed outside the storage node.
- The network 105 and the data transfer apparatus 101c may use, for example, Ethernet (registered trademark), Fibre Channel, FCoE (Fibre Channel over Ethernet (registered trademark)), InfiniBand (a high-speed I/O bus architecture promoted by organizations including Intel Corporation), or QsNet (Quadrics).
- The implementation method of the network 105 is not limited to these.
- the data transfer apparatus 101c is configured by a network card connected to a computer, and the network 105 is configured by an Ethernet (registered trademark) cable and a switch.
- the realization of the data nodes 1 to 4 may be a virtual machine (Virtual Machine).
- Representative examples include VMWare (product of VMWare), Xen (trademark of Citrix) and the like.
- FIG. 4 is a diagram for explaining the configuration example of the first embodiment of the present invention in more detail.
- FIG. 4 shows a configuration centered on the data nodes 1 to 4 of FIG.
- the structure information stored in the structure information holding unit 92 may be referred to by the reference numeral 92 for simplification.
- the data management / processing unit 11 of the data node includes an access receiving unit (access receiving unit) 111, an access processing unit (access processing unit) 112, and a data structure converting unit (data structure converting unit) 113.
- the access receiving unit 111 receives an access request from the data access unit 611 and returns a response to the data access unit 611 after the processing is completed.
- The data storage unit 12 includes a plurality of types of structure-specific data storage units: a structure-specific data storage unit 121 (data structure A), a structure-specific data storage unit 122 (data structure B), and a structure-specific data storage unit 123 (data structure C).
- The structure-specific data storage unit 121 (for example, data structure A) has a structure specialized for response performance to processing involving data writes (data addition or update). Specifically, it is implemented, for example, as software that queues the data change contents (for example, FIFO (First In First Out)) in high-speed memory (dual-port RAM, etc.), or as software that appends the access request contents as a log to an arbitrary storage medium.
- the data structure B and the data structure C are data structures different from the data structure A and have different data access characteristics.
- the data storage unit 12 is not necessarily a single storage medium.
- the data storage unit 12 of FIG. 4 may be realized as a distributed storage system including a plurality of data placement nodes, and the structure-specific data storage units 12X may be distributed and stored.
- the data arrangement specifying information 922 is information (and means for storing and acquiring information) for specifying the storage location of data or data fragments stored in the distributed storage. As described above, for example, a meta server method or a distributed KVS method is generally used as the data distribution method.
- In the meta server method, the information for managing data locations (for example, a block address and the corresponding data node address) is the data arrangement specifying information 922.
- By querying the meta server, the location of the necessary data can be learned.
- In the distributed KVS method, the list of nodes participating in the system corresponds to this data arrangement specifying information.
- From the key and the node list, the data node serving as the storage destination can be determined.
- The data access means 611 uses the data arrangement specifying information 922 in the structure information management means 9 (or cached copies of it stored at predetermined locations) to identify the data node to be accessed among the data nodes 1 to 4, and issues an access request to the access receiving means 111 of that data node.
- the data structure management information 921 is parameter information for specifying a data storage method for each data set.
- FIG. 5 is a diagram showing an example of the data structure management information 921 of FIG.
- In this embodiment, the unit for controlling the data storage method is a table: for each table (for each table identifier), a replica identifier, a data structure type, and update trigger information are prepared for each of the data replicas.
- each table holds three replicas to ensure (hold) availability.
- the replica identifier is information for identifying each replica, and is assigned as 0, 1, and 2 in FIG.
- the data structure is information indicating a data storage method.
- different types of data structures (A, B, C) are designated for each replica identifier.
- FIG. 5B shows examples of data structures A, B, and C.
- the replica identifier 0 of the table identifier “Stocks” is stored as the data structure B (row store).
- Each data structure denotes a method for storing data.
- For example, the queue (QUEUE) is a linked list, and the row store (ROW STORE) stores the records of the table in row (ROW) order.
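- The management table of FIG. 5 can be pictured as a per-table list of (replica identifier, data structure, update trigger) entries. In the sketch below, the 30 sec trigger for replica 0 of Stocks follows the example in the text; the field names and the remaining values are assumptions.

```python
# Illustrative model of FIG. 5-style data structure management information.
# One entry per replica of each table: which physical structure holds the
# replica, and how long (sec) updates may stay in the intermediate structure.
DATA_STRUCTURE_MANAGEMENT = {
    "Stocks": [
        {"replica": 0, "structure": "ROW_STORE",    "update_trigger": 30},
        {"replica": 1, "structure": "ROW_STORE",    "update_trigger": 60},   # assumed
        {"replica": 2, "structure": "COLUMN_STORE", "update_trigger": 120},  # assumed
    ],
}

def structure_of(table: str, replica: int) -> str:
    """Look up the data structure type designated for one replica."""
    for entry in DATA_STRUCTURE_MANAGEMENT[table]:
        if entry["replica"] == replica:
            return entry["structure"]
    raise KeyError((table, replica))

print(structure_of("Stocks", 2))  # -> COLUMN_STORE
```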
- Fig. 6 shows an example of the data holding structure of the table.
- the table in FIG. 6A includes a Key column and three Value columns, and each row includes a set of Key and three Value.
- The row store and the column store denote formats whose storage order on the storage medium is row (row) based and column (column) based, respectively.
- As storage methods for the table of FIG. 6A, the data of replica identifiers 0 and 1 is held in data structure B (row store) (see (B) and (C) of FIG. 6), and the data of replica identifier 2 is held in data structure C (column store) (see (D) of FIG. 6).
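- To make the physical difference concrete, the following sketch (with assumed helper names) shows how the FIG. 6A table, with a Key column and three Value columns, lays out on the medium in row-store versus column-store order:

```python
# Illustrative sketch; the logical table of FIG. 6A, one tuple per row.
TABLE = [
    ("k1", "a1", "b1", "c1"),
    ("k2", "a2", "b2", "c2"),
    ("k3", "a3", "b3", "c3"),
]

def row_store(table):
    """Row store: records are laid out row by row on the medium."""
    return [cell for row in table for cell in row]

def column_store(table):
    """Column store: each column's cells are stored contiguously."""
    return [cell for col in zip(*table) for cell in col]

print(row_store(TABLE))     # k1 a1 b1 c1 k2 a2 b2 c2 ...
print(column_store(TABLE))  # k1 k2 k3 a1 a2 a3 ...
```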
- The update trigger in the data structure management information 921 is the time allowed until the data is stored in the designated data structure.
- For example, 30 sec is specified for replica identifier 0 of Stocks. This indicates that, in the data node storing data structure B (row store) for replica identifier 0 of Stocks, a data update is to be reflected in the structure-specific data storage unit 122 of the row store method within 30 sec. Until the update is reflected, the data is held in an intermediate structure such as a queue, and the response to the client is returned while the data is in the intermediate structure. In the present embodiment, conversion to the designated data structure is performed asynchronously with the update request.
- FIG. 7 is a diagram schematically illustrating an example of table data retention and asynchronous update.
- Each data node accepts update contents into an intermediate structure that excels in Write (update request) response speed.
- a process completion response is returned to the update request source client.
- Update data written to the intermediate structure of each data node (also referred to as the Write intermediate structure, write-priority intermediate structure, or "intermediate data holding structure") is converted asynchronously (Async) into data structures B and C at the respective data nodes.
- On a Write, the data (data structure A) is stored in the Write intermediate structure of the data node with replica identifier 0, and is replicated synchronously (Synchronous) to the data nodes of replica identifiers 1 and 2, where it is likewise temporarily stored and held in their Write intermediate structures.
- That is, replication between data nodes of the update data (data structure A) written to one data node's Write intermediate structure is performed in synchronization with the write (update).
- For data that is not read immediately after being written, this increases the write response speed.
- For simplicity of explanation, the number of types of data structures is set to three (A, B, and C), but it goes without saying that the number is not limited to three; there may be any plurality of types with different characteristics.
- Likewise, although queue, column store, and row store are illustrated as examples of data structures, the data structure is not limited to these. For example, a row store with or without an index, row stores differing in the columns over which an index is created, and a row store format that stores updates in an append-only structure are also possible.
- A data storage program may be specified instead of a data structure type.
- For example, a program A that stores data in a queue may be designated in place of data structure A in FIG. 5, and different database software may be designated for data structures B and C.
- In this case, the data node storing a replica identifier for which program A is designated processes the received data by executing program A.
- FIG. 8 shows an example of the data arrangement specifying information 922 of FIG. 4. An arrangement node is designated for each of the replica identifiers 0, 1, and 2 of each table identifier. This corresponds to the meta server method described above.
- When distributed KVS is used, the data arrangement specifying information 922 corresponds to the node list (not shown) of the nodes participating in the distributed storage.
- In that case, the placement node can be specified by the consistent hashing method using "table identifier" + "replica identifier" as the key information; a replica can also be placed on the adjacent node in the consistent hashing ring as the replica placement destination.
- The consistent hashing method is described in the fourth embodiment.
- To guarantee availability, the placement nodes must be specified so that replicas of the same table are not held on the same node.
- For example, the placement nodes of replica identifiers 0, 1, and 2 of the Stocks table in FIG. 5A must not overlap one another. This restriction does not apply if availability is disregarded; that is, a plurality of replica types may then be held on the same node.
- FIG. 9 is a diagram showing a sequence of a write process (a process accompanied by an update) in the first embodiment of the present invention described with reference to FIGS.
- First, the client function realizing means 61 acquires the data arrangement specifying information 922 (see FIGS. 4 and 8) held in the structure information holding unit 92 of the structure information management means 9 (or from a cache memory at an arbitrary location).
- the client function realization means 61 uses the acquired information to issue a write access command to the data node (data node 1 with replica identifier 0 in FIG. 9) where data to be written is placed.
- The access receiving means 111 of the data node 1 receives the write access request (write processing request) and transfers the write access to the data nodes 2 and 3 designated for replica identifiers 1 and 2.
- To obtain the necessary data structure management information 921, the data node 1 may access the structure information holding unit 92 (or an appropriate cache), or all or part of the data structure management information 921 may be delivered within the Write access command issued by the client function realizing unit 61.
- the access processing means 112 of each data node processes the received write access request.
- the access processing means 112 refers to the information of the data structure management information 921 and executes the write process.
- Normally, the write contents are stored in the structure-specific data storage unit 121 of data structure A (the intermediate structure).
- When so specified, the data is instead stored directly in the structure-specific data storage unit 12X having the data structure given in the data structure management information 921.
- the access processing unit 112 issues a completion notification to the access receiving unit 111 after the write processing is completed.
- The replica destination data nodes (2, 3) return write completion responses to the access receiving means 111 of the data node 1.
- The access receiving means 111 waits for the completion notification from the access processing means 112 of the data node 1 and the completion notifications from the replica destination data nodes 2 and 3, and responds to the client function realizing unit 61 after receiving all of them.
- The data structure conversion means 113 (see FIG. 4) periodically converts the data in the structure-specific data storage unit 121 (data structure A) into the final target data structure specified in the data structure management information 921 and stores it in the corresponding structure-specific data storage unit 12X.
- In FIG. 9, the data node 1 transfers the write access to the replica destination data nodes 2 and 3.
- Alternatively, the write access may be issued by the client to all of the data nodes.
- FIG. 10 differs from FIG. 9 in that the client function realizing means 61 itself issues the write access request to each data node and waits for their responses.
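- The FIG. 9 sequence can be summarized as follows: the data node holding replica identifier 0 accepts the write into its intermediate structure, forwards it synchronously to the replica destinations, and responds to the client only after all completion notifications arrive. The sketch below is a simplification under those assumptions; all names are illustrative.

```python
# Illustrative sketch of the FIG. 9 write sequence, not the actual implementation.
class Node:
    def __init__(self, name):
        self.name = name
        self.intermediate = []          # write-priority intermediate structure

    def accept_write(self, record, replicas=()):
        self.intermediate.append(record)                   # local store
        acks = [r.accept_write(record) for r in replicas]  # synchronous replication
        assert all(a == "DONE" for a in acks)
        return "DONE"                   # completion response to the caller

node1, node2, node3 = Node("n1"), Node("n2"), Node("n3")
# The client sends the write to the node holding replica identifier 0;
# node1 forwards it to the replica destinations before responding.
assert node1.accept_write({"key": "k1", "value": "v1"},
                          replicas=(node2, node3)) == "DONE"
```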
- FIG. 11 is a diagram showing a sequence of reference system processing (READ processing) in the first embodiment of the present invention.
- the client computer (client node) 6 acquires the information of the data structure management information 921 and specifies the instruction execution destination node. Any of the replica identifiers may be used as the node where the replica data is arranged, but it is desirable to select an appropriate node according to the processing to be performed.
- Reference system processing refers to processing involving data reading; it corresponds, for example, to a Select statement in SQL (Structured Query Language). Reading data from a table A corresponds to reference processing.
- A process that updates table A after referring to it may be handled collectively as a write process (described with reference to FIGS. 9 and 10), or the reference to table A may be handled as a reference process and the update of table A as an update process.
- FIG. 12 is a flowchart for explaining the operation of access processing from the viewpoint of the client function realization means 61. The client access flow will be described with reference to FIG.
- the client function realization means 61 acquires the information in the structure information holding unit 92 by accessing the master or a cache at an arbitrary location (step S101 in FIG. 12).
- In step S102, it is identified whether the command issued by the client is a write process or a reference process (Read).
- The caller may explicitly specify the processing type when invoking an instruction through the client function realizing means 61 (an API (Application Program Interface) for this purpose may be prepared).
- If the result of step S102 is a write process, the process proceeds to step S103 and subsequent steps.
- the client function realization means 61 specifies a node that needs to be updated using information in the data arrangement specifying information 922. This process is as described with reference to FIG.
- the client function realization means 61 issues a command execution request (update request) to the identified data node (step S103).
- the client function realization means 61 waits for a response notification from the data node to which the update request is issued, and confirms that the update request is held in each data node (step S104).
- FIG. 12 is a flowchart for explaining the operation of the client function realization means 61 corresponding to the sequence of FIG. 10 in which the client function realization means 61 issues a command to the update destination data node and waits.
- If the result of step S102 is a reference process, the process proceeds to step S105.
- First, the client function realizing means 61 identifies (recognizes) the characteristics of the processing content (step S105).
- the client function realization means 61 performs a process of selecting an access target data node and issuing a command request based on the specified processing characteristics and other system conditions (step S106).
- the client function realization means 61 then receives the access processing result from the data node (step S107).
- The processing of steps S105 and S106 is described in more detail below.
- the client function realization means 61 can know the type of the data structure in which the data to be accessed is held from the information stored in the data structure management information 921. For example, in the case of the example of FIG. 5A, when accessing the WORKERS table, the replica identifiers 0 and 1 are the data structure B, and the replica identifier 2 is the data structure C.
- the client function realization means 61 determines which data structure is suitable for data access performed on the data node, and selects a suitable one.
- For example, the client function realizing means 61 analyzes the SQL statement of the access request: for an instruction that takes the sum of a column in the table whose identifier is "WORKERS", it determines that data structure C (column store) is suitable, and for an instruction that selects and retrieves a specific record, it determines that data structure B (row store) is suitable.
- In the latter case, replica identifier 0 or 1 may be selected.
- Because an update trigger (for example, 30 sec) allows recently written data to remain unreflected for that long, the command passed to the client function realizing means 61 may be in a format that explicitly specifies the data structure to be used and the required data freshness (data freshness).
- In this way, the data node to be accessed is calculated.
- The selection of the access node may also be changed according to the state of the distributed storage system. For example, if a table is stored in data nodes 1 and 2 in the same data structure B and the access load on data node 1 is large, the operation may be changed to select data node 2.
- Likewise, even if the access contents to be processed are best suited to data structure B, when the access load of the data node 3 is smaller than that of the data nodes 1 and 2, the access request may be issued to the data node 3 (data structure C).
- the client function realization means 61 issues an access request to the data node calculated and selected in this way (step S106), and receives an access processing result from the data node (step S107).
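- A sketch of the read-side selection of steps S105 and S106, assuming a crude keyword classifier over the SQL text (a real implementation would parse the statement properly; the table contents and names are illustrative):

```python
# Illustrative sketch of replica selection by query type and load (names assumed).
STRUCTURE_BY_REPLICA = {
    "WORKERS": {0: "ROW_STORE", 1: "ROW_STORE", 2: "COLUMN_STORE"},
}
PLACEMENT = {("WORKERS", 0): "node1", ("WORKERS", 1): "node2",
             ("WORKERS", 2): "node3"}

def preferred_structure(sql: str) -> str:
    # Column aggregates favor the column store; point lookups the row store.
    return "COLUMN_STORE" if "SUM(" in sql.upper() else "ROW_STORE"

def choose_node(table: str, sql: str, load: dict) -> str:
    want = preferred_structure(sql)
    candidates = [r for r, s in STRUCTURE_BY_REPLICA[table].items()
                  if s == want] or list(STRUCTURE_BY_REPLICA[table])
    # Among the suitable replicas, pick the least-loaded placement node.
    return min((PLACEMENT[(table, r)] for r in candidates),
               key=lambda n: load.get(n, 0))

print(choose_node("WORKERS", "SELECT SUM(salary) FROM WORKERS",
                  {"node1": 5, "node2": 1, "node3": 2}))  # -> node3
```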
- FIG. 13 is a flowchart for explaining access processing in the data node of FIG. The operation of the data node will be described in detail with reference to FIGS.
- the access receiving unit 111 of the data management / processing unit 11 of the data node receives an access processing request (step S201 in FIG. 13).
- the access receiving unit 111 of the data management / processing unit 11 of the data node determines whether the content of the received processing request is a write process or a reference process (step S202).
- In step S203, the access processing means 112 of the data management / processing means 11 of the data node acquires the data structure management information 921 from the structure information holding unit 92.
- The data structure management information 921 may be obtained by accessing the master data or by accessing cache data at an arbitrary location; alternatively, the client function realizing unit 61 of FIG. 1 or FIG. 2 may attach the information to the request, and the access processing means 112 may use the attached information.
- Next, the access processing means 112 determines from the data structure management information 921 whether the update trigger for the processing at this data node is "0" (zero) (step S204). If it is "0", the data is stored directly in the designated data structure (step S205).
- Otherwise, the access processing means 112 stores the update data in the Write intermediate structure (structure-specific data storage unit 121) (step S206).
- After the processing of step S205 or S206 is completed, the access receiving means 111 returns a processing completion notification to the requesting client function realizing unit 61 (step S207).
- If the result of step S202 is a data reference process, the reference process is executed (step S208).
- In the first method, processing uses the data in the data storage unit having the data structure specified in the data structure management information 921.
- This gives the best performance, but if the update trigger is large, Write data still held in the intermediate structure may not be reflected in the reference processing, so data inconsistency can occur.
- This poses no particular problem if the application developer recognizes it in advance, if it is known that reads do not occur within the update trigger window after a Write, or if accesses requiring the newest data are directed to replica identifiers whose update trigger is "0".
- The second method performs processing after waiting for the separately executed conversion processing to be applied. This is easy to implement, but response performance deteriorates; for applications that do not require response performance, this is not a problem.
- The third method reads and processes both the data in the data structure specified in the data structure management information 921 and the data held in the Write intermediate structure. The latest data can then always be returned, but performance is worse than with the first method.
- Any of the first to third methods may be used; a plurality of them may be implemented, and the method to be executed may be specified in the processing command issued from the client function realizing unit 61 or described in a system setting file.
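- The third method can be sketched as a merge of the converted store with the not-yet-converted intermediate entries. This is an illustrative reading with assumed names; newer intermediate entries take precedence.

```python
# Illustrative sketch of the third reference method (merge read).
def read_merged(key, target_store: dict, intermediate: list):
    """Consult the Write intermediate structure first, since it holds
    updates not yet converted into the target structure."""
    for k, v in reversed(intermediate):   # newest pending update wins
        if k == key:
            return v
    return target_store.get(key)          # fall back to the converted data

target = {"k1": "old"}
pending = [("k1", "new")]                 # written, not yet converted
assert read_merged("k1", target, pending) == "new"   # always-fresh read
assert read_merged("k1", target, []) == "old"        # first method's view
```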
- FIG. 14 is a flowchart showing the operation of the data conversion process in the data structure conversion means 113 of FIG. The data conversion process will be described with reference to FIGS.
- the data structure conversion means 113 waits for a call due to the occurrence of a timeout at a timer (not shown in FIG. 4) in the data node in order to periodically determine whether or not conversion processing is necessary (step S301 in FIG. 14).
- This timer may be provided in the data structure conversion means 113 as a dedicated timer.
- The timeout period of the timer corresponds to the update trigger (sec) in FIG. 5.
- When called, the data structure conversion means 113 acquires the structure information from the structure information holding unit 92 (step S302) and determines whether any data structure needs conversion (step S303). For example, when the timer makes the determination every 10 seconds, a data structure whose update trigger is 20 seconds executes its conversion every 20 seconds, so no conversion is needed at the 10-second tick.
- If no conversion processing is necessary, the process returns to waiting for a timer call (waiting until called by the occurrence of a timeout) (step S301).
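- A sketch of this loop, assuming a 10-second timer and per-structure update triggers taken from FIG. 5-style information (the trigger values and names below are illustrative):

```python
# Illustrative sketch of the FIG. 14 timer-driven conversion check.
UPDATE_TRIGGERS = {"replica0": 30, "replica2": 20}   # sec; assumed values
last_converted = {r: 0.0 for r in UPDATE_TRIGGERS}

def on_timer(now: float, convert):
    # Steps S302/S303: only structures whose update trigger has elapsed
    # since their last conversion need processing on this tick.
    for replica, trigger in UPDATE_TRIGGERS.items():
        if now - last_converted[replica] >= trigger:
            convert(replica)              # step S305: drain the intermediate structure
            last_converted[replica] = now

# With a 10 sec timer, "replica2" (trigger 20 sec) converts on every second
# tick, so nothing happens at the 10-second mark.
for tick in range(0, 40, 10):
    on_timer(float(tick), lambda r: print(f"t={tick}s convert {r}"))
```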
- A second embodiment of the present invention will be described.
- In this embodiment, data is divided into predetermined units and stored across a plurality of data nodes.
- The basic configuration of the system of the present embodiment follows FIGS. 1, 2, 4, and the like, but as described with reference to FIGS. 15 and 16, the contents of the data structure management information 921 and the data arrangement specifying information 922 are expanded.
- This embodiment also differs from the first embodiment in that the access receiving means of a data node issues access requests not only to its own access processing means but also to the access processing means of other data nodes, and in that the data structure conversion means issues conversion requests to the data structure conversion means of other data nodes.
- the configuration of the data node in the present embodiment basically follows FIG. 4, but the details will be described later with reference to FIG.
- In this embodiment, the data to be stored (table identifier) can be partitioned (divided) for each replica storage unit (replica identifier), and the divided units can be stored on different data nodes.
- FIG. 15 is a diagram showing an example of the data structure management information 921 (see FIG. 4).
- the data structure management information 921 includes a replica identifier corresponding to the number of replicas and the number of partitions corresponding to the replica identifier for the table identifier.
- the replica identifier whose partition number is “1” stores a replica (replica) in one data node.
- the operation in that case is the same as that of the first embodiment.
- FIG. 16 is a diagram illustrating an example of the data arrangement specifying information 922 in that case.
- the number of partitions of the replica identifier 2 of the table identifier “WORKERS” is “4”.
- node numbers 2, 3, 5, and 6 are designated as “arrangement nodes” of the replica identifier 2 of the table identifier “WORKERS”.
- The placement nodes are determined so as to maintain, for each table identifier, the required availability level assumed for the entire system. This may be done manually, or the contents of the data structure management information 921 in FIG. 15 and the data arrangement specifying information 922 in FIG. 16 may be generated automatically by a program.
- The availability level is determined by the number of replicas. If the required availability level is three replicas, three replica identifiers are prepared and their placement nodes are chosen so as not to overlap one another.
- In FIG. 16, the placement nodes of the replica identifiers of the table identifier "WORKERS" are specified so as not to overlap one another.
- Four or more replica identifiers may also be prepared. For example, when there are four replica identifiers and the required availability level remains "3", placement nodes may coincide for at most one pair of replica identifiers of the same table identifier (for example, among the four replica identifiers, two may have overlapping placement nodes).
- Note that when partitioning is used, placement nodes can be allocated redundantly (overlapping).
- For example, suppose two replicas, each divided into 12 partitions, are to be stored in row store format (data structure B) on the data nodes numbered 1-18; they cannot be stored unless overlap is allowed.
- In this case, the placement nodes can be allocated with overlap while still satisfying two-replica availability as follows: replica identifier 0 is divided across node numbers 1-12, and replica identifier 1 across node numbers 7-18.
- For example, with node numbers 1-6 of replica identifier 0 holding the first half of the column values and node numbers 7-12 holding the second half, the partitioning can be arranged so that the same record is never stored on the same node. By doing so, availability is satisfied while the allocation of placement nodes overlaps.
- the placement node destination is determined so as to satisfy the availability level specified for each system or table identifier.
- the access destination at the time of updating a replica identifier having a partition number greater than “1” may be any of the placement node groups.
- the first node in the list may always be selected (for example, in the case of the replica identifier “2” of the table identifier “WORKERS”, the data node having the node number 2).
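- The overlapping placement example above (two replicas, 12 partitions each, nodes 1-18) can be sketched as follows; the offset scheme is one illustrative way to keep the two copies of any record on different nodes, not the patent's prescribed algorithm:

```python
# Illustrative sketch of overlapping partition placement (assumed scheme).
def placement(partition: int) -> tuple:
    """Node numbers holding the two copies of a record in `partition` (0-11).
    Replica 0 occupies nodes 1-12 and replica 1 occupies nodes 7-18; nodes
    7-12 serve both replicas, yet the two copies of any single record always
    land on different nodes, preserving two-replica availability."""
    node_replica0 = 1 + partition          # nodes 1..12
    node_replica1 = 7 + partition          # nodes 7..18
    return node_replica0, node_replica1

for p in range(12):
    n0, n1 = placement(p)
    assert n0 != n1        # no record has both copies on one node
print(placement(6))        # -> (7, 13): node 7 is shared across replicas,
                           # but never for the same record
```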
- the conversion process from the structure-specific data storage unit 121 to the structure-specific data storage units 122 and 123 in the data structure conversion unit 113 is somewhat simplified.
- The storage destination may be determined based on the value of a column of the table, a unique key range, or the like, as described above.
- Compared with the first embodiment, the conversion processing (step S305 in FIG. 14) in the data structure conversion means 113 (see FIG. 17) and the data structure update processing at the update trigger (step S205 in FIG. 13) differ in that the data storage unit at the designated placement destination node is updated.
- When the records to be processed span nodes, the access receiving means 111 must issue access requests to the access processing means 112 of every data node storing such records (see FIG. 17).
- the selection of the necessary data nodes depends on the distributed placement strategy.
- FIG. 17 is a diagram showing the configuration of the second exemplary embodiment of the present invention, and shows the configuration of the data nodes 1 to X.
- When issuing an access request, the access receiving means 111 may issue requests not only to the access processing means 112 in its own node but also to the access processing means 112 of other nodes.
- The data structure conversion means 113 periodically determines whether conversion processing is necessary; when converting a data structure, it also issues data conversion requests to the data structure conversion means 113 of the other data nodes storing the partitioned data.
- data can be divided and stored in a plurality of data nodes.
- the data structure management information 921 is changed according to the access load.
- This makes it possible to correct inappropriate data structure settings (the assignment of a data structure to each replica identifier, as in FIG. 5) and to adapt to changes in access patterns after system operation begins.
- the operation of autonomously changing the control parameters for realizing this will be described.
- FIG. 18 is a diagram showing a configuration of a data node according to the third exemplary embodiment of the present invention.
- a history recording unit 71 and a change determination unit (change determination unit) 72 are added.
- the access receiving unit 111 (or other arbitrary unit) of each data node according to the present embodiment operates to record the received access request in the history recording unit 71.
- the history recording unit 71 records an access request (or access processing content) for each replica identifier of each table identifier.
- The history recording unit 71 may be a single unit for the entire system, or each data node may include its own history recording unit 71 and individually record the access requests for each replica identifier of each table identifier, with a mechanism provided to aggregate them by some method.
- the change determination means (change determination unit) 72 uses the history information stored in the history recording unit 71 to determine whether to convert the data structure.
- One change determination unit 72 may be provided for the entire system, or the change determination unit 72 may be operated in a distributed manner at each data node.
- the change determining means 72 issues a data structure conversion processing request to the structure information changing means 91 when structure conversion is required.
- the structure information changing unit 91 changes the information in the structure information holding unit 92 in response to the conversion processing request from the change determining unit 72, and further requests the data structure conversion unit 113 in the data management/processing unit 11 of the target data node to perform the conversion processing.
- FIG. 19 is a flowchart for explaining the control operation in the present embodiment shown in FIG. 18.
- the execution cycle is arbitrary; if the cycle is lengthened, for example, consistency must be maintained with any change processing that is still being executed.
- change processing may be performed in response to detection of a predetermined event.
- An event is, for example, the detection of a load change by any component of the system (for example, a large change in the hardware utilization, such as CPU or disk, of some data nodes).
- FIG. 19 shows the determination of whether structure conversion processing is necessary, and the conversion processing itself, for each table identifier. The flow of FIG. 19 must be performed for all table identifiers held and managed by the system.
- the conversion determination unit 72 acquires access history information of the history recording unit 71 (step S401).
- the conversion determination unit 72 determines, from all the access contents received for the corresponding table identifier within a recent fixed period (for example, within the last day or the last week), whether any replica has a data structure suitable for those accesses (step S402).
- If the accepted access content is suited by the data structure of one of the replica identifiers, the process proceeds to step S403.
- Having a suitable data structure for one of the replica identifiers means, for example, that when an access request requiring column access is accepted, the data structure of some replica identifier is a column store structure.
- In step S403, the conversion determination unit 72 determines whether any replica identifier has an unnecessary data structure. For example, when the history contains no access requests requiring column access but there are many column store structures, those are unnecessary data structures.
- If there is no unnecessary data structure, the conversion determination unit 72 ends the flow because no conversion process is needed. On the other hand, if there is an unnecessary data structure, the process proceeds to step S404.
- In step S404, the conversion determination unit 72 determines, from the data structure of each replica identifier and the amount and content of access requests, whether the data structure can be changed.
- the determination as to whether or not the data structure can be changed is made based on, for example, a predefined rule.
- Examples of the rules include the following. Although not particularly limited, each rule takes the if-then form <Condition> then <Action> (the action is executed when the condition is satisfied).
- R4: Increase the number of partitions when the number of read processing requests for the table identifier exceeds a certain number (or, conversely, decrease it).
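- As a rough illustration of the <Condition> then <Action> rule form, the following sketch evaluates a rule like R4 against access-history statistics. The threshold, the statistics layout, and the rule representation are assumptions made for illustration, not values taken from the patent.

```python
# Hedged sketch of rule-driven structure-change determination. A rule is a
# (condition, action) pair; conditions are checked against history stats.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    condition: Callable[[Dict], bool]  # evaluated on access-history statistics
    action: Callable[[Dict], None]     # executed when the condition is satisfied

def make_rules(read_threshold=1000) -> List[Rule]:
    def r4_condition(stats):
        # R4-style condition: read requests for the table exceed a threshold.
        return stats["read_requests"] > read_threshold

    def r4_action(stats):
        stats["partitions"] += 1       # increase the number of partitions

    return [Rule("R4", r4_condition, r4_action)]

def evaluate(rules, stats):
    for rule in rules:
        if rule.condition(stats):      # <Condition> then <Action>
            rule.action(stats)

if __name__ == "__main__":
    stats = {"read_requests": 1500, "partitions": 4}
    evaluate(make_rules(), stats)
    print(stats)  # {'read_requests': 1500, 'partitions': 5}
```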
- If it is necessary to change the data structure or the number of replicas in step S404, the process proceeds to step S405; if not, the conversion determination unit 72 ends the flow.
- In step S405, the data structure is actually converted by the conversion determination unit 72, the structure information changing unit 91, the data structure conversion unit 113, and so on.
- For a table identifier whose number of replicas is to be increased, the number of replicas is incremented by one in the data structure management information 921 of the structure information management unit 9, a unique replica identifier is assigned, and the placement destination node is determined.
- the placement node is determined in the same manner as in the first embodiment, but may overlap with other placement nodes as long as a number of replicas equal to or higher than the availability level is maintained.
- data for the new replica identifier is then replicated from an existing replica of the same table to the placement destination node.
- The operation of converting the data structure in step S405 will be described in more detail with reference to FIGS. 20 and 21. For simplicity, FIGS. 20 and 21 assume that the replica identifier is not partitioned. In the following, the conversion process of the data structure conversion unit 113 in FIG. 18 is described using an example of converting the data structure from B to C.
- FIG. 20 is a flowchart for explaining the data structure conversion operation in this embodiment.
- In step S501 (that is, step S405 of FIG. 19), the conversion determination means 72 issues a change request for the data structure management information 921 (FIG. 4) held in the structure information holding unit 92 (FIG. 16).
- the structure information changing unit 91 makes a conversion processing request to the data structure converting unit 113 of the data node X that is the change destination.
- In step S502, the data node X holding the data of the replica identifier to be changed creates a local replica of the corresponding replica identifier.
- This local replication may use a storage-based snapshot technique instead of a physical copy. Alternatively, the replica identifier data of other nodes may be used as the conversion source data without copying. Depending on how the conversion processing is implemented, this duplication processing is not necessarily required.
- In step S503, as the structure conversion process, the data structure conversion unit 113 reads the conversion source data from the data storage unit and writes the conversion destination data in a different data structure.
- Next, the update data accumulated in the data storage unit of data structure A during the conversion process (or already present at the start of the conversion process) is applied to the conversion destination data structure (step S504).
- Finally, the conversion source data is deleted (step S505). The conversion source data does not necessarily have to be deleted, but deleting it improves memory utilization efficiency.
- FIG. 21 is a diagram for explaining the processing in the data node during the conversion processing in the present embodiment shown in FIG. 18.
- During the conversion process (steps S502 to S504), the access processing unit 112 responds to access requests using data structure A and data structure B.
- The update processing is held in data structure A (the intermediate structure for Write) and is not applied to data structure B (Row-Store) while the data structure conversion unit 113 performs the conversion.
- After the conversion, the access processing unit 112 processes access requests using data structure A, the intermediate structure for Write, and the conversion destination data structure C (Column Store).
- Alternatively, the data node undergoing the data structure conversion may not be accessed at all, with the data of another replica identifier being used instead.
- In that case, part of the exclusion processing of the access processing means 112 during the data structure conversion becomes unnecessary, and the system configuration is simplified.
- <Partition number change operation> FIGS. 22 and 23 are flowcharts for explaining the operation of changing the number of partitions in this embodiment.
- The partition number changing process can be expressed as the same flowchart as FIG. 19. In the following, FIG. 22 is described focusing on its differences from FIG. 19.
- In addition to the number of partitions, the distribution strategy may also be changed. An example of a change of distribution strategy is a change from round-robin distribution to distribution by the value range of an arbitrary column, or vice versa.
- In step S602, the conversion determination unit 72 determines whether a number of distributions (partitions) sufficient for the required performance is maintained with respect to the amount of access request processing (for processing such as scanning all the data, it is in many cases advantageous in terms of performance to perform data-parallel processing). If the number of distributions is necessary and sufficient, the process proceeds to step S603; if it is not, the process proceeds to step S604.
- In step S603, the conversion determination unit 72 determines whether any replica identifier is divided unnecessarily; for example, a replica identifier that is excessively distributed even though there are few requests for data-parallel access processing falls under this category. If unnecessary division has been made, the process proceeds to step S604; if not, the flow ends.
- In step S604, the conversion determination unit 72 determines whether the number of partitions needs to be changed. As described above, the change to the number of partitions is determined based on arbitrarily specified rules. If no change is necessary, the conversion determination unit 72 ends the flow; if a change is necessary, it changes the number of partitions (step S605). Step S605 is the processing that actually changes the number of partitions.
- FIG. 23 shows the flow of step S605 of FIG. 22 (the partition number changing process by the conversion determination means 72). In the following, FIG. 23 is described focusing on its differences from FIG. 20.
- The local replica created in step S702 is prepared so that it can be used to respond to access requests during the conversion processing, as in FIG. 21.
- Step S703 is a process of copying, to the change destination data node, the data of the records whose placement node changes as a result of changing the number of partitions.
- Step S704 is substantially the same as S504 in FIG. 20, except that the update processing content accumulated in data structure A during the data structure conversion may be applied to another data node.
- Step S705 is substantially the same as S505 in FIG.
- By changing the placement destination node and writing part of the data to disk, or storing it in separately prepared archive storage, the capacity efficiency of the system can be improved and the storage cost reduced.
- For example, when a time-series distributed placement strategy is adopted, old data (B1, B2) may be written to disk (C1, C2) or to another archive, and only new data (B3: the latest partitioned table) may be held in memory (C3).
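- A minimal sketch of such time-series tiering might look as follows; the tier names and age cutoffs are illustrative assumptions rather than values from the embodiment.

```python
# Hedged sketch: partitions older than the newest one are demoted from
# memory to disk or archive according to their age rank.

def choose_medium(age_rank):
    """age_rank 0 = newest partition (B3), larger = older (B2, B1, ...)."""
    if age_rank == 0:
        return "memory"    # C3: the latest partition stays in memory
    elif age_rank <= 2:
        return "disk"      # C1, C2: older partitions go to disk
    return "archive"       # very old partitions go to archive storage

if __name__ == "__main__":
    partitions = ["B3", "B2", "B1"]  # newest first
    for rank, name in enumerate(partitions):
        print(name, "->", choose_medium(rank))
```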
- the data arrangement specifying information 922 of the structure information holding unit 92 is configured, for example, as shown in FIG. 24.
- the data arrangement specifying information 922 includes, for the table identifier, information on the placement nodes, the distributed placement strategy, and the placement physical medium corresponding to each replica identifier.
- the table (A) in FIG. 24 is stored in order of table identifier, and information on the placement strategy (round robin, distribution by the value of column 1, time series, etc.) is specified as the distributed placement strategy.
- For example, replica identifier 2 of the table identifier "orders" is distributed across placement nodes 2-10 in time series, and the physical medium of the placement destination (memory, disk, etc.) is specified.
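- One way to model the data arrangement specifying information 922 is sketched below; the concrete node numbers, strategies, and media are assumed for illustration and do not reproduce FIG. 24.

```python
# Hedged sketch: per (table identifier, replica identifier), record the
# placement nodes, the distributed placement strategy, and the medium.

data_arrangement_specifying_info = {
    ("orders", 0): {"nodes": [1],                 "strategy": "round-robin",   "medium": "memory"},
    ("orders", 1): {"nodes": [3, 4],              "strategy": "column1-range", "medium": "memory"},
    ("orders", 2): {"nodes": list(range(2, 11)),  "strategy": "time-series",   "medium": "disk"},
}

def placement_for(table_id, replica_id):
    return data_arrangement_specifying_info[(table_id, replica_id)]

if __name__ == "__main__":
    # Replica identifier 2 of "orders": nodes 2-10, time series, on disk.
    print(placement_for("orders", 2))
```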
- For the consistent hashing method, a hash value is obtained using a hash function that takes as its argument a character string combining the key value, the column identifier, and the table identifier, and the data placement destination data node is determined by consistent hashing from the hash value and the storage destination node list information.
- Alternatively, hashing may be performed using the key value of the record to determine the data placement node; that is, the data placement node is determined using a key value or a unique record ID.
- As the argument to the hash function in the consistent hashing method, a character string combining table identifier + column name + Key value (table identifier: tableA + column identifier: Value2 + Key value: acc) is passed, and a hash value is calculated.
- the data node can be determined by the consistent hashing method from the output (hash value) of the hash function for the argument and the information of the storage destination node list (for example, data nodes 1 to 4).
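- A compact sketch of this placement rule is shown below. It hashes the combined string of table identifier, column identifier, and key value and picks the first node clockwise on the ring; the choice of MD5 and the single token per node are simplifying assumptions, not details from the patent.

```python
# Hedged consistent-hashing sketch: hash "table:column:key" and choose the
# first node clockwise on a ring built from the storage destination node list.
import bisect
import hashlib

def hash_value(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, node_list):
        # Place each node from the storage destination node list on the ring.
        self.tokens = sorted((hash_value(f"node:{n}"), n) for n in node_list)

    def placement_node(self, table_id, column_id, key):
        h = hash_value(f"{table_id}:{column_id}:{key}")
        ring = [t for t, _ in self.tokens]
        i = bisect.bisect(ring, h) % len(self.tokens)  # first node clockwise
        return self.tokens[i][1]

if __name__ == "__main__":
    ring = ConsistentHashRing(["data-node-1", "data-node-2",
                               "data-node-3", "data-node-4"])
    # e.g. table identifier tableA, column identifier Value2, key value acc
    print(ring.placement_node("tableA", "Value2", "acc"))
```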
- FIGS. 27A and 27B are diagrams for explaining the recording method of the data placement node in this embodiment. Since the format is a column store, data is recorded for each column.
- the outer rectangle is a management unit of the recording area of the data arrangement node, and corresponds to, for example, a page of a memory or HDD (hard disk drive).
- the page size may be arbitrary.
- Management information designating a table identifier (tableA) and a column name (value1) is recorded at an arbitrary location in the page (at the end in the figure). When the data of one column does not fit on one page, it must be recorded in another unit; pointer information to the other units may be recorded in this storage area.
- the cell value is stored at an arbitrary address in the page. In FIG. 27A, cell values (each value of column name value1) are recorded in order from the top of the page.
- In FIG. 27A, an index is recorded immediately before the management information in the same unit; it records the key information (or a unique record ID) and the address (pointer) at which the corresponding cell value is stored.
- For example, the entry (Key: cc, #8) records that the cell value for key cc is stored at address #8, (Key: ab, #4) that the cell value for key ab is stored at address #4, and (Key: aa, #0) that the cell value for key aa is stored at address #0.
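- The in-page layout of FIG. 27A can be sketched roughly as follows; the page class, addressing by list index, and the linear index scan are simplifying assumptions for illustration.

```python
# Hedged sketch of a column-store page: cell values fill from the top of
# the page, a key->address index grows just before the management
# information at the end of the page.

class ColumnPage:
    def __init__(self, table_id, column_name):
        self.management_info = (table_id, column_name)  # recorded at page end
        self.cells = []   # cell values, filled from the top of the page
        self.index = []   # (key, address) entries, kept before mgmt info

    def insert(self, key, value):
        address = len(self.cells)        # next free slot from the top
        self.cells.append(value)
        self.index.append((key, address))

    def lookup(self, key):
        for k, addr in self.index:       # scan the key index
            if k == key:
                return self.cells[addr]
        return None

if __name__ == "__main__":
    page = ColumnPage("tableA", "value1")
    for k, v in [("aa", 100), ("ab", 200), ("cc", 300)]:
        page.insert(k, v)
    print(page.lookup("cc"))  # 300, stored at address 2 in this sketch
```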
- information of another column (value 2) of the same table may be recorded in another recording management unit (memory or HDD).
- Alternatively, the data may be divided and arranged by a simpler method.
- A set having a key value and one or a plurality of data records for each column corresponding to the key value is used as the unit in the row direction, and rows are identified by the key value.
- As partitioning of the column store, a hash value may be obtained by a hash function taking as its argument a character string combining the table identifier and the column identifier, the data placement destination data node may be determined by consistent hashing from the hash value and the storage destination node list information, and the data may be distributed to different data nodes in units of columns. The partitioned units may be stored in different data structures on the different data nodes.
- FIG. 28 is a diagram schematically showing, as partitioning of the table, a case where the data placement nodes are distributed for each column of the table. It is only necessary to pass a table identifier and a column name (for example, (tableA: value2) or (tableA: value3)) as the value given to the hash function. The storage node is calculated from the output (hash value) of the hash function for the argument.
- Similarly, for a table in which a set having a key value and one or a plurality of data records for each column corresponding to the key value is used as the unit in the row direction and rows are identified by the key value, it is also possible to obtain a hash value with a hash function taking as its argument a character string combining the table identifier, the column identifier, and a unique suffix, determine the data placement destination data node by consistent hashing from the hash value and the storage destination node list information, and thereby distribute a single column across a plurality of data nodes. The partitioned units may be stored in different data structures on the plurality of placement destination data nodes.
- FIG. 29 is a diagram schematically showing a case where one column of the table of FIG. 28 is partitioned into two.
- A unique suffix, such as a number, is appended to the table identifier and the column name in the value given as the argument of the hash function, so that a plurality of data placement nodes (storage nodes) can be obtained.
- For example, records whose key value is aa or acc are placed in data placement node (storage node) 1, and records whose key value is dd or ee are placed in data placement node (storage node) 2.
- The correspondence between key values and suffixes (or a value from which it can be calculated) is stored in the structure information holding unit 92.
- A suffix may instead be specified for each numerical range; for example, keys 1-100 are given the suffix 0 (and, as a result, are stored in storage node 1). Doing so reduces the amount of data that must be held and managed in the structure information holding unit 92.
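- A hedged sketch of this range-to-suffix mapping is given below. The key ranges, the suffix values, and the use of a simple modulo placement in place of the full consistent-hash ring are all assumptions made to keep the example short.

```python
# Hedged sketch: one column is split across nodes by hashing
# "table:column:suffix", where the suffix is derived from the key's range,
# so only the range-to-suffix map needs to be held centrally.
import hashlib

RANGE_TO_SUFFIX = [            # ((inclusive low, inclusive high), suffix)
    (("a", "c"), 0),           # keys starting a..c get suffix 0
    (("d", "z"), 1),           # keys starting d..z get suffix 1
]

def suffix_for(key):
    for (lo, hi), suffix in RANGE_TO_SUFFIX:
        if lo <= key[0] <= hi:
            return suffix
    raise KeyError(key)

def placement_node(table_id, column_id, key, node_list):
    arg = f"{table_id}:{column_id}:{suffix_for(key)}"
    h = int(hashlib.md5(arg.encode()).hexdigest(), 16)
    return node_list[h % len(node_list)]  # simplified stand-in for the ring

if __name__ == "__main__":
    nodes = ["storage-node-1", "storage-node-2"]
    for key in ["aa", "acc", "dd", "ee"]:
        print(key, "->", placement_node("tableA", "value2", key, nodes))
```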
- The partitioning of the column store method has been described above, but the same applies to the row store method; in that case, a key value or the like is used instead of the column identifier.
- Further, the plurality of data placement nodes participating in the distributed storage system may be divided into groups corresponding to the system operating state, and the data placement node that receives a data write request may create data replicas according to the number of replications defined for each group. In this case, the number of data copies to be created for each group is determined, a hash ring on which the plurality of data placement nodes are logically placed is traced, replication destinations are searched until the specified number of copies for each group is reached, and a list of replication destination data placement nodes is created. The node may then receive the list of replication destination data placement nodes and issue a replication command to each data placement node in the list.
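- The group-aware walk over the hash ring might be sketched as follows; the group names, copy counts, and ring layout are illustrative assumptions.

```python
# Hedged sketch: trace the hash ring clockwise from the node that received
# the write, collecting destinations until each group's copy count is met.

def replication_destinations(ring_nodes, start_index, copies_per_group, node_group):
    """ring_nodes: nodes in ring order; node_group: node -> group name."""
    remaining = dict(copies_per_group)        # e.g. {"active": 2, "standby": 1}
    destinations = []
    n = len(ring_nodes)
    for step in range(1, n + 1):              # walk the ring clockwise
        node = ring_nodes[(start_index + step) % n]
        group = node_group[node]
        if remaining.get(group, 0) > 0:
            destinations.append(node)
            remaining[group] -= 1
        if all(v == 0 for v in remaining.values()):
            break
    return destinations

if __name__ == "__main__":
    ring = ["n1", "n2", "n3", "n4", "n5", "n6"]
    groups = {"n1": "active", "n2": "standby", "n3": "active",
              "n4": "active", "n5": "standby", "n6": "active"}
    # Node n1 received the write; find where its replicas go.
    print(replication_destinations(ring, 0, {"active": 2, "standby": 1}, groups))
```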
- A company's information system is realized using a distributed storage system or database system. A system that provides services central to the company's business is called a "core system" or "core business system"; sales and inventory management systems, cash-register POS (point of sale) systems, and the like are included.
- A data warehouse is known as a system that performs data analysis in order to use (and sometimes aggregate) the information of these core systems for corporate decision making. Since these systems (the core system and the data warehouse) generally have different data access characteristics, it has been common practice to prepare a database system and specialize its data structures to suit each access characteristic (so as to perform high-speed processing).
- The data warehouse system includes, for example, a large-scale database for extracting and reconstructing data (for example, transaction data) from a plurality of core systems in order to analyze information and make decisions. Data migration from the core system databases to the data warehouse database is necessary, and this process is called ETL (Extract/Transform/Load). ETL is known to become a heavy load as the amount of data in both the core systems and the data warehouse grows; by applying the present invention, the bottleneck of data structure conversion is eliminated and storage usage efficiency can be increased.
- the data storage system according to the present invention can be applied to parallel databases, parallel data processing systems, distributed storage, parallel file systems, distributed databases, data grids, and cluster computers.
- (Appendix 1) A distributed storage system comprising a plurality of network-coupled data nodes each having a data storage unit, wherein the data nodes to which data is replicated include at least two data nodes that hold, in their respective data storage units, data structures that are logically the same among the data nodes but physically different.
- (Appendix 2) The distributed storage system according to appendix 1, wherein, in the replication destination data node, conversion to the target data structure is performed asynchronously with the reception of the replicated data.
- (Appendix 3) The distributed storage system according to appendix 2, wherein the replication destination data node holds the replicated data in an intermediate data holding structure and returns a response, and asynchronously converts the data structure held in the intermediate data holding structure into the target data structure.
- (Appendix 4) The distributed storage system according to appendix 2, comprising means for variably controlling, in predetermined table units, the data node at which data is placed, the data structure at the placement destination, and the data division.
- (Appendix 5) The distributed storage system according to any one of appendices 1 to 4, comprising means for determining, by consistent hashing, the data node on which data is placed.
- (Appendix 6) The distributed storage system according to any one of appendices 1 to 5, wherein, in the data replication performed at the time of a data update, each replication destination data node converts the data subject to the update request into a data structure different from the data structure in the designated database and stores the data in the data storage unit; at that time, the data node temporarily holds the data to be updated in an intermediate data holding structure, returns a response to the update, and, asynchronously with the update request, converts the data into the target data structure and stores it.
- (Appendix 7) The distributed storage system according to any one of appendices 1 to 6, comprising: a structure information management device having a structure information holding unit that stores and manages data structure management information, which comprises, in correspondence with a table identifier that identifies the data to be stored, a replica identifier that identifies a replica, data structure information that specifies the type of data structure corresponding to the replica identifier, and update trigger information, which is time information until the data is converted into the designated data structure and stored, provided for as many types of data structure as there are, and data arrangement specifying information comprising, in correspondence with the table identifier, the replica identifier and data node information of one or more data placement destinations corresponding to the replica identifier; a client function realization unit comprising a data access unit that identifies the access destinations of update processing and reference processing with reference to the data structure management information and the data arrangement specifying information; and a plurality of the data nodes, each having the data storage unit, connected to the structure information management device and the client function realization unit, wherein the data node comprises: a data management/processing unit that, when performing update processing based on an access request from the client function realization unit, holds the data in an intermediate data holding structure and returns a response to the client function realization unit; and a data structure conversion unit that refers to the data structure management information and, in response to the designated update trigger, converts the data held in the intermediate data holding structure into the data structure designated in the data structure management information.
- (Appendix 8) The distributed storage system according to appendix 7, wherein the intermediate data holding structure holds the data until the data is stored in the data storage unit in the designated target data structure.
- (Appendix 9) The distributed storage system according to appendix 7, wherein the client function realization unit selects the access destination data node from the data structure management information and the data arrangement specifying information according to the content of the update processing or the reference processing.
- (Appendix 10) The distributed storage system according to appendix 7, wherein the client function realization unit obtains the data arrangement specifying information held in the structure information holding unit of the structure information management device, or the data arrangement specifying information held in a structure information cache holding unit that caches the information held in the structure information holding unit, and issues an access command to the data placement destination data node.
- (Appendix 11) The distributed storage system according to appendix 7 or 10, wherein the data node comprises an access receiving unit, an access processing unit, and a data structure conversion unit; the data storage unit of the data node comprises structure-specific data storage units; the access receiving unit receives an update request from the client function realization unit and transfers the update request to the data nodes designated in correspondence with the replica identifiers in the data arrangement specifying information; the access processing unit of the data node processes the received update request and executes update processing with reference to the data structure management information, and at that time, if, according to the data structure management information, the update trigger for the data node is zero, converts the update data into the data structure designated in the data structure management information and updates the structure-specific data storage unit, and if the update trigger is not zero, temporarily writes the update data into the intermediate data holding structure and responds that processing is complete; the access receiving unit, upon receiving the completion notification from the access processing unit and the completion notifications from the replica destination data nodes, responds to the client function realization unit; and the data structure conversion unit converts the data in the intermediate data holding structure into the data structure designated in the data structure management information and stores it in the conversion destination structure-specific data storage unit.
- (Appendix 12) The distributed storage system according to appendix 7, wherein, in the case of reference-type access, the client function realization unit selects a data structure suitable for the data access to be performed on the data node, identifies the replica identifier, then calculates the data node to be accessed, issues an access request to the selected data node, and receives the access processing result from the data node.
- (Appendix 13) The distributed storage system according to appendix 7, wherein the client function realization unit is arranged within the data node.
- (Appendix 14) The distributed storage system according to appendix 13, wherein the client function realization unit comprises a structure information cache holding unit that caches the information held in the structure information holding unit.
- (Appendix 15) The distributed storage system according to appendix 14, comprising a structure information synchronization unit that synchronizes the structure information in the structure information cache holding unit of the client function realization unit with the structure information held in the structure information holding unit of the structure information management device.
- (Appendix 16) The distributed storage system according to appendix 7, wherein the data structure management information comprises, in correspondence with a replica identifier, a partition number, which is the number of divisions by which data is divided and stored on a plurality of data nodes; the data arrangement specifying information includes a plurality of data nodes as the placement nodes corresponding to a replica identifier whose partition number is 2 or more in the data structure management information; and the access receiving unit of the data node that has received an access request issues an access request to the access processing units of the other data nodes constituting the plurality of data nodes when the placement destination of the partitioned data spans the plurality of data nodes.
- (Appendix 17) The distributed storage system according to appendix 7 or 11, wherein the data structure conversion unit of the data node that has received an access request issues an access request to the data structure conversion units of the other data nodes when the update trigger is zero.
- (Appendix 18) The distributed storage system according to appendix 7, comprising: a history recording unit that records a history of access requests; and a change determination unit that determines, using the history information of the history recording unit, whether to convert the data structure.
- (Appendix 19) The distributed storage system according to appendix 18, wherein, when the change determination unit determines that conversion of the data structure is necessary, it outputs a conversion request to the structure information changing unit of the structure information management device; the structure information changing unit of the structure information management device changes the information in the structure information holding unit and outputs a conversion request to the data structure conversion unit of the data node; and the data structure conversion unit of the data node converts the data structure held in the data storage unit of the data node.
- (Appendix 20) A distributed storage method in a system comprising a plurality of network-coupled data nodes each having a data storage unit, wherein at least two of the data nodes to which data is replicated hold, in their respective data storage units, data structures that are logically the same among the data nodes but physically different.
- (Appendix 21) The distributed storage method according to appendix 20, wherein, in the replication destination data node, conversion to the target data structure is performed asynchronously with the reception of the replicated data.
- (Appendix 22) The distributed storage method according to appendix 21, wherein the replication destination data node holds the replicated data in an intermediate data holding structure and returns a response, and asynchronously converts the data structure held in the intermediate data holding structure into the target data structure.
- (Appendix 23) The distributed storage method according to appendix 21, wherein the data node at which data is placed, the data structure at the placement destination, and the data division are variably controlled in predetermined table units.
- (Appendix 24) The distributed storage method according to any one of appendices 20 to 23, wherein the data node on which data is placed is determined by consistent hashing.
- (Appendix 25) The distributed storage method according to any one of appendices 20 to 24, wherein, in the data replication performed at the time of a data update, each replication destination data node converts the data subject to the update request into a data structure different from the data structure in the designated target database and stores the data in the data storage unit; at that time, the data node temporarily holds the data to be updated in an intermediate structure, returns a response to the update, and, asynchronously with the update request, converts the data into the target data structure and stores it.
- (Appendix 26) The distributed storage method according to appendix 25, wherein structure information including data structure management information, which comprises, in correspondence with a table identifier that identifies the data to be stored, a replica identifier that identifies a replica, data structure information that specifies the type of data structure corresponding to the replica identifier, and update trigger information, which is time information until the data is converted into the designated data structure and stored, provided for as many types of data structure as there are, and data arrangement specifying information comprising, in correspondence with the table identifier, the replica identifier and data node information of one or more data placement destinations corresponding to the replica identifier, is stored and managed by a structure information management unit; on the client side, the access destinations of update processing and reference processing are identified with reference to the data structure management information and the data arrangement specifying information; and the data node, when performing update processing based on an access request from the client side, holds the data in an intermediate data holding structure, returns a response to the client, and, with reference to the data structure management information, converts the data from the intermediate data holding structure into the designated data structure in accordance with the designated update trigger.
- (Appendix 27) The distributed storage method according to appendix 26, wherein the data structure management information comprises, in correspondence with a replica identifier, a partition number, which is the number of divisions by which data is divided and stored on a plurality of data nodes; the data arrangement specifying information includes a plurality of data nodes as the placement nodes corresponding to a replica identifier whose partition number is 2 or more in the data structure management information; and the data node that has received an access request issues an access request to the other data nodes constituting the plurality of data nodes when the placement destination of the partitioned data spans the plurality of data nodes.
- (Appendix 28) The distributed storage method according to appendix 26, wherein whether to convert the data structure is determined using the history information of a history recording unit that records a history of access requests, and when conversion is necessary, the structure information is converted and the data structure of the data node is further converted.
- (Appendix 29) The distributed storage system according to appendix 5, wherein, for a table in which a set having a key value and one or a plurality of data records in one or a plurality of columns corresponding to the key value is used as the unit in the row direction and a column identifier is assigned to each column, a hash value is obtained by a hash function taking as its argument a character string combining the key value, the column identifier, and the table identifier, and the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing.
- (Appendix 30) The distributed storage system according to appendix 5, wherein, for a table in which a set having a key value and one or a plurality of data records in one or a plurality of columns corresponding to the key value is used as the unit in the row direction and a column identifier is assigned to each column, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier and the column identifier, the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing, and the data is distributed to separate data nodes in units of columns.
- (Appendix 31) The distributed storage system according to appendix 5, wherein, for a table in which a set having a key value and one or a plurality of data records in one or a plurality of columns corresponding to the key value is used as the unit in the row direction and a column identifier is assigned to each column, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier, the column identifier, and a unique suffix, the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing, and a single column is distributed across a plurality of data nodes.
- (Appendix 32) The distributed storage system according to appendix 5, wherein, for a table in which a set having one or a plurality of data records in one or a plurality of columns is used as the unit in the row direction, a column identifier is assigned to each column, and a unique record identifier is assigned to each record, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier, the column identifier, and the record identifier, and the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing.
- (Appendix 33) The distributed storage method according to appendix 24, wherein, for a table in which a set having a key value and one or a plurality of data records in one or a plurality of columns corresponding to the key value is used as the unit in the row direction and a column identifier is assigned to each column, a hash value is obtained by a hash function taking as its argument a character string combining the key value, the column identifier, and the table identifier, and the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing.
- (Appendix 34) The distributed storage method according to appendix 24, wherein, for such a table, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier and the column identifier, the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing, and the data is distributed to separate data nodes in units of columns.
- (Appendix 35) The distributed storage method according to appendix 24, wherein, for such a table, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier, the column identifier, and a unique suffix, the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing, and a single column is distributed across a plurality of data nodes.
- (Appendix 36) The distributed storage method according to appendix 24, wherein, for a table in which a set having one or a plurality of data records in one or a plurality of columns is used as the unit in the row direction, a column identifier is assigned to each column, and a unique record identifier is assigned to each record, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier, the column identifier, and the record identifier, and the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing.
Abstract
Description
The present invention is based upon and claims the benefit of the priority of Japanese Patent Application No. 2011-050151 (filed on March 8, 2011), the entire disclosure of which is incorporated herein by reference.
Distributed storage systems are in use, in which a plurality of computers (data nodes, or simply "nodes") are network-coupled and data is stored in and used from the data storage unit (HDD (Hard Disk Drive), memory, etc.) of each computer.
In such a system, decisions such as:
- on which computer (node) to place the data, and
- on which computer (node) to perform the processing
are realized by software or special dedicated hardware; by dynamically changing this behavior according to the state of the system, the resource usage within the system is adjusted and the performance offered to system users (client computers) is improved.
In distributed storage systems, the meta-server method is known as one technique by which a client learns which node holds the data. In the meta-server method, a meta server composed of one or a plurality of (but a small number of) computers is provided to manage the location information of the data.
As another technique for learning the location of the node holding the data, there is a method of obtaining the location of the data using a distribution function (for example, a hash function). This kind of technique is called, for example, distributed KVS (Key Value Store).
In distributed storage systems, in order to ensure availability (the ability of the system to operate continuously), it is common practice to hold replicas of data on a plurality of nodes and to utilize the data replicas for load distribution.
In this way:
- different kinds of access loads can be handled at high speed, and
- the multiple replicas held for availability can be used for other purposes, enabling efficient use of data capacity.
As variations of the data structure:
- Row store:
  - append type (records are appended to the data storage area),
  - update type,
- Column store:
  - with or without compression,
- Write log (for example, a structure for appending update information in order to prioritize write performance),
- presence or absence of an index (index data for searches),
- whether the data storage order is sorted,
- presence or absence of partitioning, and the number of partitions,
- partitioning unit and algorithm,
a combination of such items is selected.
The holding and storage format of replicated data cannot be variably controlled with respect to:
- the placement location of the data,
- the data placement (internal) structure, and
- the storage method, that is, whether the data is stored in a distributed or centralized manner.
This means that it is necessary to perform:
- selection of an appropriate data structure,
- design of an appropriate schema, and
- appropriate selection and configuration of database software.
All of these demand from the user a high level of expertise in database systems and storage systems, so it is difficult in practice for the user side to perform them.
A first exemplary embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing an example of the system configuration of the first embodiment of the present invention. The system comprises data nodes 1 to 4, a network 5, a client node 6, and structure information management means (a structure information management device) 9.
FIG. 2 is a diagram showing the configuration of a modification of the first embodiment of the present invention. As shown in FIG. 2, client function realization means 61 is arranged in each of the data nodes 1, 2, 3, and 4.
FIG. 4 is a diagram explaining a configuration example of the first embodiment of the present invention in more detail. FIG. 4 shows a configuration centered on the data nodes 1 to 4 of FIG. 1. In FIG. 4 and the other drawings, for simplicity, the structure information stored in the structure information holding unit 92 is sometimes referred to by the reference numeral 92.
The data structure management information 921 is parameter information for specifying the data storage method for each set of data. FIG. 5 is a diagram showing an example of the data structure management information 921 of FIG. 4. Although not particularly limited, in this embodiment the unit in which the data storage method is controlled is the table. For each table (each table identifier), the replica identifier, the type of data structure, and the update trigger information are prepared for as many entries as there are data replicas.
As the data structures:
- A: queue,
- B: row store, and
- C: column store
are designated. A: the queue (QUEUE) is a linked list.
The data of replica identifiers 0 and 1 is held in data structure B (row store) (see (B) and (C) of FIG. 6), and the data of replica identifier 2 is held in data structure C (column store) (see (D) of FIG. 6).
In the example shown in FIG. 7, upon a Write, data structure A is stored and held in the intermediate structure for Write at the data node of replica identifier 0; the data of data structure A held in the intermediate structure for Write is replicated synchronously to the data nodes of replica identifiers 1 and 2, and at each of the data nodes of replica identifiers 1 and 2 the data of data structure A is likewise temporarily stored and held in the intermediate structure for Write. At the data nodes corresponding to replica identifiers 0, 1, and 2, the conversion to the target data structures B, B, and C, respectively, is designated by the update trigger information of the data structure management information 921 as shown in FIG. 5(A).
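The behavior of FIG. 5 and FIG. 7 can be sketched as follows. This is a minimal model rather than the patent's implementation: the trigger values, the table contents, and the target structures modeled as plain lists are assumptions made for illustration.

```python
# Hedged sketch: a Write lands in the intermediate structure A on every
# replica and is acknowledged synchronously; each replica later converts
# buffered entries to its target structure (B, B, C) per its update trigger.
import time

DATA_STRUCTURE_MANAGEMENT_INFO = {
    # (table_id, replica_id): (target structure, update trigger in seconds)
    ("WORKERS", 0): ("B:row-store", 0),      # trigger 0: convert immediately
    ("WORKERS", 1): ("B:row-store", 60),
    ("WORKERS", 2): ("C:column-store", 300),
}

class Replica:
    def __init__(self, table_id, replica_id):
        self.target, self.trigger = DATA_STRUCTURE_MANAGEMENT_INFO[(table_id, replica_id)]
        self.intermediate = []               # structure A (queue for Writes)

    def write(self, record):
        self.intermediate.append((time.time(), record))
        return "ok"                          # acknowledged before conversion

    def convert_due(self, now):
        # Asynchronous step: move entries older than the trigger into the
        # target structure (modeled here as a simple list).
        due = [r for t, r in self.intermediate if now - t >= self.trigger]
        self.intermediate = [(t, r) for t, r in self.intermediate if now - t < self.trigger]
        return {self.target: due}

if __name__ == "__main__":
    replicas = [Replica("WORKERS", i) for i in range(3)]
    for rep in replicas:                     # synchronous replication of a Write
        rep.write({"name": "sato", "dept": 10})
    print(replicas[2].convert_due(time.time() + 301))
```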
The data structures may also differ in, for example:
- the presence or absence of an index in the row store structure,
- the kinds of columns for which an index is created, or
- a row store format in which updates are stored in an append-only structure.
FIG. 8 shows an example of the data arrangement specifying information 922 of FIG. 4. A placement node is designated for each of the replica identifiers 0, 1, and 2 of each table identifier. This corresponds to the meta-server method described above.
In the case of the distributed KVS method, the data arrangement specifying information 922 corresponds to node list information (not shown) of the nodes participating in the distributed storage. By sharing this node list information among the data nodes, the placement node can be identified by the consistent hashing method using "table identifier" + "replica identifier" as the key information. A replica can also be stored on the adjacent node in the consistent hashing scheme as its placement destination. The consistent hashing method is described in the fourth embodiment.
The operation of the first embodiment of the present invention will now be described. FIG. 9 is a diagram showing the sequence of Write processing (processing involving an update) in the first embodiment of the present invention described with reference to FIGS. 1 to 8.
FIG. 11 is a diagram showing the sequence of reference-type processing (READ processing) in the first embodiment of the present invention. When data is read from a table A and the result of an operation using that data is used to update a table B, the data read from table A corresponds to reference-type processing.
FIG. 12 is a flowchart explaining the access processing operation from the viewpoint of the client function realization means 61. The access flow of the client is described with reference to FIG. 12. For example:
- an INSERT command (an SQL command that adds a record to a table) corresponds to Write processing, and
- a SELECT command (an SQL command that retrieves records from a table) corresponds to reference-type processing.
FIG. 13 is a flowchart explaining the access processing in the data node of FIG. 4. The operation of the data node is described in detail with reference to FIGS. 13 and 4.
FIG. 14 is a flowchart showing the operation of the data conversion processing in the data structure conversion means 113 of FIG. 4. The data conversion processing is described with reference to FIGS. 14 and 4.
A second embodiment of the present invention will be described. In the second embodiment of the present invention, data can be divided into a plurality of pieces in predetermined units and stored on a plurality of data nodes. The basic configuration of the system of this embodiment is as shown in FIGS. 1, 2, 4, and so on, but, as described with reference to FIGS. 15 and 16, the contents of the data structure management information 921 and the data arrangement specifying information 922 are extended in this embodiment. Further, as described with reference to FIG. 17, this embodiment differs from the first embodiment in that, when the access receiving means of a data node issues an access request to its access processing means, it also issues the access request to the access processing means of other data nodes, and in that the data structure conversion means issues change requests to the data structure conversion means of other data nodes. The configuration of the data node in this embodiment also basically follows FIG. 4; its details are described later with reference to FIG. 17.
For example, replica identifier 1 is divided and stored on node numbers 7-18. By storing the data such that:
- for replica identifier 0, node numbers 1-6 hold the first half of the column values and node numbers 7-12 hold the second half, and
- for replica identifier 1, node numbers 7-12 hold the first half of the column values and node numbers 13-18 hold the second half,
the same record is never stored on the same node. In this way, availability can be satisfied while allowing the allocation of placement nodes to overlap.
Next, a third embodiment of the present invention will be described. In this embodiment, the data structure management information 921 is changed according to the access load. By reflecting the changed values in the data structures of the system, it becomes possible to correct inappropriate data structure settings (the assignment of a data structure to each replica identifier as shown in FIG. 5) and to cope with changes in the access pattern after the system begins operation. The operation of autonomously changing the control parameters to realize this is described below.
FIG. 19 is a flowchart explaining the control operation in this embodiment shown in FIG. 18. By performing the operation of FIG. 19, for example, periodically, the data structures of the system can be changed and reflected autonomously. The execution cycle is arbitrary; if the cycle is lengthened, for example, consistency must be maintained with any change processing still in progress. Besides periodic execution, the change processing may be performed in response to the detection of a predetermined event. An event is, for example, the detection of a load change by any component of the system (for example, a large change in the hardware utilization, such as CPU or disk, of some data nodes).
Each rule has the if-then form <Condition> then <Action> (the action is executed when the condition is satisfied).
FIG. 20 is a flowchart explaining the data structure conversion operation in this embodiment.
FIG. 21 is a diagram explaining the processing within a data node during the conversion processing in this embodiment shown in FIG. 18. While the data structure conversion means 113 of FIG. 18 is converting the data structure (steps S502-S504), the access processing means 112 responds to access requests using data structure A and data structure B. During this time, update processing is held in data structure A (the intermediate structure for Write) and is not applied to data structure B (Row-Store) while the data structure conversion means 113 performs the conversion.
FIGS. 22 and 23 are flowcharts explaining the operation of changing the number of partitions in this embodiment. The partition number changing process can be expressed as the same flowchart as FIG. 19. In the following, FIG. 22 is described focusing on its differences from FIG. 19. In addition to the number of partitions, the distribution strategy may also be changed. An example of a change of distribution strategy is a change from round-robin distribution to distribution by the value range of an arbitrary column, or vice versa.
FIG. 23 shows the flow of step S605 of FIG. 22 (the partition number changing process by the conversion determination means 72). In the following, FIG. 23 is described focusing on its differences from FIG. 20.
An application example to consistent hashing is described as a fourth embodiment of the present invention. In the following, an example in which a table A is divided and placed in column store format by consistent hashing is described with reference to FIG. 26. In this embodiment, the processing of determining by consistent hashing the data node on which the data is placed (the data placement node) may be performed by the change determination means 72 of FIG. 18. The node information is recorded in the structure information holding unit 92 by the change determination means 72. Although not particularly limited, in this embodiment, a set having a key value (Key) and one or a plurality of data records for each column corresponding to the key value is used as the unit in the row direction, rows are identified by the key value (Key), and, for a table in which a column identifier (Value1, Value2, ...) is assigned to each column, a hash value is obtained by a hash function taking as its argument a character string combining the key value, the column identifier, and the table identifier; the data placement destination data node is then determined from the hash value and the storage destination node list information by consistent hashing.
A character string combining table identifier + column name + Key value (table identifier: tableA + column identifier: Value2 + Key value: acc) is passed, and the hash value is calculated. Table identifier + column name + record ID may also be used as the argument passed to the hash function.
(Appendix 1) A distributed storage system comprising a plurality of network-coupled data nodes each having a data storage unit, wherein the data nodes to which data is replicated include at least two data nodes that hold, in their respective data storage units, data structures that are logically the same among the data nodes but physically different.
(Appendix 2) The distributed storage system according to appendix 1, wherein, in the replication destination data node, conversion to the target data structure is performed asynchronously with the reception of the replicated data.
(Appendix 3) The distributed storage system according to appendix 2, wherein the replication destination data node holds the replicated data in an intermediate data holding structure and returns a response, and asynchronously converts the data structure held in the intermediate data holding structure into the target data structure.
(Appendix 4) The distributed storage system according to appendix 2, comprising means for variably controlling, in predetermined table units, the data node at which data is placed, the data structure at the placement destination, and the data division.
(Appendix 5) The distributed storage system according to any one of appendices 1 to 4, comprising means for determining, by consistent hashing, the data node on which data is placed.
(Appendix 6) The distributed storage system according to any one of appendices 1 to 5, wherein, in the data replication performed at the time of a data update, each replication destination data node converts the data subject to the update request into a data structure different from the data structure in the designated database and stores the data in the data storage unit; at that time, the data node temporarily holds the data to be updated in an intermediate data holding structure, returns a response to the update, and, asynchronously with the update request, converts the data into the target data structure and stores it.
(Appendix 7) The distributed storage system according to any one of appendices 1 to 6, comprising:
a structure information management device having a structure information holding unit that stores and manages data structure management information, which comprises, in correspondence with a table identifier that identifies the data to be stored, a replica identifier that identifies a replica, data structure information that specifies the type of data structure corresponding to the replica identifier, and update trigger information, which is time information until the data is converted into the designated data structure and stored, provided for as many types of data structure as there are, and data arrangement specifying information comprising, in correspondence with the table identifier, the replica identifier and data node information of one or more data placement destinations corresponding to the replica identifier;
a client function realization unit comprising a data access unit that identifies the access destinations of update processing and reference processing with reference to the data structure management information and the data arrangement specifying information; and
a plurality of the data nodes, each having the data storage unit, connected to the structure information management device and the client function realization unit,
wherein the data node comprises:
a data management/processing unit that, when performing update processing based on an access request from the client function realization unit, holds the data in an intermediate data holding structure and returns a response to the client function realization unit; and
a data structure conversion unit that refers to the data structure management information and, in response to the designated update trigger, performs processing to convert the data held in the intermediate data holding structure into the data structure designated in the data structure management information.
(Appendix 8) The distributed storage system according to appendix 7, wherein the intermediate data holding structure holds the data until the data is stored in the data storage unit in the designated target data structure.
(Appendix 9) The distributed storage system according to appendix 7, wherein the client function realization unit selects the access destination data node from the data structure management information and the data arrangement specifying information according to the content of the update processing or the reference processing.
(Appendix 10) The distributed storage system according to appendix 7, wherein the client function realization unit obtains the data arrangement specifying information held in the structure information holding unit of the structure information management device, or the data arrangement specifying information held in a structure information cache holding unit that caches the information held in the structure information holding unit, and issues an access command to the data placement destination data node.
(Appendix 11) The distributed storage system according to appendix 7 or 10, wherein the data node comprises an access receiving unit, an access processing unit, and a data structure conversion unit;
the data storage unit of the data node comprises structure-specific data storage units;
the access receiving unit receives an update request from the client function realization unit and transfers the update request to the data nodes designated in correspondence with the replica identifiers in the data arrangement specifying information;
the access processing unit of the data node processes the received update request and executes update processing with reference to the data structure management information, and at that time, if, according to the data structure management information, the update trigger for the data node is zero, converts the update data into the data structure designated in the data structure management information and updates the structure-specific data storage unit,
and if the update trigger is not zero, temporarily writes the update data into the intermediate data holding structure and responds that processing is complete;
the access receiving unit, upon receiving the completion notification from the access processing unit and the completion notifications from the replica destination data nodes, responds to the client function realization unit; and
the data structure conversion unit converts the data in the intermediate data holding structure into the data structure designated in the data structure management information and stores it in the conversion destination structure-specific data storage unit.
(Appendix 12) The distributed storage system according to appendix 7, wherein, in the case of reference-type access, the client function realization unit selects a data structure suitable for the data access to be performed on the data node, identifies the replica identifier, then calculates the data node to be accessed, issues an access request to the selected data node, and receives the access processing result from the data node.
(Appendix 13) The distributed storage system according to appendix 7, wherein the client function realization unit is arranged within the data node.
(Appendix 14) The distributed storage system according to appendix 13, wherein the client function realization unit comprises a structure information cache holding unit that caches the information held in the structure information holding unit.
(Appendix 15) The distributed storage system according to appendix 14, comprising a structure information synchronization unit that synchronizes the structure information in the structure information cache holding unit of the client function realization unit with the structure information held in the structure information holding unit of the structure information management device.
(Appendix 16) The distributed storage system according to appendix 7, wherein the data structure management information comprises, in correspondence with a replica identifier, a partition number, which is the number of divisions by which data is divided and stored on a plurality of data nodes;
the data arrangement specifying information includes a plurality of data nodes as the placement nodes corresponding to a replica identifier whose partition number is 2 or more in the data structure management information; and
the access receiving unit of the data node that has received an access request issues an access request to the access processing units of the other data nodes constituting the plurality of data nodes when the placement destination of the partitioned data spans the plurality of data nodes.
(Appendix 17) The distributed storage system according to appendix 7 or 11, wherein the data structure conversion unit of the data node that has received an access request issues an access request to the data structure conversion units of the other data nodes when the update trigger is zero.
(Appendix 18) The distributed storage system according to appendix 7, comprising:
a history recording unit that records a history of access requests; and
a change determination unit that determines, using the history information of the history recording unit, whether to convert the data structure.
(Appendix 19) The distributed storage system according to appendix 18, wherein, when the change determination unit determines that conversion of the data structure is necessary, it outputs a conversion request to the structure information changing unit of the structure information management device;
the structure information changing unit of the structure information management device changes the information in the structure information holding unit and outputs a conversion request to the data structure conversion unit of the data node; and
the data structure conversion unit of the data node converts the data structure held in the data storage unit of the data node.
(Appendix 20) A distributed storage method in a system comprising a plurality of network-coupled data nodes each having a data storage unit, wherein at least two of the data nodes to which data is replicated hold, in their respective data storage units, data structures that are logically the same among the data nodes but physically different.
(Appendix 21) The distributed storage method according to appendix 20, wherein, in the replication destination data node, conversion to the target data structure is performed asynchronously with the reception of the replicated data.
(Appendix 22) The distributed storage method according to appendix 21, wherein the replication destination data node holds the replicated data in an intermediate data holding structure and returns a response, and asynchronously converts the data structure held in the intermediate data holding structure into the target data structure.
(Appendix 23) The distributed storage method according to appendix 21, wherein the data node at which data is placed, the data structure at the placement destination, and the data division are variably controlled in predetermined table units.
(Appendix 24) The distributed storage method according to any one of appendices 20 to 23, wherein the data node on which data is placed is determined by consistent hashing.
(Appendix 25) The distributed storage method according to any one of appendices 20 to 24, wherein, in the data replication performed at the time of a data update, each replication destination data node converts the data subject to the update request into a data structure different from the data structure in the designated target database and stores the data in the data storage unit; at that time, the data node temporarily holds the data to be updated in an intermediate structure, returns a response to the update, and, asynchronously with the update request, converts the data into the target data structure and stores it.
(Appendix 26) The distributed storage method according to appendix 25, wherein structure information including:
data structure management information, which comprises, in correspondence with a table identifier that identifies the data to be stored, a replica identifier that identifies a replica, data structure information that specifies the type of data structure corresponding to the replica identifier, and update trigger information, which is time information until the data is converted into the designated data structure and stored, provided for as many types of data structure as there are; and
data arrangement specifying information comprising, in correspondence with the table identifier, the replica identifier and data node information of one or more data placement destinations corresponding to the replica identifier,
is stored and managed by a structure information management unit;
on the client side, the access destinations of update processing and reference processing are identified with reference to the data structure management information and the data arrangement specifying information; and
the data node,
when performing update processing based on an access request from the client side, holds the data in an intermediate data holding structure, returns a response to the client, and,
with reference to the data structure management information, converts the data from the intermediate data holding structure into the designated data structure in accordance with the designated update trigger.
(Appendix 27) The distributed storage method according to appendix 26, wherein the data structure management information comprises, in correspondence with a replica identifier, a partition number, which is the number of divisions by which data is divided and stored on a plurality of data nodes;
the data arrangement specifying information includes a plurality of data nodes as the placement nodes corresponding to a replica identifier whose partition number is 2 or more in the data structure management information; and
the data node that has received an access request issues an access request to the other data nodes constituting the plurality of data nodes when the placement destination of the partitioned data spans the plurality of data nodes.
(Appendix 28) The distributed storage method according to appendix 26, wherein whether to convert the data structure is determined using the history information of a history recording unit that records a history of access requests, and when conversion is necessary, the structure information is converted and the data structure of the data node is further converted.
(Appendix 29) The distributed storage system according to appendix 5, wherein, for a table in which a set having a key value and one or a plurality of data records in one or a plurality of columns corresponding to the key value is used as the unit in the row direction and a column identifier is assigned to each column, a hash value is obtained by a hash function taking as its argument a character string combining the key value, the column identifier, and the table identifier, and the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing.
(Appendix 30) The distributed storage system according to appendix 5, wherein, for such a table, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier and the column identifier, the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing, and the data is distributed to separate data nodes in units of columns.
(Appendix 31) The distributed storage system according to appendix 5, wherein, for such a table, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier, the column identifier, and a unique suffix, the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing, and a single column is distributed across a plurality of data nodes.
(Appendix 32) The distributed storage system according to appendix 5, wherein, for a table in which a set having one or a plurality of data records in one or a plurality of columns is used as the unit in the row direction, a column identifier is assigned to each column, and a unique record identifier is assigned to each record, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier, the column identifier, and the record identifier, and the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing.
(Appendix 33) The distributed storage method according to appendix 24, wherein, for a table in which a set having a key value and one or a plurality of data records in one or a plurality of columns corresponding to the key value is used as the unit in the row direction and a column identifier is assigned to each column, a hash value is obtained by a hash function taking as its argument a character string combining the key value, the column identifier, and the table identifier, and the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing.
(Appendix 34) The distributed storage method according to appendix 24, wherein, for such a table, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier and the column identifier, the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing, and the data is distributed to separate data nodes in units of columns.
(Appendix 35) The distributed storage method according to appendix 24, wherein, for such a table, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier, the column identifier, and a unique suffix, the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing, and a single column is distributed across a plurality of data nodes.
(Appendix 36) The distributed storage method according to appendix 24, wherein, for a table in which a set having one or a plurality of data records in one or a plurality of columns is used as the unit in the row direction, a column identifier is assigned to each column, and a unique record identifier is assigned to each record, a hash value is obtained by a hash function taking as its argument a character string combining the table identifier, the column identifier, and the record identifier, and the data placement destination data node is determined from the hash value and the storage destination node list information by consistent hashing.
5 Network
6 Client node
9 Structure information management means (structure information management device)
11, 21, 31, 41 Data management/processing means (data management/processing units)
12, 22, 32, 42 Data storage units
61 Client function realization means (client function realization unit)
71 History recording unit
72 Change determination means (change determination unit)
91 Structure information changing means (structure information changing unit)
92 Structure information holding unit
93 Structure information synchronization means (structure information synchronization unit)
101-104 Data node computers
101a CPU
101b Data storage device
101c Data transfer device
105 Network
111 Access receiving means (access receiving unit)
112 Access processing means (access processing unit)
113 Data structure conversion means (data structure conversion unit)
121, 122, 123, 12X Structure-specific data storage units
611 Data access means (data access unit)
612 Structure information cache holding unit
921 Data structure management information
922 Data arrangement specifying information
Claims (30)
- A distributed storage system comprising a plurality of network-coupled data nodes each having a data storage unit,
wherein the data nodes to which data is replicated include at least two data nodes that hold, in their respective data storage units, data structures that are logically the same among the data nodes but physically different. - The distributed storage system according to claim 1, wherein, in the replication destination data node, conversion to the target data structure is performed asynchronously with the reception of the replicated data.
- The distributed storage system according to claim 2, wherein the replication destination data node holds the replicated data in an intermediate data holding structure and returns a response, and asynchronously converts the data structure held in the intermediate data holding structure into the target data structure.
- The distributed storage system according to claim 2, comprising means for variably controlling, in predetermined table units, the data node at which data is placed, the data structure at the placement destination, and the data division.
- The distributed storage system according to any one of claims 1 to 4, comprising means for determining, by consistent hashing, the data node on which data is placed.
- The distributed storage system according to any one of claims 1 to 5, wherein, in the data replication performed at the time of a data update, each replication destination data node converts the data to be updated into the designated data structure and stores it in the data storage unit; at that time, the data node temporarily holds the data to be updated in an intermediate data holding structure, returns a response to the update, and, asynchronously with the update request, converts the data into the target data structure and stores it.
- A distributed storage system according to any one of claims 1 to 6, comprising:
a structure information management device having a structure information holding unit that stores and manages data structure management information, which comprises, in correspondence with a table identifier identifying the data to be stored, a replica identifier identifying a replica, data structure information specifying the type of data structure corresponding to the replica identifier, and update trigger information, which is time information until the data is converted into the designated data structure and stored, provided for as many types of data structure as there are, and data arrangement specifying information comprising, in correspondence with the table identifier, the replica identifier and data node information of one or more data placement destinations corresponding to the replica identifier;
a client function realization unit comprising a data access unit that identifies the access destinations of update processing and reference processing with reference to the data structure management information and the data arrangement specifying information; and
a plurality of the data nodes, each having the data storage unit, connected to the structure information management device and the client function realization unit,
wherein the data node comprises:
a data management/processing unit that, when performing update processing based on an access request from the client function realization unit, holds the data in an intermediate data holding structure and returns a response to the client function realization unit; and
a data structure conversion unit that refers to the data structure management information and, in response to the designated update trigger, converts the data held in the intermediate data holding structure into the data structure designated in the data structure management information. - The distributed storage system according to claim 7, wherein the intermediate data holding structure holds the data until the data is stored in the data storage unit in the designated target data structure.
- The distributed storage system according to claim 7, wherein the client function realization unit selects the access destination data node from the data structure management information and the data arrangement specifying information according to the content of the update processing or the reference processing.
- The distributed storage system according to claim 7, wherein the client function realization unit obtains the data arrangement specifying information held in the structure information holding unit of the structure information management device, or the data arrangement specifying information held in a structure information cache holding unit that caches the information held in the structure information holding unit, and issues an access command to the data placement destination data node.
- The distributed storage system according to claim 7 or 10, wherein the data node comprises an access receiving unit, an access processing unit, and a data structure conversion unit,
the data storage unit of the data node comprises structure-specific data storage units,
the access receiving unit receives an update request from the client function realization unit and transfers the update request to the data nodes designated in correspondence with the replica identifiers in the data arrangement specifying information,
the access processing unit of the data node processes the received update request and executes update processing with reference to the data structure management information; at that time, if, according to the data structure management information, the update trigger for the data node is zero, it converts the update data into the data structure designated in the data structure management information and updates the structure-specific data storage unit,
and if the update trigger is not zero, it temporarily writes the update data into the intermediate data holding structure and responds that processing is complete,
the access receiving unit, upon receiving the completion notification from the access processing unit and the completion notifications from the replica destination data nodes, responds to the client function realization unit, and
the data structure conversion unit converts the data in the intermediate data holding structure into the data structure designated in the data structure management information and stores it in the conversion destination structure-specific data storage unit. - The distributed storage system according to claim 7, wherein, in the case of reference-type access, the client function realization unit selects a data structure suitable for the data access to be performed on the data node, identifies the replica identifier, then calculates the data node to be accessed, issues an access request to the selected data node, and receives the access processing result from the data node.
- The distributed storage system according to claim 7, wherein the client function realization unit is arranged within the data node.
- The distributed storage system according to claim 13, wherein the client function realization unit comprises a structure information cache holding unit that caches the information held in the structure information holding unit.
- The distributed storage system according to claim 14, comprising a structure information synchronization unit that synchronizes the structure information in the structure information cache holding unit of the client function realization unit with the structure information held in the structure information holding unit of the structure information management device.
- The distributed storage system according to claim 7, wherein the data structure management information comprises, in correspondence with a replica identifier, a partition number, which is the number of divisions by which data is divided and stored on a plurality of data nodes,
the data arrangement specifying information includes a plurality of data nodes as the placement nodes corresponding to a replica identifier whose partition number is 2 or more in the data structure management information, and
the access receiving unit of the data node that has received an access request issues an access request to the access processing units of the other data nodes constituting the plurality of data nodes when the placement destination of the partitioned data spans the plurality of data nodes. - The distributed storage system according to claim 7 or 11, wherein the data structure conversion unit of the data node that has received an access request issues an access request to the data structure conversion units of the other data nodes when the update trigger is zero.
- The distributed storage system according to claim 7, comprising:
a history recording unit that records a history of access requests, and
a change determination unit that determines, using the history information of the history recording unit, whether to convert the data structure. - The distributed storage system according to claim 18, wherein, when the change determination unit determines that conversion of the data structure is necessary, it outputs a conversion request to the structure information changing unit of the structure information management device,
the structure information changing unit of the structure information management device changes the information in the structure information holding unit and outputs a conversion request to the data structure conversion unit of the data node, and
the data structure conversion unit of the data node converts the data structure held in the data storage unit of the data node. - A distributed storage method in a system comprising a plurality of network-coupled data nodes each having a data storage unit,
wherein at least two of the data nodes to which data is replicated hold, in their respective data storage units, data structures that are logically the same among the data nodes but physically different. - The distributed storage method according to claim 20, wherein, in the replication destination data node, conversion to the target data structure is performed asynchronously with the reception of the replicated data.
- The distributed storage method according to claim 21, wherein the replication destination data node holds the replicated data in an intermediate data holding structure and returns a response, and asynchronously converts the data structure held in the intermediate data holding structure into the target data structure.
- The distributed storage method according to claim 21, wherein the data node at which data is placed, the data structure at the placement destination, and the data division are variably controlled in predetermined table units.
- The distributed storage method according to any one of claims 20 to 23, wherein the data node on which data is placed is determined by consistent hashing.
- The distributed storage method according to any one of claims 20 to 24, wherein, in the data replication performed at the time of a data update, each replication destination data node converts the data to be updated into the designated target data structure and stores it in the data storage unit; at that time, the data node temporarily holds the data to be updated in an intermediate structure, returns a response to the update, and, asynchronously with the update request, converts the data into the target data structure and stores it.
- The distributed storage method according to claim 25, comprising storing and managing, in a structure information management unit, structure information including:
data structure management information, which comprises, in correspondence with a table identifier identifying the data to be stored, a replica identifier identifying a replica, data structure information specifying the type of data structure corresponding to the replica identifier, and update trigger information, which is time information until the data is converted into the designated data structure and stored, provided for as many types of data structure as there are; and
data arrangement specifying information comprising, in correspondence with the table identifier, the replica identifier and data node information of one or more data placement destinations corresponding to the replica identifier,
wherein, on the client side, the access destinations of update processing and reference processing are identified with reference to the data structure management information and the data arrangement specifying information, and
the data node,
when performing update processing based on an access request from the client side, holds the data in an intermediate data holding structure, returns a response to the client, and,
with reference to the data structure management information, converts the data from the intermediate data holding structure into the designated data structure in accordance with the designated update trigger. - The distributed storage method according to claim 26, wherein the data structure management information comprises, in correspondence with a replica identifier, a partition number, which is the number of divisions by which data is divided and stored on a plurality of data nodes,
the data arrangement specifying information includes a plurality of data nodes as the placement nodes corresponding to a replica identifier whose partition number is 2 or more in the data structure management information, and
the data node that has received an access request issues an access request to the other data nodes constituting the plurality of data nodes when the placement destination of the partitioned data spans the plurality of data nodes. - The distributed storage method according to claim 26, wherein whether to convert the data structure is determined using the history information of a history recording unit that records a history of access requests, and when conversion is necessary, the structure information is converted and the data structure of the data node is further converted.
- A data node device comprising a data storage unit and network-coupled to other data nodes, a plurality of such data nodes constituting a distributed storage system,
wherein, when data to be updated is replicated to a plurality of data nodes, the data node device holds in its data storage unit, with respect to that data, a data structure that is logically the same as, but physically different from, that of at least one other data node. - The data node device according to claim 29, wherein the data to be updated is temporarily held in an intermediate data holding structure, a response to the update request is returned, and, asynchronously with the update request, the data is converted into the designated data structure and stored in the data storage unit.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/003,658 US9342574B2 (en) | 2011-03-08 | 2012-03-08 | Distributed storage system and distributed storage method |
JP2013503590A JP5765416B2 (ja) | Distributed storage system and method
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2011050151 | 2011-03-08 | ||
JP2011-050151 | 2011-03-08 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2012121316A1 true WO2012121316A1 (ja) | 2012-09-13 |
Family
ID=46798271
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2012/055917 WO2012121316A1 (ja) | Distributed storage system and method
Country Status (3)
Country | Link |
---|---|
US (1) | US9342574B2 (ja) |
JP (1) | JP5765416B2 (ja) |
WO (1) | WO2012121316A1 (ja) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5342055B1 (ja) * | 2012-10-30 | 2013-11-13 | 株式会社東芝 | Storage device and data backup method
JP2015095246A (ja) * | 2013-11-14 | 2015-05-18 | 日本電信電話株式会社 | Information processing system, management device, server device, and key assignment program
WO2015118865A1 (ja) * | 2014-02-05 | 2015-08-13 | 日本電気株式会社 | Information processing device, information processing system, and data access method
JP2015184685A (ja) * | 2014-03-20 | 2015-10-22 | 日本電気株式会社 | Information storage system
US10318506B2 (en) | 2015-03-18 | 2019-06-11 | Nec Corporation | Database system
CN111988359A (zh) * | 2020-07-15 | 2020-11-24 | 中科物缘科技(杭州)有限公司 | Data fragment synchronization method and system based on message queue
Families Citing this family (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9609060B2 (en) * | 2011-08-02 | 2017-03-28 | Nec Corporation | Distributed storage system and method |
US9069835B2 (en) * | 2012-05-21 | 2015-06-30 | Google Inc. | Organizing data in a distributed storage system |
US9774676B2 (en) | 2012-05-21 | 2017-09-26 | Google Inc. | Storing and moving data in a distributed storage system |
US9230000B1 (en) | 2012-06-04 | 2016-01-05 | Google Inc. | Pipelining Paxos state machines |
US9747310B2 (en) | 2012-06-04 | 2017-08-29 | Google Inc. | Systems and methods of increasing database access concurrency using granular timestamps |
US9449006B2 (en) | 2012-06-04 | 2016-09-20 | Google Inc. | Method and system for deleting obsolete files from a file system |
US9298576B2 (en) | 2012-06-04 | 2016-03-29 | Google Inc. | Collecting processor usage statistics |
US9195611B2 (en) | 2012-06-04 | 2015-11-24 | Google Inc. | Efficiently updating and deleting data in a data storage system |
US9659038B2 (en) | 2012-06-04 | 2017-05-23 | Google Inc. | Efficient snapshot read of a database in a distributed storage system |
CN103514229A (zh) | 2012-06-29 | 2014-01-15 | 国际商业机器公司 | Method and apparatus for processing database data in a distributed database system
KR102044023B1 (ko) * | 2013-03-14 | 2019-12-02 | 삼성전자주식회사 | Key-value based data storage system and method of operating the same
US9430545B2 (en) | 2013-10-21 | 2016-08-30 | International Business Machines Corporation | Mechanism for communication in a distributed database |
EP3048775B1 (en) * | 2014-05-29 | 2018-03-14 | Huawei Technologies Co. Ltd. | Service processing method, related device and system |
US10157214B1 (en) * | 2014-11-19 | 2018-12-18 | Amazon Technologies, Inc. | Process for data migration between document stores |
US9727742B2 (en) * | 2015-03-30 | 2017-08-08 | Airbnb, Inc. | Database encryption to provide write protection |
US20180075122A1 (en) * | 2015-04-06 | 2018-03-15 | Richard Banister | Method to Federate Data Replication over a Communications Network |
US9619210B2 (en) | 2015-05-14 | 2017-04-11 | Walleye Software, LLC | Parsing and compiling data system queries |
US10078562B2 (en) | 2015-08-18 | 2018-09-18 | Microsoft Technology Licensing, Llc | Transactional distributed lifecycle management of diverse application data structures |
US10838827B2 (en) | 2015-09-16 | 2020-11-17 | Richard Banister | System and method for time parameter based database restoration |
US10990586B2 (en) | 2015-09-16 | 2021-04-27 | Richard Banister | System and method for revising record keys to coordinate record key changes within at least two databases |
US20170270149A1 (en) * | 2016-03-15 | 2017-09-21 | Huawei Technologies Co., Ltd. | Database systems with re-ordered replicas and methods of accessing and backing up databases |
US10691723B2 (en) * | 2016-05-04 | 2020-06-23 | Huawei Technologies Co., Ltd. | Distributed database systems and methods of distributing and accessing data |
US10453076B2 (en) * | 2016-06-02 | 2019-10-22 | Facebook, Inc. | Cold storage for legal hold data |
JP6235082B1 (ja) * | 2016-07-13 | 2017-11-22 | ヤフー株式会社 | Data classification device, data classification method, and program
US10209901B2 (en) * | 2017-01-04 | 2019-02-19 | Walmart Apollo, Llc | Systems and methods for distributive data storage |
CN106855892A (zh) * | 2017-01-13 | 2017-06-16 | Guizhou Baishan Cloud Technology Co., Ltd. | Data processing method and device |
CN107169075A (zh) * | 2017-05-10 | 2017-09-15 | Shenzhen Dapu Microelectronics Co., Ltd. | Data access method, storage device, and storage system based on feature analysis |
US10198469B1 (en) | 2017-08-24 | 2019-02-05 | Deephaven Data Labs Llc | Computer data system data source refreshing using an update propagation graph having a merged join listener |
US10671482B2 (en) * | 2017-09-12 | 2020-06-02 | Cohesity, Inc. | Providing consistency in a distributed data store |
WO2020091181A1 (ko) * | 2018-11-01 | 2020-05-07 | LG Electronics Inc. | Method and apparatus for transmitting and receiving a signal in a wireless communication system |
US10796276B1 (en) * | 2019-04-11 | 2020-10-06 | Caastle, Inc. | Systems and methods for electronic platform for transactions of wearable items |
US11194769B2 (en) | 2020-04-27 | 2021-12-07 | Richard Banister | System and method for re-synchronizing a portion of or an entire source database and a target database |
US11847100B2 (en) | 2020-11-19 | 2023-12-19 | Alibaba Group Holding Limited | Distributed file system servicing random-access operations |
US11681664B2 (en) * | 2021-07-16 | 2023-06-20 | EMC IP Holding Company LLC | Journal parsing for object event generation |
Family Cites Families (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3911810B2 (ja) | 1998-01-07 | 2007-05-09 | Fuji Xerox Co., Ltd. | Information distribution system and portable storage medium |
US20010044879A1 (en) * | 2000-02-18 | 2001-11-22 | Moulton Gregory Hagan | System and method for distributed management of data storage |
US20020138559A1 (en) * | 2001-01-29 | 2002-09-26 | Ulrich Thomas R. | Dynamically distributed file system |
JP4704659B2 (ja) * | 2002-04-26 | 2011-06-15 | Hitachi, Ltd. | Control method for storage device system and storage control device |
US7284010B2 (en) * | 2003-10-23 | 2007-10-16 | Microsoft Corporation | System and method for storing and retrieving a field of a user defined type outside of a database store in which the type is defined |
US7500020B1 (en) * | 2003-12-31 | 2009-03-03 | Symantec Operating Corporation | Coherency of replicas for a distributed file sharing system |
US20050278385A1 (en) * | 2004-06-10 | 2005-12-15 | Hewlett-Packard Development Company, L.P. | Systems and methods for staggered data replication and recovery |
JP4528039B2 (ja) | 2004-06-29 | 2010-08-18 | Tokyo Institute of Technology | Autonomous storage device, autonomous storage system, network load balancing program, and network load balancing method |
US7457835B2 (en) * | 2005-03-08 | 2008-11-25 | Cisco Technology, Inc. | Movement of data in a distributed database system to a storage location closest to a center of activity for the data |
JP4668763B2 (ja) * | 2005-10-20 | 2011-04-13 | Hitachi, Ltd. | Storage device restoration method and storage device |
US7325111B1 (en) * | 2005-11-01 | 2008-01-29 | Network Appliance, Inc. | Method and system for single pass volume scanning for multiple destination mirroring |
US7653668B1 (en) * | 2005-11-23 | 2010-01-26 | Symantec Operating Corporation | Fault tolerant multi-stage data replication with relaxed coherency guarantees |
US7716180B2 (en) * | 2005-12-29 | 2010-05-11 | Amazon Technologies, Inc. | Distributed storage system with web services client interface |
US7885923B1 (en) * | 2006-06-30 | 2011-02-08 | Symantec Operating Corporation | On demand consistency checkpoints for temporal volumes within consistency interval marker based replication |
US8019727B2 (en) * | 2007-09-26 | 2011-09-13 | Symantec Corporation | Pull model for file replication at multiple data centers |
US8620861B1 (en) * | 2008-09-30 | 2013-12-31 | Google Inc. | Preserving file metadata during atomic save operations |
US8055615B2 (en) * | 2009-08-25 | 2011-11-08 | Yahoo! Inc. | Method for efficient storage node replacement |
US8886602B2 (en) * | 2010-02-09 | 2014-11-11 | Google Inc. | Location assignment daemon (LAD) for a distributed storage system |
US8805783B2 (en) * | 2010-05-27 | 2014-08-12 | Microsoft Corporation | Synchronization of subsets of data including support for varying set membership |
2012
- 2012-03-08 US US14/003,658 patent/US9342574B2/en active Active
- 2012-03-08 WO PCT/JP2012/055917 patent/WO2012121316A1/ja active Application Filing
- 2012-03-08 JP JP2013503590A patent/JP5765416B2/ja active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010146067A (ja) * | 2008-12-16 | 2010-07-01 | Fujitsu Ltd | Data processing program, server device, and data processing method |
JP2011008711A (ja) * | 2009-06-29 | 2011-01-13 | Brother Industries Ltd | Node device, processing program, and distributed storage method |
Non-Patent Citations (2)
Title |
---|
AVINASH LAKSHMAN ET AL.: "Cassandra - A Decentralized Structured Storage System", ACM SIGOPS OPERATING SYSTEMS REVIEW, vol. 44, no. 2, April 2010 (2010-04-01), pages 35 - 40 * |
SHUNSUKE NAKAMURA ET AL.: "A Cloud Storage Adaptable to Read-Intensive and Write-Intensive Workload", IPSJ SIG NOTES, vol. 2011-116, no. 9, 15 February 2011 (2011-02-15), pages 1 - 7 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5342055B1 (ja) * | 2012-10-30 | 2013-11-13 | Toshiba Corporation | Storage device and data backup method |
WO2014069007A1 (ja) * | 2012-10-30 | 2014-05-08 | Toshiba Corporation | Storage device and data backup method |
JP2015095246A (ja) * | 2013-11-14 | 2015-05-18 | Nippon Telegraph and Telephone Corporation | Information processing system, management device, server device, and key assignment program |
WO2015118865A1 (ja) * | 2014-02-05 | 2015-08-13 | NEC Corporation | Information processing device, information processing system, and data access method |
JP2015184685A (ja) * | 2014-03-20 | 2015-10-22 | NEC Corporation | Information storage system |
US10095737B2 (en) | 2014-03-20 | 2018-10-09 | Nec Corporation | Information storage system |
US10318506B2 (en) | 2015-03-18 | 2019-06-11 | Nec Corporation | Database system |
CN111988359A (zh) * | 2020-07-15 | 2020-11-24 | Zhongke Wuyuan Technology (Hangzhou) Co., Ltd. | Message-queue-based data shard synchronization method and system |
CN111988359B (zh) * | 2020-07-15 | 2023-08-15 | Digital Economy Industry Research Institute, Institute of Computing Technology, Chinese Academy of Sciences | Message-queue-based data shard synchronization method and system |
Also Published As
Publication number | Publication date |
---|---|
JPWO2012121316A1 (ja) | 2014-07-17 |
US20130346365A1 (en) | 2013-12-26 |
JP5765416B2 (ja) | 2015-08-19 |
US9342574B2 (en) | 2016-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5765416B2 (ja) | Distributed storage system and method | |
JP6044539B2 (ja) | Distributed storage system and method | |
US10997163B2 (en) | Data ingestion using file queues | |
US8335769B2 (en) | Executing replication requests for objects in a distributed storage system | |
CN106233275B (zh) | Data management system and method | |
US10853242B2 (en) | Deduplication and garbage collection across logical databases | |
CN102831120B (zh) | Data processing method and system | |
US10866970B1 (en) | Range query capacity allocation | |
US20060271653A1 (en) | Computer system | |
US10394782B2 (en) | Chord distributed hash table-based map-reduce system and method | |
US9330158B1 (en) | Range query capacity allocation | |
Davoudian et al. | A workload-adaptive streaming partitioner for distributed graph stores | |
Sundarakumar et al. | A comprehensive study and review of tuning the performance on database scalability in big data analytics | |
Cao et al. | Polardb-x: An elastic distributed relational database for cloud-native applications | |
US11449521B2 (en) | Database management system | |
CN109716280A (zh) | Flexible in-memory column store arrangement | |
JP2013088920A (ja) | Computer system and data management method | |
JP2008186141A (ja) | Data management method, data management program, data management system, and configuration management device | |
KR101792189B1 (ko) | Big data processing apparatus and method | |
US11550793B1 (en) | Systems and methods for spilling data for hash joins | |
Nurul | IMPROVED TIME COMPLEXITY AND LOAD BALANCE FOR HDFS IN MULTIPLE NAMENODE | |
CN116975053A (zh) | Data processing method, apparatus, device, medium, and program product | |
JP2022070669A (ja) | Database system and query execution method | |
Dash et al. | Review on fragment allocation by using clustering technique in distributed database system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | EP: The EPO has been informed by WIPO that EP was designated in this application |
Ref document number: 12755333
Country of ref document: EP
Kind code of ref document: A1
|
ENP | Entry into the national phase |
Ref document number: 2013503590
Country of ref document: JP
Kind code of ref document: A
|
WWE | WIPO information: entry into national phase |
Ref document number: 14003658
Country of ref document: US
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | EP: PCT application non-entry in European phase |
Ref document number: 12755333
Country of ref document: EP
Kind code of ref document: A1