WO2017104012A1 - Data management system and method

Info

Publication number: WO2017104012A1
Application number: PCT/JP2015/085149
Authority: WO
Inventors: 友隆塩野谷; 高橋　正和
Original assignee: 株式会社日立製作所
Priority date: 2015-12-16
Filing date: 2015-12-16
Publication date: 2017-06-22

Abstract

A data management system stores key conversion method information, which indicates, for each of a plurality of access ranges in a database unit, the key conversion method applied to that access range from among a plurality of key conversion methods. The database unit stores one or more key data pair units. For each of the one or more key data pair units, the data management system selects, from the plurality of key conversion methods, the key conversion method suited to the access pattern to which the access status of the key data pair unit belongs, and records in the key conversion method information an association between the selected key conversion method and the access range defined by the key data pair unit and a key range of the key data pair unit.

Description

Data management system and method

The present invention generally relates to data management for managing data in association with a key.

In recent years, data volumes have grown explosively owing to technologies that store data output from devices such as sensors over a network without human intervention (for example, Machine to Machine (M2M) or the Internet of Things (IoT)). Big data analysis, which combines the various data obtained in this way from multiple angles to predict events that could not previously be predicted, is now being performed.

Data generated by technologies such as M2M or IoT is generally not standardized (that is, it is generally unstructured), and the volume of data required for big data analysis is large. For this reason, it is becoming difficult to manage such data in a relational database, in which data is stored according to a fixed rule (schema) on a recording device of predetermined capacity.

Therefore, NoSQL is attracting attention because it combines a scale-out property, in which performance (for example, bandwidth) and capacity (storage capacity) can easily be changed by increasing or decreasing the number of servers storing the data, with a schemaless property, in which data can be stored without defining a schema.

However, NoSQL lacks basic functions of relational databases, typically high-speed data reference using indexes and cross-search by table join (search across multiple tables). Therefore, to improve data management performance (for example, search speed), data preprocessing that anticipates the actual access patterns is required.

Regarding the data preprocessing in NoSQL, the technology disclosed in Patent Document 1 is known. The technique of Patent Document 1 receives array representation data and changes the data into parallel individual elements.

Regarding the data management, the techniques of Patent Documents 2 and 3 are known. The techniques of Patent Documents 2 and 3 perform a data search with reference to a data usage history, and perform a normal data search when no search target data is found in the data search.

Patent Document 1: JP 2013-196205 A
Patent Document 2: JP 2001-1555028 A
Patent Document 3: JP 2010-267080 A

A data infrastructure is designed according to the access pattern (use case). The “data infrastructure” referred to here is a computer system composed of one or more data management devices, for example, one or more data management devices (nodes) that manage distributed key-value pairs.

In practice, however, the data infrastructure is designed for an assumed access pattern, because the actual access pattern cannot be known until the data infrastructure is in operation.

Therefore, it is not always possible to design a data infrastructure (particularly a key conversion method for distributing key-value pairs) that matches the use case. In particular, in big data analysis that analyzes a large amount of data generated by technologies such as M2M or IoT, access with different access patterns may be added to the constructed data infrastructure. That is, the access pattern may not be fixed.

This problem cannot be solved by any of Patent Documents 1 to 3. Specifically, the technique of Patent Document 1 relies on data design described in a document-oriented manner and does not consider access patterns. The techniques of Patent Documents 2 and 3 are techniques for structured data, and cannot be applied to an environment where unstructured data is stored.

The data management system stores key conversion method information, which indicates, for each of a plurality of access ranges in a database unit, the key conversion method applied to that access range from among a plurality of key conversion methods. The database unit is one or more databases having one or more data areas in which one or more key data pair units are stored. Each key data pair unit is one or more key data pairs, and each key data pair is a pair of a key and data. Each access range is a range defined by a set of a key data pair unit and a key range of that key data pair unit. The data management system receives a query, performs data reference processing or data storage processing on the database unit in response to the query, and returns a result for the query. In each of the data reference processing and the data storage processing, the data management system identifies, from the key conversion method information, the key conversion method corresponding to the access range involved in the processing, and uses the identified key conversion method. For each of the one or more key data pair units, the data management system selects, from the plurality of key conversion methods, the key conversion method suited to the access pattern to which the access status of the key data pair unit belongs, and records in the key conversion method information the correspondence between the selected key conversion method and the access range defined by the key data pair unit and a key range of the key data pair unit.

Even if the access pattern to which the access status of a key data pair unit in the database unit belongs changes, the key conversion method best suited to the changed access pattern can be applied to that key data pair unit.

FIG. 1 shows the configuration of a data management apparatus according to an embodiment.
FIG. 2 shows an example of a data space.
FIG. 3 shows the structure of an access source table.
FIG. 4 shows the structure of a query table.
FIG. 5 shows the structure of a key conversion function table.
FIG. 6 shows the structure of a key conversion function determination table.
FIG. 7 shows an outline of the processing of the data management device.
FIG. 8 shows the flow of data reference processing.
FIG. 9 shows the flow of data storage processing.
FIG. 10 shows the flow of a key conversion function determination process.
FIG. 11 shows the flow of a key conversion function narrowing process.
FIG. 12 shows an example of a setting screen.
FIG. 13 shows an example of unifying key conversion functions for all keys in a name space.

Hereinafter, an embodiment will be described.

In the following description, information may be described using an expression such as “xxx table”, but the information may be expressed in any data structure. That is, to show that the information does not depend on the data structure, an “xxx table” may be referred to as “xxx information”. In the following description, the configuration of each table is an example; one table may be divided into two or more tables, and all or part of two or more tables may be combined into a single table.

In the following description, processing may be described with a “program” as the subject. However, because a program is executed by a processor (for example, a CPU (Central Processing Unit)) and performs predetermined processing while using storage resources (for example, memory) and/or communication interface devices (for example, communication ports) as appropriate, the subject of the processing may be the processor. Processing described with a program as the subject may be processing performed by a processor or by an apparatus having that processor. The processor may include a hardware circuit that performs part or all of the processing. A program may be installed in each controller from a program source. The program source may be, for example, a program distribution computer or a computer-readable storage medium. In the following description, two or more programs may be realized as one program, or one program may be realized as two or more programs.

Moreover, in the following description, a reference sign (or the common part of a reference sign) is used when elements of the same kind are described without distinction, and an element ID (or the full reference sign of the element) may be used when elements of the same kind are described separately. For example, “data area 600” is used when data areas are not particularly distinguished, and “data area 600A” and “data area 600B” may be used when individual data areas are distinguished.

FIG. 1 shows a configuration of a data management apparatus according to the embodiment.

There is a data infrastructure (data management system) composed of one or more data management devices 1. In the present embodiment, there is one data management device 1 configuring the data infrastructure, but the data infrastructure may be configured by a plurality of data management devices 1.

The data management apparatus 1 is typically a computer, and includes a processor 10, a memory 12, a communication interface 14, a database 16, and a communication bus 18 that connects these components.

The processor 10 is an example of a processor unit, and is a device that executes a program stored in the memory 12 and reads and writes data from and to the memory 12 and the database 16. A plurality of processors 10 may be mounted. It is assumed that at least one processor 10 includes a timer 20 for measuring time. However, this configuration is not essential: an external timer may be connected via the communication interface 14, or a virtual clock may be used in which each unit of internal processing of the program executed by the processor 10 is treated as one clock tick.

The memory 12 is an example of a memory unit, and is a device that stores programs executed by the processor 10 and information referred to or updated by those programs. The memory 12 may be a volatile memory such as a DRAM (Dynamic Random Access Memory), a non-volatile storage device such as an SSD (Solid State Drive), or a write-once medium such as a CD-Recordable. Further, the memory 12 need not be a single device; a plurality of memories, or memories of a plurality of types, may be connected in a parallel configuration such as RAID (Redundant Array of Independent (or Inexpensive) Disks) or in a serial configuration such as JBOD (Just a Bunch Of Disks).

The communication interface 14 is an example of an interface unit, and is a device that transmits commands and data to other devices connected via the network 4, or receives commands and data from other devices. The network 4 may be realized using a physical cable or using a wireless technology. The network 4 may be a local area network (LAN) or a wide area network (WAN). There need not be only one communication interface 14. For example, when the data management apparatus 1 is connected to a plurality of types of networks, a plurality of types of communication interfaces corresponding to those networks may be mounted. A plurality of communication interfaces corresponding to the same network may also be mounted for the purpose of securing a dedicated network. Hereinafter, unless otherwise noted, data transmission to and reception from other apparatuses is implicitly performed via the communication interface 14 and the network 4.

The memory 12 stores a function determination program 100, a query determination program 102, a query division program 104, a key conversion program 106, a key reverse conversion program 108, a sort program 110, and a key conversion function group 112. These programs need not always be stored in the memory 12, and may be loaded from another device connected via the network 4 when executed. If a program has a compile module, the source file may be converted at execution time into native code that can be interpreted by the processor 10 and placed in the memory 12.

The function determination program 100 refers to the tables stored in the database 16, selects the key conversion function used to convert the key associated with data, and records the key and the key conversion function in the key conversion function table 304. The query determination program 102 determines whether a query (a query for a data area) received from outside (for example, from a client device of the data management apparatus 1) is a query for Get, Scan, or storage (details of these processes will be described later). The query division program 104 converts the query (dividing it as necessary) so that the key described in a Scan or Get query matches the converted key. The key conversion program 106 converts a key using a key conversion function. The key reverse conversion program 108 reverse-converts a key converted by a key conversion function (returns it to the original key). In this embodiment, there is a key conversion function narrowing process (a process that includes key recovery) using the key reverse conversion program 108, but key recovery is not indispensable in an aspect used only for data reference. That is, each key conversion function in the key conversion function group 112 need not be reversible. The sort program 110 sorts the data referred to by a Scan query in ascending or descending order of the keys associated with the data. The key conversion function group 112 is a set of a plurality of key conversion functions. A key conversion function is an example of a key conversion method, and is a function (for example, a program) that converts a key into a different value. Each key conversion function may be an arbitrary function. In the present embodiment, the plurality of key conversion functions are the following three.
FP (Field Promotion): The first value (hexadecimal) of data is given to the head of the key.
Salt (n): A number is added to the head of the key. The range of numbers is an integer from 0 to n.
Hash: A hash value of data is added to the head of the key. It is assumed that the hash value is given in 2 bytes (256 ways).
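
The three key conversion functions can be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the implementation of the embodiment: the separator `SEP`, the use of the first byte of the data as the "first value", the random choice of the salt number, and the use of SHA-256 truncated to two hexadecimal characters (256 values) are all hypothetical choices made here for the example.

```python
import hashlib
import random

SEP = ":"  # hypothetical separator between the added prefix and the original key


def fp(key: str, data: bytes) -> str:
    """Field Promotion: prefix the key with the first byte of the data, in hex (assumed reading)."""
    return f"{data[:1].hex()}{SEP}{key}"


def salt(key: str, n: int, rng: random.Random) -> str:
    """Salt(n): prefix the key with an integer from 0 to n inclusive (random choice is an assumption)."""
    return f"{rng.randint(0, n)}{SEP}{key}"


def hash_prefix(key: str, data: bytes) -> str:
    """Hash: prefix the key with a short hash of the data (SHA-256, first 2 hex chars, assumed)."""
    return f"{hashlib.sha256(data).hexdigest()[:2]}{SEP}{key}"


def reverse(converted: str) -> str:
    """Reverse conversion: drop the prefix that was added to the head of the key."""
    return converted.split(SEP, 1)[1]


converted = fp("100001", b"\xab\xcd")
print(converted, "->", reverse(converted))
```

Because all three functions only add a prefix, a single reverse conversion that strips the prefix is enough in this sketch.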

The communication bus 18 is a device for transmitting and receiving data between the components in the data management apparatus 1. For example, it is composed of an internal bus (for example, a CPU bus or an address bus) and an expansion bus (for example, Serial ATA or PCI Express). The communication bus 18 is generally a wired bus, but may be a wireless bus for the purpose of simplifying wiring. Hereinafter, unless otherwise noted, communication among the processor 10, the memory 12, the communication interface 14, and the database 16 is implicitly performed via the communication bus 18.

The database 16 is a general term for physical or logical devices that record each of one or more pieces of data paired with a key, and output the data paired with an input key (or key range). In the present embodiment, processing that refers to data by specifying one or more individual keys is described as “Get”, and processing that refers to data by specifying a key range is described as “Scan”. In response to a query, data reference processing (Get or Scan) or data storage processing (for example, Insert processing) is performed. A query for Get is described as a “Get query”, a query for Scan as a “Scan query”, and “reference query” is used as a generic term for Get queries and Scan queries. A query for data storage processing is described as a “storage query”.

In general, the database 16 is realized on a non-volatile storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive). However, as long as storage that is semi-permanent to the extent the application allows is possible, a non-volatile memory, a rewritable optical recording medium such as a Blu-ray Disc, or a write-once medium such as a CD-R may be used. At least a part of the database 16 may be realized in a partial area of the memory 12 (at least a part of the information stored in the database 16 may be stored in a partial area of the memory 12), or may be realized on an external storage device (not shown) connected via the communication interface 14.

The database 16 has a data space 159. The database 16 stores an access source table 300, a query table 302, a key conversion function table 304, and a key conversion function determination table 306 (details of each table will be described later). At least one of these tables 300, 302, 304, and 306 may be stored in the memory 12.

In the data space 159, a plurality of key data pairs are stored. A typical example of a key data pair is a key-value pair. In the data space 159, key-value pairs may be stored according to a distributed key-value store (KVS), or according to a column store (for example, all or part of each column may be stored in compressed form).

In FIG. 1, there is one database 16, but the database 16 may be a divided database (a set of database parts), and accordingly the data space 159 may be a divided data space (a set of data space parts). Each data space part is described as a “data area” in the present embodiment. The data space 159 consists of one or more data areas. Each data area contains one or more name spaces. In the present embodiment, a “name space” is a storage space in which key data pairs are stored.

Generally, multiple key conversion functions cannot be applied to a single name space, and once applied key conversion functions cannot be switched in the middle.

The data management device 1 manages a key conversion function for each key range, and can apply different key conversion functions in the same data space (the embodiment may be applied to a data management device having one database). The data management device 1 can automatically optimize a key conversion function (data preprocessing method) associated with a key range.

For this realization, for example, the following two elements are important.
(Element 1) How to determine the key conversion function.
(Element 2) How to record the key conversion function after determination.

For (Element 1), the key conversion function is determined based on the access pattern. Specifically, an optimal key conversion function is determined by a key conversion function determination process (FIG. 10) performed by the function determination program 100 using the key conversion function determination table 306 (FIG. 6). In addition to the key conversion function determination process, as an option, a key conversion function narrowing process (FIG. 11) may be performed.

Regarding (element 2), the determined key conversion function is recorded in the key conversion function table 304 (FIG. 5) in association with the key range.

Hereinafter, this embodiment will be described in detail.

FIG. 2 shows an example of the data space 159.

The data space 159 illustrated in FIG. 2 includes data areas 600A, 600B, and 600C.

Data for the data space 159 is stored, in association with a key, in a name space in one of the data areas 600A, 600B, and 600C. The name space is an example of a key data pair unit (one or more key data pairs). If the size of a name space (the number of key data pairs accumulated in the name space) exceeds a certain value, new data is stored in a different name space (a name space in a different data area or in the same data area), or the name space that is the data storage destination is divided into a plurality of blocks 408. A block 408 is a set of one or more (for example, a fixed number of) key data pairs arranged contiguously in an area suited to access, such as a logical block address space. As a result, the key data pairs are distributed over a plurality of different name spaces in the data areas 600A, 600B, and 600C. That is, the data management apparatus 1 (database 16) has a sharding function for distributing data over the data areas 600A, 600B, and 600C. Key data pairs can be distributed in various ways according to the sharding policy. For example, in consideration of capacity scaling, key data pairs may be distributed so that the number of name spaces and the size of the name spaces are equal across the data areas 600A, 600B, and 600C.

Note that a set of a key and a name space is unique throughout the data space 159. Therefore, if a key identical to the key associated with new data already exists in the name space in which the new data is to be stored, the data associated with the existing key (old data) is updated with the new data. In addition, key data pairs are stored in each name space sorted by key in ascending order.
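
The two properties of a name space described here (a key is unique within the name space, and pairs are kept in ascending key order) can be illustrated with a small in-memory sketch. The class name `NameSpace` and its methods are hypothetical and are not part of the embodiment.

```python
import bisect


class NameSpace:
    """Hypothetical in-memory name space: unique keys, kept in ascending order."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> data

    def put(self, key, data):
        # An existing key is updated in place; a new key is inserted in sorted position.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = data

    def get(self, key):
        return self._values.get(key)

    def scan(self, first, last):
        # Return all pairs whose key lies in [first, last], in ascending key order.
        lo = bisect.bisect_left(self._keys, first)
        hi = bisect.bisect_right(self._keys, last)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


ns = NameSpace()
ns.put("100002", "old")
ns.put("100001", "x")
ns.put("100002", "new")  # same key: the old data is replaced
print(ns.scan("100000", "100009"))
```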

FIG. 3 shows the configuration of the access source table 300.

The access source table 300 is a table that stores information related to the access source. Specifically, the access source table 300 has a record for each access source. Each record stores a name space 31, an address 32, an issue count 33 and an access time 34.

Name space 31 represents the name of the name space accessed from the access source.

Address 32 represents the address of the access source on the network 4. The address may be an IP address, a MAC address, or, in an environment where a Domain Name Server can convert it to an IP address, a URI (Uniform Resource Identifier) such as server1.abc.company.co.jp.

The issue count 33 represents the number of times a reference query has been issued from the access source (in other words, the number of reference queries received from the access source). The issue count 33 may be reset to an initial value at fixed intervals, or may be accumulated from when the name space is created until it is deleted.

The access time 34 represents the relationship between time slots (for example, morning (AM) and afternoon (PM)) and the number of reference queries received. A time slot need not be a 12-hour unit such as morning or afternoon; for example, it may be one hour or one minute. A time slot is a slot within a time section. The time section is one day in the example of FIG. 3, but is not limited to this. For example, the time section may be one week, and the time slots may be Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday.

According to the example of FIG. 3, reference queries for the name space “ns1: customer” are issued from the access source at address “192.168.0.10”, and the number of reference queries issued is 4010 (morning: 4010 reference queries; afternoon: 0 reference queries).

FIG. 4 shows the configuration of the query table 302.

The query table 302 is a table that stores information related to the breakdown of reference queries issued to the name space. The query table 302 has a record for each name space and reference query type. Each record stores a name space 41, a query 42, an issue count 43, and an average scan count 44.

Name space 41 represents the name of the name space. The query 42 represents the reference query type (“Get” or “Scan”). The issue count 43 represents the number of times a reference query of the corresponding type has been issued. The average number of scans 44 represents the average number of scans (the number of data referenced by one Scan query). If the value represented by the average number of scans 44 exceeds the sharding size (the maximum number of key data pairs per name space), data in other name spaces is also referred to, according to the average number of scans 44.

For a Get query that specifies X keys (X is an integer equal to or greater than 2), the issue count 43 may be incremented by 1, by X, or by Y. X is the number of specified keys. Y is the number of specified keys for which matching data exists (Y is an integer not less than 0 and not more than X).

From the access source table 300 in FIG. 3 and the query table 302 in FIG. 4, the access status can be known (the number of access sources, the access frequency, and the number of data referenced). Based on these three elements, the key conversion function table 304 is created by the function determination program 100. The elements representing the access status may be only one or two of these three elements, or another element may be used instead of or in addition to at least one of these three elements.

FIG. 5 shows the configuration of the key conversion function table 304.

The key conversion function table 304 is a table representing the relationship between the key range in the name space and the applied key conversion function. The key conversion function table 304 has a plurality of records, and each record stores a name space 51, a key range 52, and a key conversion function 53. A range specified from the set of the name space 51 and the key range 52 is an example of an access range.

Name space 51 represents the name of the name space.

The key range 52 represents the first key of the key range and the last key of the key range. When the first key is “null”, the key range includes all keys smaller than the last key. When the last key is “null”, the key range includes all keys larger than the first key.
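
As an illustration of how a record of the key conversion function table 304 might be matched against a key, including the “null” cases, consider the sketch below. The record type, the inclusive treatment of both bounds, the string comparison of equal-length keys, and the second record with an open upper end are assumptions made for this example; the first record reuses the “ns1: customer” key range mentioned in the description of S104.

```python
from typing import List, NamedTuple, Optional


class KeyRangeRecord(NamedTuple):
    namespace: str
    first: Optional[str]  # None plays the role of "null": no lower bound
    last: Optional[str]   # None plays the role of "null": no upper bound
    function: str


def find_function(table: List[KeyRangeRecord], namespace: str, key: str) -> Optional[str]:
    """Return the name of the key conversion function whose key range contains the key."""
    for rec in table:
        if rec.namespace != namespace:
            continue
        if rec.first is not None and key < rec.first:
            continue
        if rec.last is not None and key > rec.last:
            continue
        return rec.function
    return None  # no key conversion function recorded for this key


table = [
    KeyRangeRecord("ns1: customer", "090000", "100000", "Salt(5)"),
    KeyRangeRecord("ns1: customer", "100001", None, "FP"),  # hypothetical open-ended range
]
print(find_function(table, "ns1: customer", "095000"))
```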

The key conversion function 53 represents the name (type) of the applied key conversion function.

This embodiment assumes that keys increase (or decrease) monotonically (specifically, it is an implementation assumed to receive time-series data generated by a technology such as M2M or IoT). However, this assumption does not narrow the scope of the present invention. The present invention is also applicable to an environment in which keys change randomly or in which keys both increase and decrease.

FIG. 6 shows the configuration of the key conversion function determination table 306.

The key conversion function determination table 306 is a table representing the relationship between access patterns and key conversion functions. Specifically, the key conversion function determination table 306 has a plurality of records, and each record stores the data distribution 61, the access source information 62, and the key conversion function 63.

The data distribution 61 represents the data distribution degree per reference query. The data distribution degree is the number of reference name spaces per data area. A “reference name space” is a name space (for example, a table) referred to in the processing of a reference query. Therefore, for example, if the number of data areas = 4 and the number of reference name spaces = 2, the data distribution degree = 0.5 (= 2/4), and if the number of reference name spaces = 6, the data distribution degree = 1.5 (= 6/4). A more specific example of the data distribution degree is as follows.
(Assumptions)
・Average number of scans = 450
・Sharding size = 100
・Number of name spaces = 10
・Number of data areas = 10
・One name space exists in each of the 10 data areas.
When there are a plurality of data management apparatuses 1, the “number of data areas” here is the total number of data areas across the plurality of data management apparatuses 1.
When the reference query is a Scan query, the average number of scans is the value represented by the average number of scans 44 corresponding to the referenced name space and the reference query type “Scan”.
(Data distribution degree)
(Case 1) A key conversion function that, like FP, arranges data contiguously in the same name space of the same data area:
Data distribution degree = number of reference name spaces / number of data areas
= {(average number of scans / sharding size), rounded up to an integer} / number of data areas
= {(450 / 100), rounded up to an integer} / 10
= 5 / 10
= 0.5
(Case 2) A key conversion function that, like Salt or Hash, arranges data while switching the storage destination data area one item at a time (places data across multiple data areas):
Data distribution degree = number of reference name spaces / number of data areas
= (the average number of scans if it is less than the number of name spaces, otherwise the number of name spaces) / number of data areas
= 10 / 10
= 1.0
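
The two cases can be checked with a few lines of Python. The function names are hypothetical; the arithmetic follows the formulas of Case 1 and Case 2 and uses the assumed values listed above (average number of scans 450, sharding size 100, 10 name spaces, 10 data areas).

```python
import math


def distribution_contiguous(avg_scans: int, sharding_size: int, num_areas: int) -> float:
    """Case 1 (e.g. FP): data is contiguous, so a scan touches ceil(avg_scans / sharding_size) name spaces."""
    return math.ceil(avg_scans / sharding_size) / num_areas


def distribution_scattered(avg_scans: int, num_namespaces: int, num_areas: int) -> float:
    """Case 2 (e.g. Salt or Hash): each item may land in a different name space, capped by their number."""
    return min(avg_scans, num_namespaces) / num_areas


print(distribution_contiguous(450, 100, 10))  # 0.5
print(distribution_scattered(450, 10, 10))    # 1.0
```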

The access source information 62 includes an address number 621 that represents the number of access sources and an access frequency 622 that represents an access frequency (reference frequency). The access frequency referred to here is an access frequency per time section (per day in FIG. 6).

The key conversion function 63 represents the name (type) of the key conversion function.

Referring to the key conversion function determination table 306, the key conversion function associated with the data distribution degree, the access source number, and the access frequency can be specified.

In the example of FIG. 6, each of the data distribution degree, the number of access sources, and the access frequency is classified into two levels (that is, there is one boundary value), but at least one of them may be classified into three or more levels (there may be two or more boundary values).

Further, according to the example of FIG. 6, in the present embodiment, the “access pattern” is a set of data distribution degree, access source number, and access frequency. A key conversion function is associated with each access pattern. The access pattern may be at least one of data distribution, access source number, and access frequency.
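
A lookup of this kind might be sketched as follows. The contents of FIG. 6 are not reproduced in the text, so the boundary values and the assignments of key conversion functions to access patterns below are placeholders invented purely for illustration. Treating a value “at or above the boundary” as the high level is also an assumption.

```python
from typing import Dict, Optional, Tuple

Pattern = Tuple[bool, bool, bool]  # (distribution high?, many sources?, frequent access?)

# Placeholder boundary values, one per element; NOT taken from FIG. 6.
BOUNDARIES = {"distribution": 1.0, "sources": 10, "frequency": 1000}


def classify(distribution: float, sources: int, frequency: int) -> Pattern:
    """Map the three elements to an access pattern; True means at or above the boundary."""
    return (
        distribution >= BOUNDARIES["distribution"],
        sources >= BOUNDARIES["sources"],
        frequency >= BOUNDARIES["frequency"],
    )


def choose_function(table: Dict[Pattern, str], distribution: float,
                    sources: int, frequency: int) -> Optional[str]:
    """Return the key conversion function associated with the access pattern, if any."""
    return table.get(classify(distribution, sources, frequency))


# Placeholder determination table covering two of the eight possible patterns.
table = {(False, False, False): "FP", (True, True, True): "Hash"}
print(choose_function(table, 0.5, 3, 100))
```

With one boundary value per element there are eight possible access patterns; adding boundary values would only change `classify`, not the lookup.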

Hereinafter, processing performed in this embodiment will be described.

FIG. 7 shows a processing outline of the data management device 1.

The query determination program 102 receives a query from the client device (user) 6 via the network 4. The query determination program 102 determines the type of the received query (whether it is a reference query or a storage query). Depending on the determination result, data reference processing or data storage processing is performed, and the query determination program 102 returns the result to the client device 6.

The data reference process will be described with reference to FIGS.

S102: The query determination program 102 updates the access source table 300. Specifically, the query determination program 102 adds 1 to the issue count 33 corresponding to the address 32 representing the access source address (the address of the client device 6), and adds 1 to the value in the access time 34 corresponding to the time slot to which the access time (the time obtained from the timer 20) belongs. When no address 32 representing the access source address exists in the access source table 300, the query determination program 102 adds a record including the address 32 representing the access source address. The name space 31 in the added record represents the name space referred to by the reference query. When a plurality of name spaces are referred to by the reference query, a plurality of records respectively corresponding to those name spaces are added to the access source table 300.
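
The update in S102 could look roughly like the following. The dictionary layout, the function name `record_access`, and the AM/PM time slots keyed on the hour are assumptions for this sketch; the name space and address are the ones used in the example of FIG. 3.

```python
from datetime import datetime


def record_access(table: dict, namespace: str, address: str, when: datetime) -> dict:
    """Count one reference query from `address` against `namespace` at time `when`."""
    slot = "AM" if when.hour < 12 else "PM"
    # Add a record the first time this access source refers to this name space.
    rec = table.setdefault(
        (namespace, address),
        {"issue_count": 0, "access_time": {"AM": 0, "PM": 0}},
    )
    rec["issue_count"] += 1
    rec["access_time"][slot] += 1
    return rec


access_source_table = {}
record_access(access_source_table, "ns1: customer", "192.168.0.10", datetime(2015, 12, 16, 9, 30))
record_access(access_source_table, "ns1: customer", "192.168.0.10", datetime(2015, 12, 16, 15, 0))
print(access_source_table)
```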

S104: The query determination program 102 calls the query division program 104 and passes it the received reference query. The query division program 104 refers to the key conversion function table 304 and identifies the key conversion function corresponding to the target name space (the name space referred to by the reference query) and the key range to which the target key (the key specified in the reference query) belongs. The query division program 104 converts the target key using the identified key conversion function in the key conversion function group 112, and generates a query specifying the converted key. In S104, the query division program 104 may generate a plurality of queries for one reference query. For example, if the name space referred to is “ns1: customer” and the keys specified in the reference query are 090000 to 110000, the key conversion function “Salt (5)” is identified for 090000 to 100000, and the key conversion function “FP” is identified for 100001 to 110000. In this case, for the one reference query, the query division program 104 generates a query specifying the keys obtained by converting the target keys 090000 to 100000 with Salt (5), and a query specifying the keys obtained by converting the target keys 100001 to 110000 with FP.
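
The division of one Scan query along key range boundaries can be sketched as below, using the “ns1: customer” example from S104. The tuple layout of the table and the function name `split_scan` are hypothetical, keys are assumed to be zero-padded strings of equal length so that string comparison matches numeric order, and the actual key conversion of each sub-range is left out.

```python
from typing import List, Optional, Tuple

# (name space, first key or None, last key or None, key conversion function name)
Record = Tuple[str, Optional[str], Optional[str], str]


def split_scan(table: List[Record], namespace: str, start: str, end: str):
    """Split the key range [start, end] into one sub-range per matching table record."""
    parts = []
    for ns, first, last, function in table:
        if ns != namespace:
            continue
        lo = start if first is None else max(start, first)
        hi = end if last is None else min(end, last)
        if lo <= hi:  # the record's key range overlaps the requested range
            parts.append((function, lo, hi))
    return parts


table = [
    ("ns1: customer", "090000", "100000", "Salt(5)"),
    ("ns1: customer", "100001", "110000", "FP"),
]
for function, lo, hi in split_scan(table, "ns1: customer", "090000", "110000"):
    print(function, lo, hi)
```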

S106: The query division program 104 transmits the generated query to the database 16. The database 16 extracts data corresponding to the key specified in the query.

S108: The database 16 calls the key reverse conversion program 108. The key reverse conversion program 108 reverse-converts the converted key specified in the query received by the database 16 into the original key. For example, in this embodiment, since FP, Salt, and Hash are all processes that add an additional character string to the head of the key, the reverse conversion may be implemented as a process that removes the head character string.

S110: The database 16 uses the sort program 110 to perform ascending sort (or descending sort) with the original key. S108 and S110 are processes in which the key converted by the key conversion function specified in S104 is inversely converted and the search result is re-sorted based on the original key. If data continuity recovery is not required, S108 and S110 may be skipped.
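
S108 and S110 together can be pictured with the following Python sketch. It assumes that the head character string is separated from the original key by a delimiter (`SEP`), and that the salt is derived from a hash of the key modulo the bucket count; neither detail is stated in the embodiment, and the names `salt`, `reverse_convert`, and `restore_order` are hypothetical.

```python
import hashlib

SEP = "-"  # hypothetical delimiter between the added head string and the original key


def salt(key, buckets=5):
    """Stand-in for a Salt-type conversion: prepend a bucket number to the key."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets
    return f"{bucket}{SEP}{key}"


def reverse_convert(converted_key):
    """S108 sketch: remove the head character string to recover the original key."""
    return converted_key.split(SEP, 1)[1]


def restore_order(rows, descending=False):
    """S110 sketch: re-sort the extracted (converted key, data) rows by original key."""
    originals = [(reverse_convert(key), data) for key, data in rows]
    return sorted(originals, key=lambda row: row[0], reverse=descending)


# Rows come back from the database ordered by converted key, not by original key.
extracted = sorted((salt(k), f"data-{k}") for k in ["100003", "100001", "100002"])
result = restore_order(extracted)
```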

S112: The database 16 transmits the search result (the extraction result of S106 (or the sort result of S110)) to the query determination program 102. The query determination program 102 updates the query table 302. Specifically, the query determination program 102 adds 1 to the issue count 43 corresponding to the referenced name space and the type of the reference query. When the reference query is a Scan query, the query determination program 102 updates the average number of scans 44 corresponding to the referenced namespace and the type of the reference query.

The process including S102 to S112 is the data reference process. The query determination program 102 transmits the search result to the client device 6.

Data storage processing will be described with reference to FIGS.

S122: The query determination program 102 calls the key conversion program 106. The key conversion program 106 refers to the key conversion function table 304 and specifies a key conversion function corresponding to the name space specified by the storage query and the key range to which the key specified by the storage query belongs. The key conversion program 106 converts the specified key using the specified key conversion function. The key conversion program 106 generates a query that stores data to be stored and a post-conversion key corresponding to the data. A plurality of queries may be generated for one stored query for the same reason as described for S104.

S124: The key conversion program 106 transmits the generated query to the database 16. In response to the query, the database 16 stores the set of data and converted key in the namespace specified by the query. The database 16 notifies the query determination program 102 of the completion of storage.

The processing including the above S122 to S124 is data storage processing. The query determination program 102 transmits the result to the client device 6.

As shown in FIG. 7, the data management device 1 executes a key conversion function determination process in addition to the data reference process and the data storage process. The key conversion function determination process is started when a predetermined event occurs. The predetermined event may be, for example, any of the following: a certain time has passed (the time specified by the timer 20 has reached a predetermined time), the number of received queries has exceeded a predetermined number, or the amount of stored data has exceeded a fixed quantity. The function determination program 100 calculates the data distribution, the number of access sources, and the access frequency from the operation information (the access source table 300 and the query table 302). The function determination program 100 determines a key conversion function corresponding to the calculated data distribution, number of access sources, and access frequency by referring to the key conversion function determination table 306. The data management device 1 records the name (type) of the determined key conversion function in the key conversion function table 304 in association with the name of the namespace and the key range.

Hereinafter, the key conversion function determination process will be described in detail.

FIG. 10 shows a flow of key conversion function determination processing. The key conversion function determination process is performed for each name space. Hereinafter, a key conversion function determination process for one name space is taken as an example. In the description with reference to FIG. 10 (and FIG. 11), the one namespace is referred to as “target namespace”.

S142: The function determination program 100 refers to the query table 302 and identifies the average number of scans corresponding to the target namespace.

S144: The function determination program 100 specifies the number of data areas and the sharding size by issuing a query to the database 16.

S146: The function determination program 100 refers to the access source table 300 and calculates the number of access sources corresponding to the target namespace (the number of addresses 32 corresponding to the target namespace) and the access frequency corresponding to the target namespace (for example, the number of reference queries per day).

L140: Loop start. The function determination program 100 repeats the process up to the loop end L142 for each key conversion function included in the key conversion function group 112. In each loop, the key conversion function referred to is “function F”.

S150: The function determination program 100 calculates the data distribution for F using the average number of scans specified in S142, the number of data areas specified in S144, and the sharding size. The details of the data distribution calculation method are as described above.
S152: The function determination program 100 refers to the key conversion function determination table 306, and specifies a key conversion function corresponding to the data distribution calculated in S150, the number of access sources calculated in S146, and the access frequency. The key conversion function specified here is expressed as “function f”.

C140: The function determination program 100 determines whether function F = function f.
S154: When the determination result of C140 is true, the function determination program 100 registers F (the name of F) in the candidate list. The candidate list is a list of names of key conversion function candidates (temporarily determined key conversion functions). A key conversion function is determined for the target namespace from the candidate list.

L142: Loop end.
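
The loop from L140 to L142 can be sketched in Python as follows. The two helper functions are hypothetical stand-ins: `distribution_for` replaces the per-function data distribution calculation of S150, and `lookup` replaces the key conversion function determination table 306 for a single combination of ranges. The numeric values are illustrative, and the value used for Hash is an assumption.

```python
def build_candidate_list(functions, distribution_for, lookup, sources, frequency):
    """Return the names of the key conversion function candidates."""
    candidates = []
    for F in functions:                       # L140: loop over the function group
        dist = distribution_for(F)            # S150: data distribution for F
        f = lookup(dist, sources, frequency)  # S152: function f from the table
        if F == f:                            # C140: does F equal f?
            candidates.append(F)              # S154: register F as a candidate
    return candidates                         # L142: loop end


def distribution_for(name):
    # Hypothetical per-function data distributions.
    return {"FP": 0.3, "Salt": 0.6, "Hash": 0.6}[name]


def lookup(dist, sources, frequency):
    # Hypothetical stand-in for table 306, valid only for more than 10 access
    # sources and 10 accesses / day or less.
    return "Salt" if dist > 0.5 else "FP"


candidates = build_candidate_list(["FP", "Salt", "Hash"], distribution_for, lookup, 20, 5)
```

Because the data distribution is computed separately for each function F, more than one function can satisfy F = f, which is why the candidate list may hold two or more names.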

C142: The function determination program 100 refers to the candidate list and determines K (number of key conversion function candidates).

When K = 0, the function determination program 100 ends the key conversion function determination process for the target namespace.

When K = 1, the function determination program 100 updates the key conversion function table 304 (S156). Specifically, if the key conversion function represented by the candidate list is different from the latest key conversion function for the target namespace (the key conversion function whose key range has the end key “null”), the function determination program 100 records the association between the key conversion function represented by the candidate list, the target namespace, and a key range in the key conversion function table 304. More specifically, the function determination program 100 fixes the key range corresponding to the latest key conversion function for the target namespace (changes the end key from “null” to the largest key), and records a new key range (whose first key and end key are each “null”) and the key conversion function represented by the candidate list in the key conversion function table 304. When new data and a key associated with the data are stored in the target namespace, that associated key is recorded as the first key of the new key range.

When K = 2 or more, the function determination program 100 performs the key conversion function narrowing process (P2), which narrows the two or more key conversion function candidates down to one. The function determination program 100 then performs the above-described S156 for the one remaining key conversion function.
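
The branch at C142 and the table update of S156 can be sketched as follows. This is a minimal Python sketch under stated assumptions: rows of the key conversion function table 304 are modeled as dictionaries, `None` models a “null” key, and the names `update_key_conversion_table` and `determine` as well as the `narrow` callback are hypothetical.

```python
def update_key_conversion_table(table, namespace, selected, largest_key):
    """S156 sketch: close the latest key range and open a new one if needed."""
    latest = next(
        (row for row in table if row["namespace"] == namespace and row["end"] is None),
        None,
    )
    if latest is not None:
        if latest["function"] == selected:
            return False                  # same function as before: nothing to record
        latest["end"] = largest_key       # change the end key from "null" to the largest key
    # New key range whose first key and end key are both still "null".
    table.append({"namespace": namespace, "first": None, "end": None, "function": selected})
    return True


def determine(table, namespace, candidates, largest_key, narrow):
    """C142 sketch: branch on K, the number of key conversion function candidates."""
    if not candidates:                    # K = 0: end without changing the table
        return None
    if len(candidates) == 1:              # K = 1
        selected = candidates[0]
    else:                                 # K >= 2: narrowing process (P2)
        selected = narrow(candidates)
    update_key_conversion_table(table, namespace, selected, largest_key)
    return selected


table_304 = [{"namespace": "ns1:customer", "first": "090000", "end": None, "function": "Salt(5)"}]
chosen = determine(table_304, "ns1:customer", ["FP"], "100000", narrow=lambda c: c[0])
```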

As described above, the number of key conversion functions (names) recorded in the candidate list may be two or more. Specifically, as described above, the data distribution calculation method employed for FP is different from the data distribution calculation method employed for Salt (and Hash). Therefore, different data distributions are calculated for the same target namespace. As a result, a plurality of different key conversion functions (functions f) may be identified from the key conversion function determination table 306 even if the set of the number of access sources and the access frequency is the same. In this case, two or more key conversion function candidates (names) can be registered in the candidate list.

The function determination program 100 may select any one of the two or more key conversion function candidates, but in the present embodiment, the key conversion function narrowing process is performed as described above. Thus, two or more key conversion function candidates are narrowed down to one.

FIG. 11 shows a flow of the key conversion function narrowing process.

The key conversion function narrowing process selects, from the two or more key conversion function candidates, the key conversion function judged most likely to remain the selected key conversion function even if the range (data distribution 61, address number 621, or access frequency 622) corresponding to at least one of the data distribution (calculated in S150), the number of access sources (calculated in S146), and the access frequency (calculated in S146) is changed. Specifically, it is as follows.

L160: Loop start. The function determination program 100 repeats the process up to the loop end L162 for each key conversion function candidate represented by the candidate list. In each loop, a key conversion function candidate to be referred to is “function f”. The loop is as follows.

S162: The function determination program 100 identifies, from the key conversion function determination table 306, the key conversion function corresponding to a data distribution 61 different from the data distribution 61 to which the data distribution calculated in S150 belongs, combined with the address number 621 and the access frequency 622 to which the number of access sources and the access frequency calculated in S146 belong. The identified key conversion function is defined as “function f1”.

S164: The function determination program 100 identifies, from the key conversion function determination table 306, the key conversion function corresponding to an address number 621 different from the address number 621 to which the number of access sources calculated in S146 belongs, combined with the data distribution 61 to which the data distribution calculated in S150 belongs and the access frequency 622 to which the access frequency calculated in S146 belongs. The identified key conversion function is defined as “function f2”.

S166: The function determination program 100 identifies, from the key conversion function determination table 306, the key conversion function corresponding to an access frequency 622 different from the access frequency 622 to which the access frequency calculated in S146 belongs, combined with the data distribution 61 to which the data distribution calculated in S150 belongs and the address number 621 to which the number of access sources calculated in S146 belongs. The identified key conversion function is defined as “function f3”.

S168: The function determination program 100 determines how many of the functions f1, f2, and f3 match the function f, and registers the determined number (number of matches) in a predetermined area (for example, the candidate list) in association with, for example, the function f (its name).

L162: Loop end.

A specific example of the loop from L160 to L162 is as follows. Assume that the calculated data distribution = 0.3, the calculated number of access sources = 5, and the calculated access frequency = 5 accesses / day. In this case, according to the example of FIG. 6, the corresponding data distribution 61 = “0.5 or less”, the address number 621 = “10 or less”, and the access frequency 622 = “10 accesses / day or less”. Therefore, the function f = Salt. In S162, only the data distribution 61 among the corresponding data distribution 61, address number 621, and access frequency 622 is assumed to be different, that is, “0.5 or more”. As a result, the function f1 = Salt is obtained. Similarly, the function f2 = FP is obtained in S164, and the function f3 = FP is obtained in S166. Since the function f = Salt, the number of matches obtained in S168 is 1.

S170: The function determination program 100 selects the function f having the largest number of matches among the two or more functions f (two or more key conversion function candidates represented by the candidate list). When there are a plurality of functions f having the largest number of matches, an arbitrary function f among the plurality of functions f may be selected.

By the above key conversion function narrowing process, two or more key conversion function candidates are narrowed down to one.

That is, in the key conversion function narrowing process, the function determination program 100 assumes, for each of the calculated data distribution, number of access sources, and access frequency in turn, that the corresponding range is incorrect, and identifies a key conversion function from the key conversion function determination table 306 under that assumption. The more often the key conversion functions identified in this way match the function f, the more likely the function f is considered to be the key conversion function most suitable for the calculated data distribution, number of access sources, and access frequency.
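
The match counting of S162 to S170 can be sketched in Python as follows. The contents of `TABLE_306` are not taken from FIG. 6; they are inferred from the function values quoted in the worked examples of this description, with one cell (marked in the code) filled in arbitrarily. The sketch also assumes only two ranges per item, so that "a different range" simply means the other side of the boundary, and the names `bucket`, `match_count`, and `narrow` are hypothetical.

```python
DIST_BOUNDARY = 0.5      # data distribution 61: "0.5 or less" vs. above
SOURCES_BOUNDARY = 10    # address number 621: "10 or less" vs. above
FREQ_BOUNDARY = 10       # access frequency 622: "10 accesses / day or less" vs. above

# Key: (distribution above boundary, sources above boundary, frequency above boundary).
TABLE_306 = {
    (False, False, False): "Salt",
    (True, False, False): "Salt",
    (False, True, False): "FP",
    (False, False, True): "FP",
    (True, True, False): "Salt",
    (True, True, True): "FP",
    (False, True, True): "Hash",
    (True, False, True): "Hash",  # not covered by the worked examples; arbitrary placeholder
}


def bucket(dist, sources, freq):
    """Map calculated values to the ranges they belong to."""
    return (dist > DIST_BOUNDARY, sources > SOURCES_BOUNDARY, freq > FREQ_BOUNDARY)


def match_count(f, dist, sources, freq):
    """S162 to S168 sketch: count how many of f1, f2, f3 equal the candidate f."""
    ranges = bucket(dist, sources, freq)
    matches = 0
    for i in range(3):                    # change exactly one range at a time
        changed = list(ranges)
        changed[i] = not changed[i]
        if TABLE_306[tuple(changed)] == f:
            matches += 1
    return matches


def narrow(candidates, dists, sources, freq):
    """S170 sketch: select the candidate with the largest number of matches."""
    return max(candidates, key=lambda f: match_count(f, dists[f], sources, freq))


# Data distribution 0.3, 5 access sources, 5 accesses / day, function f = Salt.
example_matches = match_count("Salt", 0.3, 5, 5)
```

When several candidates share the largest number of matches, `max` returns the first of them, which is one way of making the arbitrary choice that S170 permits.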

The key conversion function determination table 306 can be adjusted (updated) either manually or automatically. That is, in this embodiment, at least one of manual adjustment (setting) and automatic adjustment (update) of the key conversion function determination table 306 may be performed. For example, it is as follows.

Depending on the state of the data stored in the data space 159 and the setting values of the key conversion function determination table 306, the key conversion function narrowing process may end up being performed every time the key conversion function determination process is performed.

Therefore, in the key conversion function determination process, the function determination program 100 may update the key conversion function determination table 306 in addition to the update of the key conversion function table 304 in S156.

Specifically, for example, in the key conversion function narrowing process, it is assumed that the data distribution for Salt is 0.6 and the data distribution for FP is 0.3. Also assume that the number of access sources is 20 and the access frequency is 5 accesses / day.

Based on this assumption, there are two key function candidates f input to C142, Salt and FP. Therefore, the key conversion function narrowing-down process is performed.

When S162, S164, and S166 in the key conversion function narrowing process are applied to each of Salt and FP, they are as follows. That is, when the function f = Salt, the function f1 = FP, the function f2 = Salt, and the function f3 = FP, and therefore the number of matches = 1. In the case of the function f = FP, the function f1 = Salt, the function f2 = Salt, and the function f3 = Hash, so that the number of matches = 0.

According to the result of this key conversion function narrowing-down process, Salt is a more robust algorithm than FP.

Here, if the boundary value of the data distribution had been 0.3 or less, FP would not have been selected in S152, C140, and S154, and only Salt would have been registered in the candidate list.

Therefore, the function determination program 100 updates the key conversion function determination table 306 in S156 based on the result of the key conversion function narrowing process. Specifically, the function determination program 100 changes the boundary value for at least one of the data distribution 61, the address number 621, and the access frequency 622 to a value based on the current boundary value and the calculated value (data distribution, number of access sources, or access frequency). In the above specific example, the function determination program 100 changes the boundary value of the data distribution from “0.5” to the calculated data distribution “0.3”. Instead of the calculated data distribution “0.3”, the average value “0.4” of the boundary value “0.5” before the change and the calculated data distribution “0.3” may be employed as the boundary value after the change.
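
The two ways of choosing the new boundary value mentioned above amount to the following small calculation. The function name `adjust_boundary` and its `mode` parameter are hypothetical and serve only to illustrate the arithmetic.

```python
def adjust_boundary(current, calculated, mode="replace"):
    """Return a new boundary value for one column of table 306.

    mode="replace": adopt the calculated value itself.
    mode="average": adopt the average of the current boundary and the calculated value.
    """
    if mode == "replace":
        return calculated
    if mode == "average":
        return (current + calculated) / 2
    raise ValueError(f"unknown mode: {mode}")


replaced = adjust_boundary(0.5, 0.3)              # boundary becomes 0.3
averaged = adjust_boundary(0.5, 0.3, "average")   # boundary becomes roughly 0.4
```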

For example, the administrator can set or adjust the key conversion function determination table 306 via a setting screen (for example, a GUI (Graphical User Interface)) shown in FIG. The setting screen may be displayed on, for example, a display device (not shown) included in the data management apparatus 1 or a display device included in a maintenance apparatus (not shown) connected to the data management apparatus 1 via the network 4. The maintenance apparatus is an example of a computer used by an administrator. The client device 6 may also serve as the maintenance apparatus.

The setting screen 70 of FIG. 12 may be, for example, a web browser screen. The setting screen 70 displays at least one of a table 1201 having the same contents as the key conversion function determination table 306 and a table 1202 having the same contents as the key conversion function table 304. Information on the setting screen 70 (information for display) may be provided by the function determination program 100.

The table 1201 is provided with a UI (User Interface) for changing the boundary value, such as the up / down button 72, for at least one boundary value among the data distribution degree, the number of addresses, and the access frequency. Using the UI, the administrator can change the boundary value to a desired value.

Also, the table 1201 is provided with a boundary value addition UI such as a button 74 for at least one boundary value among the data distribution degree, the number of addresses, and the access frequency. Using that UI, the administrator can add boundary values.

The setting screen 70 displays a UI 76 for starting the key conversion function determination process. Using the UI 76, the start interval of the timer 20 and the number of stored records (number of key data pairs) can be input as a trigger for starting the key conversion function determination process.

At least one of the tables 1201 and 1202 is provided with a UI for changing a key conversion function, such as a pull-down button 78. Using the UI, the administrator can change the key conversion function registered in the key conversion function determination table 306 or the key conversion function table 304 to another key conversion function. If the desired key conversion function does not exist in the key conversion function group 112, a new key conversion function may be uploaded to the data management apparatus 1 using the UI, and the key conversion function designated by the administrator may be replaced with the new key conversion function.

By using a setting completion UI such as the completion button 82, the changed boundary value and key conversion function displayed on the setting screen 70 are reflected in the table 306 or 304.

As mentioned above, although one embodiment was described, this is an illustration for explaining the present invention, and the scope of the present invention is not limited to this embodiment. The present invention can be implemented in various other forms.

For example, the architecture to which the present invention is applied is not limited. Specifically, for example, the present invention can be applied to both a master-slave type (for example, HBase) and a P2P (Peer To Peer) (for example, Cassandra).

Further, the present invention can be applied even when a structured data database is employed instead of or in addition to the unstructured data database.

Also, the data infrastructure (data management system) may be the data management device 1 itself (one data management device 1) or a plurality of data management devices 1. In the latter case, the plurality of processors 10 included in the plurality of data management devices 1 are an example of a processor unit, the plurality of memories 12 included in the plurality of data management devices 1 are an example of a memory unit, and the plurality of communication interfaces 14 included in the plurality of data management devices 1 are an example of an interface unit. One or a plurality of databases 16 is an example of a database unit.

Further, for example, when the key conversion function applied to the target namespace is changed in the key conversion function determination process of FIG. 10, the function determination program 100 may unify the keys converted according to the one or more key conversion functions applied to the target namespace so far into keys converted according to the latest key conversion function (the key conversion function after the change). For example, as illustrated in FIG. 13, when the key conversion function applied to the target namespace “ns1: customer” is changed to “Hash”, the function determination program 100 may change all the keys of the target namespace “ns1: customer” (the keys converted so far according to the key conversion functions “Salt (5)” and “FP”) to keys according to the latest key conversion function “Hash”. More specifically, the following may be performed for each key range of the target namespace “ns1: customer”. One key range is taken as an example (hereinafter referred to as the “target key range”).
The function determination program 100 calls the key reverse conversion program 108. The key reverse conversion program 108 reads all the keys belonging to the target key range from the target namespace “ns1: customer”, and reverse-converts each key based on the key conversion function corresponding to the target key range.
The function determination program 100 calls the key conversion program 106. The key conversion program 106 converts all the keys belonging to the target key range according to the latest key conversion function “Hash”, and stores the converted keys in the target namespace “ns1: customer” or in a new namespace. The new namespace may be a namespace created by the function determination program 100 or another program. When a new namespace is employed, the new namespace may be treated as the target namespace “ns1: customer”, and the existing target namespace “ns1: customer” may be saved (backed up).
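
The two steps above (reverse conversion followed by conversion with the latest function) can be sketched in Python as follows. The sketch assumes that every conversion prepends a head string separated by a delimiter, that the Hash-type conversion uses the first four hexadecimal digits of an MD5 digest, and that the result is written to a new namespace modeled as a new dictionary; the names `hash_convert`, `reverse_convert`, and `rekey` and the sample keys are hypothetical.

```python
import hashlib

SEP = "-"  # hypothetical delimiter between the added head string and the original key


def hash_convert(key):
    """Stand-in for a Hash-type conversion: prepend a short digest of the key."""
    digest = hashlib.md5(key.encode()).hexdigest()[:4]
    return f"{digest}{SEP}{key}"


def reverse_convert(converted_key):
    """Remove the head character string to recover the original key."""
    return converted_key.split(SEP, 1)[1]


def rekey(namespace_rows, convert=hash_convert):
    """Re-convert every key of a namespace with the latest key conversion function."""
    new_namespace = {}
    for converted_key, data in namespace_rows.items():
        original = reverse_convert(converted_key)   # key reverse conversion program 108
        new_namespace[convert(original)] = data     # key conversion program 106
    return new_namespace


# Keys previously converted by two different functions (illustrative values).
old_namespace = {"3-090001": "a", "FP1-100001": "b"}
new_namespace = rekey(old_namespace)
```

Writing into a new dictionary leaves `old_namespace` untouched, which corresponds to the option of saving (backing up) the existing namespace.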

1: Data management device

Claims

An interface unit that is one or more interface devices;
A memory unit that is at least one of one or more memories and one or more storage devices;
A processor unit that is connected to the interface unit and the memory unit and is one or more processors;
The memory unit stores key conversion method information which is information representing a key conversion method applied among a plurality of key conversion methods for each of a plurality of access ranges in the database unit;
The database unit is one or more databases having a plurality of data areas in which one or more key data pair units are stored,
Each of the one or more key data pair units is one or more key data pairs;
Each of the one or more key data pairs is a key and data pair;
Each of the plurality of access ranges is a range according to a set of a key data pair unit and a key range of the key data pair unit,
The processor unit receives a query through the interface unit, performs a data reference process or a data storage process on the database unit in response to the query, and returns a result of the query through the interface unit;
In each of the data reference process and the data storage process, the processor unit specifies a key conversion method corresponding to an access range in the process from the key conversion method information, and uses the specified key conversion method.
The processor unit executes a key conversion method determination process in which, for each of the one or more key data pair units,
(A) A key conversion method suitable for the access pattern to which the access status of the key data pair unit belongs is selected from the plurality of key conversion methods,
(B) The association between the selected key conversion method and the access range according to the key data pair unit and the key range for the key data pair unit is recorded in the key conversion method information.
Data management system.
The access status for each of the one or more key data pair units is at least one of:
the number of access sources, which is the number of access sources of the key data pair unit;
a reference frequency, which is the frequency with which a reference query having the key data pair unit as its reference source has been received; and
a reference data number, which is the number of data referenced by a query having the key data pair unit as its reference source,
The data management system according to claim 1.
Each of the one or more key data pair units is a namespace.
The data management system according to claim 1.
The processor unit is
Accepting an upload of a key conversion method added to the plurality of key conversion methods through the interface unit;
Adding the accepted key conversion method to the plurality of key conversion methods;
The data management system according to claim 1.
If the first key conversion method that is the key conversion method selected in (A) and the second key conversion method that is the latest key conversion method for the key data pair unit are different, in (B), The processor unit
(B1) Determine a key range corresponding to the first key conversion method,
(B2) recording a correspondence between the new key range and the second key conversion method in the key conversion method information;
The data management system according to claim 1.
The processor unit executes the key conversion method determination process periodically or whenever a fixed amount or a fixed number of key data pairs are stored in the database unit.
The data management system according to claim 1.
If the first key conversion method that is the key conversion method selected in (A) is different from the second key conversion method that is the latest key conversion method for the key data pair unit, the processor unit
Change all the keys converted by the second key conversion method in the key data pair unit to keys converted by the first key conversion method,
Changing the second key conversion method associated with the key data pair unit to the first key conversion method;
The data management system according to claim 1.
In (A), when there are two or more key conversion methods suitable for the access pattern to which the access status of the key data pair unit belongs, the processor unit selects, from the two or more key conversion methods, the key conversion method that is most likely to remain the same even if a part of the access pattern to which the access status of the key data pair unit belongs is different.
The data management system according to claim 1.
The processor unit adjusts at least one of a plurality of access patterns based on a result of the key conversion method narrowing-down process;
The data management system according to claim 8.
In (A), the processor unit selects a key conversion method suitable for an access pattern to which the access status belongs among a plurality of access patterns,
Each of the plurality of access patterns includes at least one of a data distribution range, an access source number range, and a reference frequency range,
The data distribution is the number of reference key data pair units that exist per data area, and is a number based on the number of reference data and the number of data areas,
The reference key data pair unit is a key data pair unit to be referred to.
The data management system according to claim 2.
In (A), the processor unit calculates, for each of the plurality of key conversion methods, a data distribution degree by a calculation method corresponding to the key conversion method (F), specifies the key conversion method (f) corresponding to the access pattern to which the calculated data distribution degree and the access status belong, and, when the key conversion method (F) matches the key conversion method (f), sets the key conversion method (F) as a key conversion method candidate,
If there is one key conversion method candidate, in (A), the processor unit selects the key conversion method.
The data management system according to claim 10.
If there are a plurality of key conversion method candidates, in (A), the processor unit executes a key conversion method narrowing-down process,
In the key conversion method narrowing-down process, the processor unit accesses each of the plurality of key conversion method candidates (f) including a range different from a range to which the data distribution degree for the key conversion method (f) belongs. A key conversion method corresponding to a pattern, a key conversion method corresponding to an access pattern including a range different from the range to which the access source number belongs in the access situation, and a range different from the range to which the reference frequency belongs in the access situation. Identifying at least one of the key conversion methods corresponding to the included access pattern;
In (A), the processor unit selects a key conversion method candidate (f) that most closely matches the identified key conversion function among the plurality of key conversion method candidates (f).
The data management system according to claim 11.
The processor unit changes a boundary value for at least one of the data distribution range, the access source number range, and the reference frequency range based on at least one of the following (x1) to (x3) referred to in the key conversion method narrowing-down process:
(X1) the data distribution degree for the selected key conversion method candidate (f),
(X2) the number of access sources in the access status, and
(X3) the reference frequency in the access status,
The data management system according to claim 12.
A data management method for a system in which data reference processing or data storage processing for a database unit is performed in response to a query,
For each of one or more key data pair units in a database unit that is one or more databases having a plurality of data areas in which one or more key data pair units are stored,
(A) A key conversion method suitable for the access pattern to which the access status of the key data pair unit belongs is selected from the plurality of key conversion methods,
(B) Correspondence between the selected key conversion method and an access range according to the key data pair unit and a key range for the key data pair unit, and a plurality of keys for each of the plurality of access ranges in the database unit Record in the key conversion method information, which is information representing the applied key conversion method of the conversion methods,
Each of the one or more key data pair units is one or more key data pairs;
Each of the one or more key data pairs is a key and data pair;
Each of the plurality of access ranges is a range according to a set of a key data pair unit and a key range of the key data pair unit,
In each of the data reference process and the data storage process, the system specifies a key conversion method corresponding to an access range in the process from the key conversion method information, and uses the specified key conversion method.
Data management method.
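The flow of steps (A) and (B) and the later lookup can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation. The conversion functions `identity` and `hashed`, the access pattern names `range_scan` and `write_heavy`, the pattern-to-method table, and the classes `KeyConversionInfo` and `Store` are all hypothetical. The claims do not fix any particular conversion methods or data layout.

```python
import hashlib
from dataclasses import dataclass, field


# Hypothetical key conversion methods.
def identity(key: str) -> str:
    return key


def hashed(key: str) -> str:
    return hashlib.sha256(key.encode()).hexdigest()


METHODS = {"identity": identity, "hash": hashed}

# Hypothetical mapping from access pattern to a suitable method.
PATTERN_TO_METHOD = {"range_scan": "identity", "write_heavy": "hash"}


def select_method(access_pattern: str) -> str:
    """Step (A): pick the method suited to the access pattern."""
    return PATTERN_TO_METHOD[access_pattern]


@dataclass
class KeyConversionInfo:
    """Key conversion method information, one entry per access range."""
    entries: list = field(default_factory=list)

    def record(self, unit, low, high, method_name):
        """Step (B): associate an access range with the selected method."""
        self.entries.append((unit, low, high, method_name))

    def lookup(self, unit, key):
        """Find the method recorded for the access range containing key."""
        for u, low, high, name in self.entries:
            if u == unit and low <= key < high:
                return METHODS[name]
        raise KeyError((unit, key))


class Store:
    """Toy database unit: a dict keyed by (unit, converted key)."""

    def __init__(self, info):
        self.info = info
        self.areas = {}

    def put(self, unit, key, data):  # data storage processing
        convert = self.info.lookup(unit, key)
        self.areas[(unit, convert(key))] = data

    def get(self, unit, key):  # data reference processing
        convert = self.info.lookup(unit, key)
        return self.areas[(unit, convert(key))]


info = KeyConversionInfo()
info.record("sensors", "a", "n", select_method("range_scan"))
info.record("sensors", "n", "{", select_method("write_heavy"))

store = Store(info)
store.put("sensors", "alpha", 1)  # key range [a, n): stored under the raw key
store.put("sensors", "zeta", 2)   # key range [n, {): stored under a hashed key
```

Both storage and reference go through the same lookup, so a key is always converted by the method recorded for its access range. In this sketch the half-open key ranges are compared as plain strings, which is one possible reading of "key range".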
A non-transitory computer-readable recording medium in which a computer program executed by a computer that performs data reference processing or data storage processing on a database unit in response to a query is recorded in a computer-readable manner,
For each of one or more key data pair units in a database unit that is one or more databases having a plurality of data areas in which one or more key data pair units are stored,
(A) A key conversion method suitable for the access pattern to which the access status of the key data pair unit belongs is selected from the plurality of key conversion methods,
(B) The correspondence between the selected key conversion method and the access range according to the key data pair unit and the key range of the key data pair unit is recorded in key conversion method information, which is information representing, for each of the plurality of access ranges in the database unit, the key conversion method applied from among the plurality of key conversion methods,
The computer program causes the computer to execute (A) and (B) above,
Each of the one or more key data pair units is one or more key data pairs;
Each of the one or more key data pairs is a key and data pair;
Each of the plurality of access ranges is a range according to a set of a key data pair unit and a key range of the key data pair unit,
In each of the data reference processing and the data storage processing, the computer specifies a key conversion method corresponding to an access range in the processing from the key conversion method information, and uses the specified key conversion method.
Non-transitory computer-readable recording medium.