WO2017088666A1 - Data storage method and coordination node - Google Patents

Data storage method and coordination node

Info

Publication number
WO2017088666A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
storage area
node
distribution
key
Prior art date
Application number
PCT/CN2016/105243
Other languages
English (en)
French (fr)
Inventor
张金玉
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP16867890.2A priority Critical patent/EP3373158B1/en
Publication of WO2017088666A1 publication Critical patent/WO2017088666A1/zh
Priority to US15/989,315 priority patent/US20180276252A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278Data partitioning, e.g. horizontal or vertical partitioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24558Binary matching operations
    • G06F16/2456Join operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9014Indexing; Data structures therefor; Storage structures hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9017Indexing; Data structures therefor; Storage structures using directory or table look-up
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • G06F3/0613Improving I/O performance in relation to throughput

Definitions

  • Embodiments of the present invention relate to the field of data information management technologies, and in particular, to a data storage method and a coordination node.
  • the distributed data management system includes at least one Coordinate Node (CN) and at least one Data Node (DN). Any coordination node and any data node can communicate, and any two data nodes can also communicate.
  • The coordination node includes a global optimization querier, which partitions or shards the data table according to a certain rule. For example, one or more columns of the data table are used as the distribution key of the data table, and the table is then partitioned or sharded according to the hash value of the distribution key columns.
  • the data in the data table after the partition or the slice is separately stored in a plurality of different data nodes according to a distribution key, so that the amount of data stored in each data node is uniformly distributed.
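The hash-based placement described above can be sketched as follows. This is a minimal illustration, not the method of any particular database: the function names `node_for` and `partition`, the use of CRC32 as the hash, and the modulo mapping to a node index are all assumptions made for the example.

```python
import zlib

def node_for(row, distribution_key, num_nodes):
    """Pick a data node index by hashing the row's distribution-key columns.

    CRC32 plus modulo is an assumed stand-in for whatever hash the system uses.
    """
    key_bytes = "|".join(str(row[col]) for col in distribution_key).encode("utf-8")
    return zlib.crc32(key_bytes) % num_nodes

def partition(rows, distribution_key, num_nodes):
    """Spread rows over num_nodes lists according to the distribution key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row, distribution_key, num_nodes)].append(row)
    return nodes

# Hypothetical table with two columns; "class" is used as the distribution key.
rows = [{"id": i, "class": f"c{i % 5}"} for i in range(100)]
nodes = partition(rows, ("class",), 3)
```

Rows that share the same distribution-key value always land on the same node, which is what allows an association on that key to run locally.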
  • Each data node can manage and manipulate its own stored data according to the instructions of the global optimized querier. Since there are many types of data tables covered in the database, among the data stored in each data node, the columns of each data table as distribution keys may be the same or different.
  • the user logs in to the client, connects to the database, and issues a data query command to the coordination node.
  • The global optimization querier of the coordination node receives the data query command, parses it, generates a data query plan, and sends the generated data query plan to each data node.
  • Each data node performs a data query after receiving the data query plan. If the distribution key of the associated data table is the same, the associated data can be correlated within the same node, and the result after the association operation is sent to the coordination node, thereby improving the data query efficiency.
  • Embodiments of the present invention provide a data storage method and a coordination node for efficiently storing data to match the requirements of current high efficiency queries.
  • An embodiment of the present invention provides a data storage method, including the following steps:
  • the coordination node CN determines, according to the data table identifier corresponding to an obtained piece of data, a plurality of distribution keys corresponding to the data table identifier;
  • the CN sends, according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node, wherein each of the at least one data node corresponds to at least one of the plurality of distribution keys, the storage area identifier of each data node represents the at least one distribution key corresponding to that data node, and the storage area identifier corresponding to each data node is used by that data node to store the data in its storage area.
  • the CN determines the multiple distribution keys corresponding to the data table identifier according to the data table identifier corresponding to the obtained one data, and sends the data to the at least one data node and the storage area identifier corresponding to the at least one data node according to the multiple distribution keys.
  • one piece of data can be stored by multiple distribution keys
  • the number of distribution keys corresponding to the data is increased, thereby increasing the probability that the keyword used by the user to query the data is a distribution key and reducing the probability that the user performs a query operation on a non-distribution key. This reduces the probability of data redistribution between nodes, thereby reducing overhead and resources such as network and content, and improving data query efficiency.
  • the CN sends the data and the corresponding storage area identifier of the at least one data node to the at least one data node according to the multiple distribution keys, including:
  • the CN transmits the data and the storage area identifier of the data node to a data node according to the plurality of distribution keys, and the storage area identifier is used by the data node to store the data in the common storage area of the data node.
  • when the data maps to the same data node under each of the plurality of distribution keys, the data is stored once in a common storage area of that data node, thereby saving data storage space and reducing network load.
  • the CN sends the data and the corresponding storage area identifier of the at least one data node to the at least one data node according to the multiple distribution keys, including:
  • the CN transmits data and a storage area identifier corresponding to each of the plurality of data nodes to the plurality of data nodes according to the plurality of distribution keys, and the storage area identifier corresponding to each of the plurality of data nodes is used by each of the plurality of data nodes.
  • the data is stored in at least one storage area of each data node, and the at least one storage area is a private storage area of each of the at least one distribution key corresponding to each data node.
  • the storage location of the data on the data node can be accurately determined according to the queried distribution key, thereby improving the data query efficiency.
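A small sketch of why a private storage area per distribution key makes the storage location easy to find at query time. The `DataNode` class, the choice to use the distribution key name as the storage area identifier, and the CRC32 hash are assumptions for illustration only.

```python
import zlib
from collections import defaultdict

class DataNode:
    """Hypothetical data node with one private storage area per distribution key."""

    def __init__(self):
        # storage area identifier -> {distribution-key value -> rows}
        self.areas = defaultdict(lambda: defaultdict(list))

    def store(self, area_id, row):
        self.areas[area_id][row[area_id]].append(row)

    def lookup(self, area_id, value):
        return list(self.areas[area_id].get(value, []))

NODES = [DataNode() for _ in range(3)]

def node_index(value):
    # Assumed hash: CRC32 of the value's string form, modulo the node count.
    return zlib.crc32(str(value).encode("utf-8")) % len(NODES)

def insert(row, distribution_keys):
    """Store one copy of the row per distribution key, each in that key's private area."""
    for key in distribution_keys:
        NODES[node_index(row[key])].store(key, row)

def query(key, value):
    """Locate the node and private area directly from the queried distribution key."""
    return NODES[node_index(value)].lookup(key, value)

row = {"id": 7, "class": "3B", "name": "Li"}
insert(row, ["id", "class"])
```

A query on either `id` or `class` hashes the queried value, goes to exactly one node, and reads only the storage area for that key.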
  • it also includes:
  • the CN obtains at least one historical query record corresponding to the data table identifier; wherein each history query record includes an association key, and the association key represents a keyword used for querying data in the historical query record;
  • the CN determines, according to the at least one historical query record, at least one associated key that occurs in the at least one historical query record and has a frequency greater than a threshold;
  • the CN determines the at least one associated key as at least one new distribution key corresponding to the data table identifier
  • the CN updates the correspondence between the data table identifier and the distribution key according to at least one new distribution key.
  • the probability that the keyword used by the user to query the data is a distribution key is further improved. It reduces the probability of users performing association operations on non-distributed keys, thereby reducing the probability of data redistribution between nodes, thereby reducing overhead and resources such as network and content, and improving data query efficiency.
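The history-based update of distribution keys described above can be sketched as follows. The record layout (`association_key` field), the function name, and the decision to append new keys to the existing ones rather than replace them are assumptions; the source only states that frequently used association keys become new distribution keys and that the correspondence is updated.

```python
from collections import Counter

def update_distribution_keys(table_to_keys, table_id, history, threshold):
    """Promote association keys whose frequency exceeds threshold to distribution keys.

    table_to_keys: dict mapping data table identifier -> list of distribution keys.
    history: list of query records, each a dict with an "association_key" entry.
    """
    counts = Counter(record["association_key"] for record in history)
    new_keys = [key for key, n in counts.items() if n > threshold]
    existing = table_to_keys.setdefault(table_id, [])
    for key in new_keys:
        if key not in existing:  # assumed policy: add, never duplicate
            existing.append(key)
    return table_to_keys

mapping = {"A": ["id"]}
history = [
    {"association_key": "aa"},
    {"association_key": "aa"},
    {"association_key": "bb"},
    {"association_key": "aa"},
]
update_distribution_keys(mapping, "A", history, threshold=2)
# mapping is now {"A": ["id", "aa"]}
```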
  • An embodiment of the present invention provides a coordination node, including:
  • a determining unit configured to determine, according to the obtained data table identifier corresponding to one piece of data, a plurality of distribution keys corresponding to the data table identifier
  • a sending unit configured to send, according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the plurality of distribution keys, the storage area identifier of each data node represents the at least one distribution key corresponding to that data node, and the storage area identifier corresponding to each data node is used by that data node to store the data in its storage area.
  • the CN determines the multiple distribution keys corresponding to the data table identifier according to the data table identifier corresponding to the obtained one data, and sends the data to the at least one data node and the storage area identifier corresponding to the at least one data node according to the multiple distribution keys.
  • one piece of data can be stored by multiple distribution keys. This storage method increases the number of distribution keys corresponding to the data, thereby increasing the probability that the keyword used by the user to query the data is a distribution key and lowering the probability that the user performs a query operation on a non-distribution key. This reduces the probability of data redistribution between nodes, thereby reducing overhead and resources such as network and content, and improving data query efficiency.
  • the sending unit is specifically configured to:
  • send the data and the storage area identifier of a data node to that data node according to the plurality of distribution keys, where the storage area identifier is used by the data node to store the data in a common storage area of the data node.
  • when the data maps to the same data node under each of the plurality of distribution keys, the data is stored once in a common storage area of that data node, thereby saving data storage space and reducing network load.
  • the sending unit is specifically configured to:
  • send the data and the storage area identifier corresponding to each of a plurality of data nodes to the plurality of data nodes according to the plurality of distribution keys, where the storage area identifier corresponding to each data node is used by that data node to store the data in at least one storage area of the data node, and the at least one storage area is the private storage area of each of the at least one distribution key corresponding to that data node.
  • the storage location of the data on the data node can be accurately determined according to the queried distribution key, thereby improving the data query efficiency.
  • a processing unit is further included for:
  • each history query record includes an association key, and the association key represents a keyword used by the query data in the historical query record;
  • the correspondence between the data table identifier and the distribution key is updated according to at least one new distribution key.
  • the probability that the keyword used by the user to query the data is a distribution key is further improved. It reduces the probability of users performing association operations on non-distributed keys, thereby reducing the probability of data redistribution between nodes, thereby reducing overhead and resources such as network and content, and improving data query efficiency.
  • An embodiment of the present invention provides a coordination node, including:
  • a memory configured to store a correspondence between the data table identifier and the plurality of distribution keys, and to provide the correspondence to the processor;
  • a processor configured to determine, by using the memory and according to the data table identifier corresponding to an obtained piece of data, a plurality of distribution keys corresponding to the data table identifier; and to send, by using the transceiver and according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node, wherein each of the at least one data node corresponds to at least one of the plurality of distribution keys, the storage area identifier of each data node represents the at least one distribution key corresponding to that data node, and the storage area identifier corresponding to each data node is used by that data node to store the data in its storage area;
  • a transceiver configured to send data to the at least one data node and a storage area identifier corresponding to each of the at least one data node.
  • the CN determines the multiple distribution keys corresponding to the data table identifier according to the data table identifier corresponding to the obtained one data, and sends the data to the at least one data node and the storage area identifier corresponding to the at least one data node according to the multiple distribution keys.
  • one piece of data can be stored by multiple distribution keys. This storage method increases the number of distribution keys corresponding to the data, thereby increasing the probability that the keyword used by the user to query the data is a distribution key and lowering the probability that the user performs a query operation on a non-distribution key. This reduces the probability of data redistribution between nodes, thereby reducing overhead and resources such as network and content, and improving data query efficiency.
  • the processor is specifically configured to:
  • the transceiver is specifically configured to send the data and the storage area identifier of the data node to one data node.
  • when the data maps to the same data node under each of the plurality of distribution keys, the data is stored once in a common storage area of that data node, thereby saving data storage space and reducing network load.
  • the processor is specifically configured to:
  • the transceiver is specifically configured to send data to the plurality of data nodes and a storage area identifier corresponding to each of the plurality of data nodes.
  • the storage location of the data on the data node can be accurately determined according to the queried distribution key, thereby improving the data query efficiency.
  • the processor is further configured to:
  • each history query record includes an association key, and the association key represents a keyword used by the query data in the historical query record;
  • update, according to at least one new distribution key, the correspondence between the data table identifier and the distribution key stored in the memory.
  • the probability that the keyword used by the user to query the data is a distribution key is further improved. It reduces the probability of users performing association operations on non-distributed keys, thereby reducing the probability of data redistribution between nodes, thereby reducing overhead and resources such as network and content, and improving data query efficiency.
  • the CN determines, according to the data table identifier corresponding to an acquired piece of data, a plurality of distribution keys corresponding to the data table identifier; the CN sends, according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node, wherein each of the at least one data node corresponds to at least one distribution key, the storage area identifier of each data node represents the at least one distribution key corresponding to that data node, and the storage area identifier corresponding to each data node is used by that data node to store the data in its storage area.
  • This storage method increases the number of distribution keys corresponding to the data, thereby increasing the probability that the keyword used by the user to query the data is a distribution key and reducing the probability that the user performs a query operation on a non-distribution key.
  • The probability of data redistribution between nodes is therefore reduced, which reduces overhead and resources such as network and content, improves data query efficiency, and matches the needs of current high-efficiency queries.
  • FIG. 1 is a schematic structural diagram of a distributed data management system according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a data management method implemented on a coordination node side according to an embodiment of the present invention.
  • FIG. 2a is a schematic structural diagram of internal data storage of a data node according to an embodiment of the present invention.
  • FIG. 2b is a schematic diagram of a query method when querying data in a data node according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a coordination node according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of another coordination node according to an embodiment of the present invention.
  • An identifier in the embodiments of the present invention is used to identify an object, and the object may be a data table or a storage area; examples are the data table identifier and the storage area identifier in the embodiments of the present invention.
  • An identifier can include at least one of a name, a number, and an ID (Identification), as long as the identified object can be distinguished from other objects.
  • A data table is a table consisting of a set of data corresponding to the same set of distribution keys; all data corresponding to one data table identifier corresponds to the same set of distribution keys, and that set includes multiple distribution keys.
  • FIG. 1 is a schematic diagram showing the architecture of a distributed data management system to which an embodiment of the present invention is applied.
  • the system includes at least one coordination node 102 and at least one data node 101. Any of the coordination nodes 102 and any of the data nodes 101 can communicate, and any two data nodes 101 can also communicate.
  • The user inputs data through the network device 103; the coordination node 102 interacts with the network device 103 and acquires information such as data from the network device.
  • The coordination node uses one or more columns of the data as the distribution key of the data table, then hashes each piece of data according to the distribution key to obtain a hash value corresponding to each piece of data, and stores each piece of data in a data node according to the hash value, so that the amount of data stored in each data node is evenly distributed.
  • In the prior art, data is stored by one distribution key.
  • the distribution key of table A is aa
  • the distribution key of table B is aa
  • the distribution key of table C is bb.
  • the association keys of Table A and Table B are all aa, that is, both Table A and Table B are associated on the distribution key aa. Therefore, Table A and Table B can be associated within the same node.
  • When Table B is associated on the association key bb and Table C is associated on the association key aa, neither table is associated on its own distribution key.
  • The data nodes therefore redistribute Table C by the association key aa and redistribute Table B by the association key bb, so that the association can be performed within the same node.
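The Table A, B, C example can be checked mechanically: a table has to be redistributed whenever the association key used in the query differs from the key it was distributed by. The helper name `tables_to_redistribute` and the dictionary layout are hypothetical.

```python
# Distribution keys taken from the example in the text.
distribution_key = {"A": "aa", "B": "aa", "C": "bb"}

def tables_to_redistribute(join):
    """join maps table name -> association key used in the query.

    Returns the tables whose association key is not their distribution key.
    """
    return sorted(t for t, key in join.items() if distribution_key[t] != key)

print(tables_to_redistribute({"A": "aa", "B": "aa"}))  # prints []
print(tables_to_redistribute({"B": "bb", "C": "aa"}))  # prints ['B', 'C']
```

With a single distribution key per table, any query that associates on another column falls into the second case.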
  • When data is stored by one distribution key, the probability that the keyword used by the user to query the data is the distribution key corresponding to the data is small, and the probability that the user performs query operations such as association on a non-distribution key is large, so the probability of data redistribution between nodes is also large.
  • It can be seen that when data is stored by one distribution key, as in the prior art, and then queried, the network overhead is large, more resources are consumed, and the query efficiency is low.
  • FIG. 2 is a schematic flowchart of a data storage method implemented on the CN side according to an embodiment of the present invention, including the following steps:
  • Step 201 The coordination node CN determines, according to the obtained data table identifier corresponding to one piece of data, a plurality of distribution keys corresponding to the data table identifier;
  • Step 202: The CN sends, according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the plurality of distribution keys, the storage area identifier of each data node represents the at least one distribution key corresponding to that data node, and the storage area identifier corresponding to each data node is used by that data node to store the data in its storage area.
  • the CN determines the multiple distribution keys corresponding to the data table identifier according to the data table identifier corresponding to the obtained one data, and sends the data to the at least one data node and the storage area identifier corresponding to the at least one data node according to the multiple distribution keys.
  • one piece of data can be stored by multiple distribution keys. This storage method increases the number of distribution keys corresponding to the data, thereby increasing the probability that the keyword used by the user to query the data is a distribution key and lowering the probability that the user performs a query operation on a non-distribution key. This reduces the probability of data redistribution between nodes, thereby reducing overhead and resources such as network and content, and improving data query efficiency.
  • each of the at least one data node corresponds to at least one of the plurality of distribution keys.
  • For example, the first distribution key corresponds to the first data node and the second data node, that is, when the data is stored by the first distribution key, it may be stored in the first data node or the second data node; the second distribution key may correspond to the second data node and the third data node, that is, when the data is stored by the second distribution key, it may be stored in the second data node or the third data node.
  • the storage area identifier of each of the at least one data node represents at least one of a plurality of distribution keys corresponding to each of the data nodes.
  • One optional implementation is that the storage area identifier of each of the at least one data node is the at least one distribution key, among the plurality of distribution keys, corresponding to that data node; another optional implementation is that the storage area identifier of each data node is the identifier of the storage area corresponding to the at least one distribution key, among the plurality of distribution keys, corresponding to that data node.
  • For example, if the distribution key is "class", the storage area of the data node can be divided into at least one area, including a storage area corresponding to the distribution key "class". The identifier of that storage area may be an identifier such as "001" that can identify the storage area, or it may be the distribution key "class" itself.
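The two optional forms of the storage area identifier can be illustrated with a short helper. The function name `storage_area_id` and the `area_ids` lookup table are hypothetical; only the values "class" and "001" come from the text.

```python
def storage_area_id(distribution_key, area_ids=None):
    """Return the storage area identifier for a distribution key.

    If no lookup table is given, the distribution key itself serves as the
    identifier; otherwise a separate identifier is looked up (assumed layout).
    """
    if area_ids is None:
        return distribution_key
    return area_ids[distribution_key]

print(storage_area_id("class"))                    # prints class
print(storage_area_id("class", {"class": "001"}))  # prints 001
```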
  • The user creates a data table by using a table-creation Structured Query Language (SQL) statement, the data table corresponds to a data table identifier, and multiple distribution keys are specified for the created data table. The information about the distribution keys corresponding to the data table identifier is stored in the CN or in another device connected to the CN, so that the CN can obtain that information from the other device.
  • the mapping between the data table identifier and the distribution key is pre-configured in the CN.
  • The CN pre-stores the correspondence between the data table identifier and the distribution keys according to the SQL statement. One data table identifier may correspond to one or more distribution keys, and the multiple distribution keys corresponding to one data table identifier are different from one another; any one distribution key may be any one column, or a combination of any several columns, of the data corresponding to the data table identifier.
  • the embodiment of the present invention does not limit this. The following content is only described by taking one data as an example.
  • The user inputs one or more pieces of data through the system, and each piece of data includes multiple columns.
  • The CN obtains the data entered by the user in real time or periodically; alternatively, data is inserted through an insert statement, or imported in bulk through a copy command or other tools. Regardless of whether the data is entered, inserted, or imported, the CN processes the acquired data piece by piece, and the CN can obtain the identifier of the data table in which each piece of data should be stored, that is, the CN can determine the data table identifier corresponding to a piece of data. After acquiring a piece of data, the CN determines the data table identifier corresponding to the piece of data and the plurality of distribution keys corresponding to the data table identifier, and stores the data according to the distribution keys.
  • For each distribution key, the CN determines the data node on which the data should be stored. Specifically, for each distribution key corresponding to the data table identifier of the data, the CN hashes the data according to the distribution key to obtain a hash value corresponding to the data, and determines, according to the hash value, the data node on which the data should be stored.
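The per-key node determination can be sketched as below. The table name `student`, the `TABLE_KEYS` mapping, and the CRC32-modulo hash are assumptions for illustration; the source does not specify the hash function or how a hash value maps to a node.

```python
import zlib

# Hypothetical correspondence between data table identifier and distribution keys.
TABLE_KEYS = {"student": ["id", "class"]}

def stable_hash(value):
    """Deterministic hash of a column value (assumed: CRC32 of its string form)."""
    return zlib.crc32(str(value).encode("utf-8"))

def target_nodes(table_id, row, num_nodes):
    """Return {distribution key: data node index} for one piece of data."""
    return {key: stable_hash(row[key]) % num_nodes for key in TABLE_KEYS[table_id]}

placement = target_nodes("student", {"id": 7, "class": "3B", "name": "Li"}, 3)
```

`placement` holds one target node per distribution key; the same piece of data may therefore map to several nodes, or to the same node more than once.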
  • The CN sends the data and the storage area identifier of the data node to one data node according to the plurality of distribution keys, and the storage area identifier is used by the data node to store the data in a common storage area of the data node.
  • The CN sends the data and the storage area identifier corresponding to each of the plurality of data nodes to the plurality of data nodes according to the plurality of distribution keys; the storage area identifier corresponding to each data node is used by that data node to store the data in at least one storage area of the data node, and the at least one storage area is the private storage area of each of the at least one distribution key corresponding to that data node.
  • Manner 2 for each of the plurality of distribution keys, determining, according to the distribution key, a data node to which the data should be stored, and a storage area identifier corresponding to the data node, where the CN data and data are at the data node Corresponding storage area identifier is sent to the data node; wherein the storage area identifier is an identifier of the private storage area corresponding to the distribution key, and the data node stores the data in the private storage area corresponding to the distribution key according to the storage area identifier; The data is stored on the same node according to all the distribution keys in turn, and the data is stored in the common storage area corresponding to all the distribution keys of the data node, and the data stored in other storage areas on the data node is stored. delete.
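The routing described above can be sketched as follows. This is only an illustration under stated assumptions: the CRC32 hash, the modulo placement, and the identifier strings `common` and `private:<key>` are hypothetical choices, not details given in the text.

```python
import zlib
from collections import defaultdict

COMMON_AREA = "common"  # hypothetical identifier of a node's common storage area


def node_for(record, key, node_count):
    """Hash the record's value in the distribution-key column to a node index."""
    digest = zlib.crc32(str(record[key]).encode("utf-8"))
    return digest % node_count


def routing_plan(record, distribution_keys, node_count):
    """Return {node index: [storage area identifiers]} for one record."""
    keys_by_node = defaultdict(list)
    for key in distribution_keys:
        keys_by_node[node_for(record, key, node_count)].append(key)
    if len(keys_by_node) == 1:
        # Every distribution key maps to the same node: one copy, common area.
        (node,) = keys_by_node
        return {node: [COMMON_AREA]}
    # Otherwise each node stores the record in the private area of every
    # distribution key that mapped to it.
    return {
        node: [f"private:{key}" for key in keys]
        for node, keys in keys_by_node.items()
    }
```

The CN would then send the record once to each node in the plan, together with that node's list of storage area identifiers.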
  • FIG. 2a shows an example of the internal data storage structure of a data node to which the embodiments of the present invention apply.
  • One coordination node 2101 is connected to three data nodes: data node 1, data node 2, and data node 3.
  • The data storage structure in each of the three data nodes is similar.
  • Data node 1 is taken as an example.
  • the distribution key is respectively 1.
  • After the CN obtains a piece of data, it obtains the data table identifier corresponding to the piece of data. According to the preset correspondence between data table identifiers and distribution keys, and the obtained data table identifier, the CN determines the plurality of distribution keys corresponding to the data table identifier. For each of the plurality of distribution keys, the CN determines the data node on which the data should be stored.
  • According to the plurality of distribution keys, the CN sends the data and the storage area identifier of the data node to the one determined data node.
  • The storage area identifier of the data node is the identifier of the common storage area corresponding to all of the plurality of distribution keys on the data node.
  • For example, the CN sends the data and the identifier of the common storage area of data node 1 to data node 1, so that data node 1 stores the data in the common storage area of data node 1.
  • According to the plurality of distribution keys, the CN sends the data and the storage area identifier corresponding to each of a plurality of data nodes to the plurality of data nodes. For example, when N is 4: the data is hashed according to distribution key 1, and the obtained hash value determines that the data should be stored in data node 1; the data is hashed according to distribution key 2, and the obtained hash value determines that the data should be stored in data node 2; the data is hashed according to distribution key 3, and the obtained hash value determines that the data should be stored in data node 3; the data is hashed according to distribution key 4, and the obtained hash value determines that the data should be stored in data node 1.
  • The CN sends to data node 1 the data together with the identifier of the private storage area corresponding to distribution key 1 and the identifier of the private storage area corresponding to distribution key 4, so that data node 1 stores the data in the private storage areas corresponding to distribution key 1 and distribution key 4 in data node 1.
  • The CN sends to data node 2 the data and the identifier of the private storage area corresponding to distribution key 2, so that data node 2 stores the data in the private storage area corresponding to distribution key 2 in data node 2. The CN sends to data node 3 the data and the identifier of the private storage area corresponding to distribution key 3, so that data node 3 stores the data in the private storage area corresponding to distribution key 3 in data node 3.
  • As a specific example, take N equal to 4.
  • It is determined according to distribution key 1 that the data should be stored in data node 1. The CN then sends to data node 1 the data and the identifier of the private storage area corresponding to distribution key 1, so that data node 1 stores the data in the private storage area corresponding to distribution key 1 in data node 1. It is determined according to distribution key 2 that the data should be stored in data node 1. The CN then sends to data node 1 the data and the identifier of the private storage area corresponding to distribution key 2, so that data node 1 stores the data in the private storage area corresponding to distribution key 2 in data node 1.
  • It is determined according to distribution key 3 that the data should be stored in data node 1. The CN then sends to data node 1 the data and the identifier of the private storage area corresponding to distribution key 3, so that data node 1 stores the data in the private storage area corresponding to distribution key 3 in data node 1. It is determined according to distribution key 4 that the data should be stored in data node 1. The CN then sends to data node 1 the data and the identifier of the private storage area corresponding to distribution key 4, so that data node 1 stores the data in the private storage area corresponding to distribution key 4 in data node 1.
  • At this point all four distribution keys have placed the data on the same node, data node 1. The data is therefore stored in the common storage area in data node 1, and data node 1 deletes the copies stored in the private storage areas corresponding to distribution key 1, distribution key 2, distribution key 3, and distribution key 4.
  • As another example, it is determined according to distribution key 1 that the data should be stored in data node 1. The CN then sends to data node 1 the data and the identifier of the private storage area corresponding to distribution key 1, so that data node 1 stores the data in the private storage area corresponding to distribution key 1 in data node 1. It is determined according to distribution key 2 that the data should be stored in data node 2. The CN then sends to data node 2 the data and the identifier of the private storage area corresponding to distribution key 2, so that data node 2 stores the data in the private storage area corresponding to distribution key 2 in data node 2.
  • It is determined according to distribution key 3 that the data should be stored in data node 3. The CN then sends to data node 3 the data and the identifier of the private storage area corresponding to distribution key 3, so that data node 3 stores the data in the private storage area corresponding to distribution key 3 in data node 3.
  • It is determined according to distribution key 4 that the data should be stored in data node 1. The CN then sends to data node 1 the data and the identifier of the private storage area corresponding to distribution key 4, so that data node 1 stores the data in the private storage area corresponding to distribution key 4 in data node 1. At this point it is determined that the four distribution keys place the data on a plurality of data nodes, namely data node 1, data node 2, and data node 3.
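The data-node side of Manner 2 in the two examples above can be illustrated with a toy model. The class and method names below are hypothetical, and in-memory lists stand in for real storage areas; it is a sketch of the store-then-consolidate behaviour, not the patented implementation.

```python
class DataNode:
    """Toy data node: one private storage area per distribution key plus a common area."""

    def __init__(self, distribution_keys):
        self.private = {key: [] for key in distribution_keys}
        self.common = []

    def store_private(self, key, record):
        """Store the record in the private storage area of one distribution key."""
        self.private[key].append(record)

    def consolidate(self, record):
        """Move the record to the common area if every distribution key placed it here.

        Returns True when the record was consolidated, False otherwise.
        """
        if all(record in area for area in self.private.values()):
            self.common.append(record)
            for area in self.private.values():
                area.remove(record)  # delete the now-redundant private copies
            return True
        return False
```

In the first example all four keys land on data node 1, so `consolidate` returns `True` and a single copy remains in the common area; in the second example data node 1 only receives the record for keys 1 and 4, so the private copies stay where they are.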
  • After a preset duration, the CN obtains at least one historical query record corresponding to the data table identifier, where each historical query record includes an association key, and the association key represents the keyword used to query data in the historical query record. The CN determines, according to the at least one historical query record, at least one association key whose frequency of occurrence in the at least one historical query record is greater than a threshold. The CN takes the determined at least one association key as at least one new distribution key corresponding to the data table identifier, and updates the correspondence between the data table identifier and distribution keys according to the at least one new distribution key.
  • The historical query record includes an equivalence association condition.
  • The equivalence association condition includes an association key.
  • The historical query record can be an SQL statement.
  • The CN finds the equivalence association condition in the historical query record, determines the association key from it, and uses the association key as a new distribution key.
  • For example, association key B is set as a new distribution key, and the preset correspondence between the data table identifier and distribution keys is updated; after the update, the data table identifier corresponds to distribution key B.
  • The CN redistributes the data in the data nodes by the updated distribution key B, either upon receiving a command issued by the user or automatically.
  • As another example, the data table identifier originally corresponds to distribution key A. Analysis of the historical query records finds that association key A, association key B, and association key C each occur with a frequency greater than the threshold.
  • Association key A, association key B, and association key C are all set as new distribution keys, and in the updated correspondence the data table identifier corresponds to distribution key A, distribution key B, and distribution key C.
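The selection of new distribution keys from historical query records might look like the following sketch. It assumes the records are SQL strings, uses a deliberately simple regular expression in place of a real SQL parser, and treats "frequency" as a raw occurrence count; all three are assumptions made for illustration.

```python
import re
from collections import Counter

# Toy pattern for an equivalence association condition such as "A.aa = B.aa";
# a real system would read this from the parsed query instead.
EQUI_JOIN = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")


def association_keys(sql, table):
    """Columns of `table` that appear in equivalence conditions of one statement."""
    keys = []
    for left_table, left_col, right_table, right_col in EQUI_JOIN.findall(sql):
        if left_table == table:
            keys.append(left_col)
        if right_table == table:
            keys.append(right_col)
    return keys


def new_distribution_keys(history, table, threshold):
    """Association keys of `table` seen more than `threshold` times in `history`."""
    counts = Counter(
        key for sql in history for key in association_keys(sql, table)
    )
    return sorted(key for key, count in counts.items() if count > threshold)
```

The returned keys would then be written into the correspondence between the data table identifier and its distribution keys.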
  • The CN redistributes the data in the data nodes by the updated distribution key B and distribution key C.
  • Specifically, the CN sends a redistribution command to each data node. Each data node determines, according to the new distribution key B and distribution key C, which of the data it stores should remain on it and which belongs on other data nodes, and sends the data that belongs on other data nodes to those nodes.
  • For example, data node 1 determines that a piece of data it stores should be stored in data node 4; data node 1 sends the data to data node 4, which receives the data and stores it.
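The per-node decision during redistribution can be sketched as below, handling one new distribution key per call. The CRC32-modulo placement and the function names are assumptions carried over for illustration; the text does not specify the hash function or how duplicate copies are avoided when several keys are processed.

```python
import zlib
from collections import defaultdict


def owner(record, key, node_count):
    """Node index that `record` hashes to under distribution key `key`."""
    return zlib.crc32(str(record[key]).encode("utf-8")) % node_count


def redistribute(local_records, key, self_index, node_count):
    """Split a node's records into those it keeps and those it must send away.

    Returns (keep, outgoing) where outgoing maps target node index -> records.
    """
    keep, outgoing = [], defaultdict(list)
    for record in local_records:
        target = owner(record, key, node_count)
        if target == self_index:
            keep.append(record)
        else:
            outgoing[target].append(record)
    return keep, dict(outgoing)
```

A node would call this once for distribution key B and once for distribution key C, then transmit each `outgoing` list to its target node.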
  • the SQL statement for data distribution of the data table TB_SVC_SUBS_HIST by the distribution key A, the distribution key B, and the distribution key C may be:
  • The user logs in to the client, connects to the database, and issues a data query command.
  • The global query optimizer of the CN receives the data query command, parses it, generates a data query plan, and distributes the generated data query plan to each data node. After receiving the data query plan, each data node performs the data query.
  • For any data node, the data node performs a hash operation according to the received data query plan and determines from the result where the data is stored. For example, data node 1 determines that the data is in the private storage area corresponding to distribution key 1, the private storage area corresponding to distribution key 4, or the common storage area corresponding to all distribution keys of the data node. The data node then scans the data at the determined location.
  • Because the probability that the association key in the query plan is a distribution key is large, the probability that the data to be associated resides in the same data node is also large. If it is determined that two pieces of data are stored in the same data node, the data is associated within each data node that stores it, each data node sends its association result to the CN, and the CN merges the association results and returns the queried data to the client, where it is presented to the user. If it is determined that two pieces of data are stored in different data nodes, data is transferred between the data nodes so that the two pieces of data are stored in the same data node.
  • FIG. 2b shows an example of the query method used when querying data in the data nodes according to an embodiment of the present invention.
  • On data node 1, data 1 is distributed by distribution key A and distribution key B.
  • Data 2 is distributed by distribution key A.
  • Data 3 is distributed by distribution key B.
  • Suppose the user needs to associate data 1 with data 2 on distribution key A, and data 1 with data 3 on distribution key B. Taking data node 1 as an example, data node 1 queries the private storage area corresponding to distribution key A and the common storage area in data node 1, and associates data 1 and data 2 within data node 1; data node 1 queries the private storage area corresponding to distribution key B and the common storage area in data node 1, and associates data 1 and data 3 within data node 1. Data node 2 and data node 3 both perform a similar procedure, which is not described again here.
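The local association just described can be sketched as a small hash join over the private storage area of the queried distribution key plus the common storage area. The nested-dictionary layout of a node and the function names are hypothetical, chosen only to make the example self-contained.

```python
def candidate_rows(node, table, key):
    """Rows of `table` on this node that are reachable via distribution key `key`."""
    areas = node[table]
    return areas["private"].get(key, []) + areas["common"]


def local_join(node, left, right, key):
    """Associate two tables on `key` using only rows stored on this node."""
    index = {}
    for row in candidate_rows(node, right, key):
        index.setdefault(row[key], []).append(row)
    return [
        (l_row, r_row)
        for l_row in candidate_rows(node, left, key)
        for r_row in index.get(l_row[key], [])
    ]
```

Each data node would run such a join on its own rows and send the resulting pairs to the CN, which merges the per-node results.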
  • The SQL statement for associating data 1 and data 2 on distribution key A is as follows, where data 1 is TB_SVC_SUBS_HIST, data 2 is DIL-BILL, and distribution key A is SUBS-ID:
  • The SQL statement for associating data 1 and data 3 on distribution key B is as follows, where data 1 is TB_SVC_SUBS_HIST, data 3 is GPRS-CDR, and distribution key B is MSISDN:
  • In the embodiments of the present invention, the CN determines, according to the data table identifier corresponding to an acquired piece of data, the plurality of distribution keys corresponding to the data table identifier; according to the plurality of distribution keys, the CN sends to at least one data node the data and the storage area identifier corresponding to each of the at least one data node. Each of the at least one data node corresponds to at least one distribution key, the storage area identifier of each data node indicates the at least one distribution key corresponding to that data node, and each data node uses its corresponding storage area identifier to store the data in its storage area.
  • Because the CN determines the plurality of distribution keys corresponding to the data table identifier of an acquired piece of data, and sends the data and the corresponding storage area identifiers to at least one data node according to those distribution keys, a piece of data can be stored by multiple distribution keys. This storage method increases the number of distribution keys corresponding to the data, which raises the probability that the keyword a user queries with is a distribution key and lowers the probability that the user performs query operations such as association on a non-distribution key. This in turn lowers the probability of data redistribution between nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.
  • FIG. 3 is a schematic structural diagram of a coordination node according to an embodiment of the present invention.
  • A coordination node 300 is provided to perform the foregoing method. As shown in FIG. 3, it includes a determining unit 301 and a sending unit 302, and optionally a processing unit 303:
  • The determining unit is configured to determine, according to the data table identifier corresponding to an acquired piece of data, the plurality of distribution keys corresponding to the data table identifier.
  • The sending unit is configured to send, according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the plurality of distribution keys, the storage area identifier of each data node indicates the at least one of the plurality of distribution keys corresponding to that data node, and each data node uses its corresponding storage area identifier to store the data in its storage area.
  • Optionally, the sending unit is specifically configured to:
  • send, according to the plurality of distribution keys, the data and the storage area identifier of a data node to one data node, where the storage area identifier is used by the data node to store the data in the common storage area of the data node.
  • Optionally, the sending unit is specifically configured to:
  • send, according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of a plurality of data nodes to the plurality of data nodes, where each of the plurality of data nodes uses its corresponding storage area identifier to store the data in at least one storage area of that data node, the at least one storage area being the private storage area of each of the at least one distribution key corresponding to that data node.
  • Optionally, a processing unit is further included, configured to:
  • obtain, after a preset duration, at least one historical query record corresponding to the data table identifier, where each historical query record includes an association key, and the association key represents the keyword used to query data in the historical query record;
  • determine, according to the at least one historical query record, at least one association key whose frequency of occurrence in the at least one historical query record is greater than a threshold;
  • take the determined at least one association key as at least one new distribution key corresponding to the data table identifier;
  • update the correspondence between the data table identifier and distribution keys according to the at least one new distribution key.
  • In the embodiments of the present invention, the CN determines, according to the data table identifier corresponding to an acquired piece of data, the plurality of distribution keys corresponding to the data table identifier; according to the plurality of distribution keys, the CN sends to at least one data node the data and the storage area identifier corresponding to each of the at least one data node. Each of the at least one data node corresponds to at least one distribution key, the storage area identifier of each data node indicates the at least one distribution key corresponding to that data node, and each data node uses its corresponding storage area identifier to store the data in its storage area.
  • Because the CN determines the plurality of distribution keys corresponding to the data table identifier of an acquired piece of data, and sends the data and the corresponding storage area identifiers to at least one data node according to those distribution keys, a piece of data can be stored by multiple distribution keys. This storage method increases the number of distribution keys corresponding to the data, which raises the probability that the keyword a user queries with is a distribution key and lowers the probability that the user performs query operations such as association on a non-distribution key. This in turn lowers the probability of data redistribution between nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.
  • FIG. 4 is a schematic structural diagram of another coordination node according to an embodiment of the present invention.
  • A coordination node 400 is provided to implement the foregoing method flow.
  • As shown in FIG. 4, it includes a processor 401, a memory 402, and a transceiver 403:
  • The memory is configured to store the correspondence between the data table identifier and the plurality of distribution keys, instructions, and data related to the processor's execution of the instructions, and to provide the processor with the correspondence and the instructions.
  • The processor is configured to read the instructions in the memory and perform the following procedure:
  • determine, through the memory and according to the data table identifier corresponding to an acquired piece of data, the plurality of distribution keys corresponding to the data table identifier; and send, through the transceiver and according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the plurality of distribution keys, the storage area identifier of each data node indicates the at least one of the plurality of distribution keys corresponding to that data node, and each data node uses its corresponding storage area identifier to store the data in its storage area.
  • The transceiver is configured to send to the at least one data node the data and the storage area identifier corresponding to each of the at least one data node.
  • The transceiver is further configured to send signaling and data to another coordination node, receive signaling and data sent by another coordination node, send signaling to data nodes, receive data sent by data nodes, and so on.
  • Optionally, the processor is specifically configured to send, through the transceiver and according to the plurality of distribution keys, the data and the storage area identifier of a data node to one data node, where the storage area identifier is used by the data node to store the data in the common storage area of the data node.
  • Correspondingly, the transceiver is specifically configured to send the data and the storage area identifier of the data node to one data node.
  • Optionally, the processor is specifically configured to send, through the transceiver and according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of a plurality of data nodes to the plurality of data nodes, where each of the plurality of data nodes uses its corresponding storage area identifier to store the data in at least one storage area of that data node, the at least one storage area being the private storage area of each of the at least one distribution key corresponding to that data node.
  • Correspondingly, the transceiver is specifically configured to send to the plurality of data nodes the data and the storage area identifier corresponding to each of the plurality of data nodes.
  • Optionally, the processor is further configured to:
  • obtain, after a preset duration, at least one historical query record corresponding to the data table identifier, where each historical query record includes an association key, and the association key represents the keyword used to query data in the historical query record; determine, according to the at least one historical query record, at least one association key whose frequency of occurrence is greater than a threshold; and take the determined at least one association key as at least one new distribution key corresponding to the data table identifier;
  • update the correspondence, stored in the memory, between the data table identifier and distribution keys according to the at least one new distribution key.
  • The bus architecture may include any number of interconnected buses and bridges, which link together various circuits of one or more processors, represented by the processor, and of memory, represented by the memory.
  • The bus architecture may also link various other circuits such as peripheral devices, voltage regulators, and power management circuits; these are well known in the art and are therefore not described further here.
  • The bus interface provides an interface.
  • The transceiver may be a plurality of components, that is, it includes a transmitter and a receiver, and provides a unit for communicating with various other apparatuses over a transmission medium.
  • The processor is responsible for managing the bus architecture and general processing, and the memory can store data that the processor uses when performing operations.
  • In the embodiments of the present invention, the CN determines, according to the data table identifier corresponding to an acquired piece of data, the plurality of distribution keys corresponding to the data table identifier; according to the plurality of distribution keys, the CN sends to at least one data node the data and the storage area identifier corresponding to each of the at least one data node. Each of the at least one data node corresponds to at least one distribution key, the storage area identifier of each data node indicates the at least one distribution key corresponding to that data node, and each data node uses its corresponding storage area identifier to store the data in its storage area.
  • Because the CN determines the plurality of distribution keys corresponding to the data table identifier of an acquired piece of data, and sends the data and the corresponding storage area identifiers to at least one data node according to those distribution keys, a piece of data can be stored by multiple distribution keys. This storage method increases the number of distribution keys corresponding to the data, which raises the probability that the keyword a user queries with is a distribution key and lowers the probability that the user performs query operations such as association on a non-distribution key. This in turn lowers the probability of data redistribution between nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.
  • Embodiments of the present invention may be provided as a method or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
  • The computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction device, and the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

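The invention relates to the field of data information management technologies, and in particular to a data storage method and a coordination node (CN), used to store data effectively so as to improve query efficiency when the data is queried. The CN determines, according to the data table identifier corresponding to an acquired piece of data, a plurality of distribution keys corresponding to the data table identifier (201). According to the plurality of distribution keys, the CN sends to at least one data node the data and the storage area identifier corresponding to each of the at least one data node, where each of the at least one data node corresponds to at least one distribution key, the storage area identifier of each data node indicates the at least one distribution key corresponding to that data node, and each data node uses its corresponding storage area identifier to store the data in its storage area (202).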

Description

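Data storage method and coordination node
This application claims priority to Chinese Patent Application No. 201510867678.4, filed with the Chinese Patent Office on November 27, 2015 and entitled "Data storage method and coordination node", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of the present invention relate to the field of data information management technologies, and in particular to a data storage method and a coordination node.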
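Background
In the prior art, data from different domains is usually stored in different databases so that it can be managed. The different databases form a distributed data management system. A distributed data management system includes at least one coordination node (Coordinate Node, CN) and at least one data node (Data Node, DN). Any coordination node can communicate with any data node, and any two data nodes can also communicate with each other.
The coordination node contains a global query optimizer, which partitions or shards each data table according to some rule. For example, one or more columns of a data table are taken as the distribution key columns of that table, and the data is partitioned or sharded according to the hash value of the columns of one distribution key. The data in the partitioned or sharded table is then stored across multiple different data nodes according to that one distribution key, so that the amount of data stored on each data node is balanced. Each data node can manage and operate on the data it stores according to instructions from the global query optimizer. Because the database covers many kinds of data tables, the columns used as the distribution key may be the same or different from one data table to another among the data stored on the data nodes.
In a distributed data management system, a user logs in to a client, connects to the database, and issues a data query command to the coordination node. The global query optimizer of the coordination node receives the data query command, parses it, generates a data query plan, and distributes the generated plan to every data node. After receiving the data query plan, each data node performs the data query. If the related data tables being associated (joined) have the same distribution key, the associated data can be joined inside the same node, and the result of the association operation is sent to the coordination node, which improves data query efficiency considerably.
With the data storage method of the prior art, data stored on the data nodes often has to be redistributed while a user is querying it. That is, during a query, data must be transferred between multiple data nodes and re-associated with distribution keys, which lowers query efficiency and increases the load on the network.
Thus, the existing data storage method makes it more likely that a user queries the stored data by a non-distribution key, which causes large amounts of data to be transferred between data nodes and adds unnecessary overhead to the query process. This falls short of the current demand for efficient queries.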
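Summary of the Invention
Embodiments of the present invention provide a data storage method and a coordination node, used to store data effectively so as to meet the current demand for efficient queries.
An embodiment of the present invention provides a data storage method, including the following steps:
A coordination node CN determines, according to the data table identifier corresponding to an acquired piece of data, a plurality of distribution keys corresponding to the data table identifier;
According to the plurality of distribution keys, the CN sends to at least one data node the data and the storage area identifier corresponding to each of the at least one data node, where each of the at least one data node corresponds to at least one of the plurality of distribution keys, the storage area identifier of each data node indicates the at least one of the plurality of distribution keys corresponding to that data node, and each data node uses its corresponding storage area identifier to store the data in its storage area.
Because the CN determines the plurality of distribution keys corresponding to the data table identifier of an acquired piece of data, and sends the data and the corresponding storage area identifiers to at least one data node according to those distribution keys, a piece of data can be stored by multiple distribution keys. This storage method increases the number of distribution keys corresponding to the data, which raises the probability that the keyword a user queries with is a distribution key and lowers the probability that the user performs query operations such as association on a non-distribution key. This in turn lowers the probability of data redistribution between nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.
Optionally, the CN sending, according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node includes:
The CN sends, according to the plurality of distribution keys, the data and the storage area identifier of a data node to one data node, where the storage area identifier is used by the data node to store the data in the common storage area of the data node.
In this way, the data does not have to be stored once for each of the plurality of distribution keys on one data node; storing the data in the common storage area of one data node saves data storage space and reduces network load.
Optionally, the CN sending, according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node includes:
The CN sends, according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of a plurality of data nodes to the plurality of data nodes, where each of the plurality of data nodes uses its corresponding storage area identifier to store the data in at least one storage area of that data node, the at least one storage area being the private storage area of each of the at least one distribution key corresponding to that data node.
In this way, when data is queried, the storage location of the data on the data node can be determined accurately from the distribution key being queried, which improves data query efficiency.
Optionally, the method further includes:
After a preset duration, the CN obtains at least one historical query record corresponding to the data table identifier, where each historical query record includes an association key, and the association key represents the keyword used to query data in the historical query record;
The CN determines, according to the at least one historical query record, at least one association key whose frequency of occurrence in the at least one historical query record is greater than a threshold;
The CN takes the determined at least one association key as at least one new distribution key corresponding to the data table identifier;
The CN updates the correspondence between the data table identifier and distribution keys according to the at least one new distribution key.
Because the keywords that users query data with are determined from the historical query records and then used as new distribution keys, the probability that the keyword a user queries with is a distribution key is raised further and the probability that the user performs query operations such as association on a non-distribution key is lowered. This in turn lowers the probability of data redistribution between nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.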
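An embodiment of the present invention provides a coordination node, including:
a determining unit, configured to determine, according to the data table identifier corresponding to an acquired piece of data, a plurality of distribution keys corresponding to the data table identifier;
a sending unit, configured to send, according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the plurality of distribution keys, the storage area identifier of each data node indicates the at least one of the plurality of distribution keys corresponding to that data node, and each data node uses its corresponding storage area identifier to store the data in its storage area.
Because the CN determines the plurality of distribution keys corresponding to the data table identifier of an acquired piece of data, and sends the data and the corresponding storage area identifiers to at least one data node according to those distribution keys, a piece of data can be stored by multiple distribution keys. This storage method increases the number of distribution keys corresponding to the data, which raises the probability that the keyword a user queries with is a distribution key and lowers the probability that the user performs query operations such as association on a non-distribution key. This in turn lowers the probability of data redistribution between nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.
Optionally, the sending unit is specifically configured to:
send, according to the plurality of distribution keys, the data and the storage area identifier of a data node to one data node, where the storage area identifier is used by the data node to store the data in the common storage area of the data node.
In this way, the data does not have to be stored once for each of the plurality of distribution keys on one data node; storing the data in the common storage area of one data node saves data storage space and reduces network load.
Optionally, the sending unit is specifically configured to:
send, according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of a plurality of data nodes to the plurality of data nodes, where each of the plurality of data nodes uses its corresponding storage area identifier to store the data in at least one storage area of that data node, the at least one storage area being the private storage area of each of the at least one distribution key corresponding to that data node.
In this way, when data is queried, the storage location of the data on the data node can be determined accurately from the distribution key being queried, which improves data query efficiency.
Optionally, a processing unit is further included, configured to:
obtain, after a preset duration, at least one historical query record corresponding to the data table identifier, where each historical query record includes an association key, and the association key represents the keyword used to query data in the historical query record;
determine, according to the at least one historical query record, at least one association key whose frequency of occurrence in the at least one historical query record is greater than a threshold;
take the determined at least one association key as at least one new distribution key corresponding to the data table identifier;
update the correspondence between the data table identifier and distribution keys according to the at least one new distribution key.
Because the keywords that users query data with are determined from the historical query records and then used as new distribution keys, the probability that the keyword a user queries with is a distribution key is raised further and the probability that the user performs query operations such as association on a non-distribution key is lowered. This in turn lowers the probability of data redistribution between nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.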
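An embodiment of the present invention provides a coordination node, including:
a memory, configured to store the correspondence between the data table identifier and the plurality of distribution keys, and to provide the processor with that correspondence;
a processor, configured to determine, through the memory and according to the data table identifier corresponding to an acquired piece of data, the plurality of distribution keys corresponding to the data table identifier;
and configured to send, through the transceiver and according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the plurality of distribution keys, the storage area identifier of each data node indicates the at least one of the plurality of distribution keys corresponding to that data node, and each data node uses its corresponding storage area identifier to store the data in its storage area;
a transceiver, configured to send to the at least one data node the data and the storage area identifier corresponding to each of the at least one data node.
Because the CN determines the plurality of distribution keys corresponding to the data table identifier of an acquired piece of data, and sends the data and the corresponding storage area identifiers to at least one data node according to those distribution keys, a piece of data can be stored by multiple distribution keys. This storage method increases the number of distribution keys corresponding to the data, which raises the probability that the keyword a user queries with is a distribution key and lowers the probability that the user performs query operations such as association on a non-distribution key. This in turn lowers the probability of data redistribution between nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.
Optionally, the processor is specifically configured to:
send, through the transceiver and according to the plurality of distribution keys, the data and the storage area identifier of a data node to one data node, where the storage area identifier is used by the data node to store the data in the common storage area of the data node;
correspondingly, the transceiver is specifically configured to send the data and the storage area identifier of the data node to one data node.
In this way, the data does not have to be stored once for each of the plurality of distribution keys on one data node; storing the data in the common storage area of one data node saves data storage space and reduces network load.
Optionally, the processor is specifically configured to:
send, through the transceiver and according to the plurality of distribution keys, the data and the storage area identifier corresponding to each of a plurality of data nodes to the plurality of data nodes, where each of the plurality of data nodes uses its corresponding storage area identifier to store the data in at least one storage area of that data node, the at least one storage area being the private storage area of each of the at least one distribution key corresponding to that data node;
correspondingly, the transceiver is specifically configured to send to the plurality of data nodes the data and the storage area identifier corresponding to each of the plurality of data nodes.
In this way, when data is queried, the storage location of the data on the data node can be determined accurately from the distribution key being queried, which improves data query efficiency.
Optionally, the processor is further configured to:
obtain, after a preset duration, at least one historical query record corresponding to the data table identifier, where each historical query record includes an association key, and the association key represents the keyword used to query data in the historical query record;
determine, according to the at least one historical query record, at least one association key whose frequency of occurrence in the at least one historical query record is greater than a threshold;
take the determined at least one association key as at least one new distribution key corresponding to the data table identifier;
update the correspondence, stored in the memory, between the data table identifier and distribution keys according to the at least one new distribution key.
Because the keywords that users query data with are determined from the historical query records and then used as new distribution keys, the probability that the keyword a user queries with is a distribution key is raised further and the probability that the user performs query operations such as association on a non-distribution key is lowered. This in turn lowers the probability of data redistribution between nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.
In the embodiments of the present invention, the CN determines, according to the data table identifier corresponding to an acquired piece of data, the plurality of distribution keys corresponding to the data table identifier; according to the plurality of distribution keys, the CN sends to at least one data node the data and the storage area identifier corresponding to each of the at least one data node, where each of the at least one data node corresponds to at least one distribution key, the storage area identifier of each data node indicates the at least one distribution key corresponding to that data node, and each data node uses its corresponding storage area identifier to store the data in its storage area. Because the CN determines the plurality of distribution keys corresponding to the data table identifier of an acquired piece of data, and sends the data and the corresponding storage area identifiers to at least one data node according to those distribution keys, a piece of data can be stored by multiple distribution keys. This storage method increases the number of distribution keys corresponding to the data, which raises the probability that the keyword a user queries with is a distribution key and lowers the probability that the user performs query operations such as association on a non-distribution key. This in turn lowers the probability of data redistribution between nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency, matching the current demand for efficient queries.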
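Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic architecture diagram of a distributed data management system to which an embodiment of the present invention applies;
FIG. 2 is a schematic flowchart of a data management method implemented on the coordinator node side according to an embodiment of the present invention;
FIG. 2a is a schematic structural diagram of data storage inside a data node to which an embodiment of the present invention applies;
FIG. 2b is a schematic diagram of the query method used when data on the data nodes is queried in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a coordinator node according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of another coordinator node according to an embodiment of the present invention.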
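Detailed Description
To make the objectives, technical solutions, and beneficial effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
An identifier in the embodiments of the present invention identifies an object, such as a data table or a storage area; examples are the data table identifier and the storage area identifier used herein. An identifier may include at least one of a name, a number, or an ID (identification), as long as it distinguishes the identified object from other objects.
In the embodiments of the present invention, a data table is a table made up of a group of data that corresponds to the same group of distribution keys. All data corresponding to one data table identifier corresponds to the same group of distribution keys, and that group includes multiple distribution keys.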
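FIG. 1 shows an example architecture of a distributed data management system to which an embodiment of the present invention applies. As shown in FIG. 1, the system includes at least one coordinator node 102 and at least one data node 101. Any coordinator node 102 can communicate with any data node 101, and any two data nodes 101 can also communicate with each other. A user enters data through a network device 103; the coordinator node 102 interacts with the network device 103 and obtains the data and other information from it. The coordinator node takes one or more columns of the one or more pieces of data as the distribution key of the data table, hashes each piece of data on that distribution key to obtain a hash value for it, and stores each piece of data on a data node according to its hash value, so that the volume of stored data is balanced across the data nodes.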
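The hash-based placement described above can be illustrated with a minimal sketch. The function name `node_for`, the use of CRC32 as the hash, and the modulo mapping from hash value to node are assumptions made for illustration; the source does not specify a hash function or how a hash value is mapped to a data node.

```python
import zlib


def node_for(row: dict, distribution_key: tuple, num_nodes: int) -> int:
    """Pick a data node index by hashing the distribution-key columns of a row."""
    key_values = tuple(row[column] for column in distribution_key)
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(repr(key_values).encode("utf-8")) % num_nodes


# Hypothetical sample data: 1000 rows spread over 3 data nodes by the "id" column.
rows = [{"id": i, "class": f"class-{i % 5}"} for i in range(1000)]
rows_per_node = [0, 0, 0]
for row in rows:
    rows_per_node[node_for(row, ("id",), 3)] += 1
```

Rows with equal values in the distribution-key columns always land on the same node, which is what later allows a join on that key to run locally.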
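In the prior art, data is stored according to a single distribution key. For example, the distribution key of table A is aa, the distribution key of table B is aa, and the distribution key of table C is bb. When tables A and B are joined with the equi-join A.aa=B.aa, the join key of both tables is aa, that is, both tables are joined on their distribution key aa, so the join can be performed within a single node. When tables B and C are joined with the equi-join B.bb=C.aa, the join key of table B is bb and the join key of table C is aa. Because the distribution key of each of these tables differs from its join key, the data nodes redistribute table C on join key aa and table B on join key bb so that the join can then be performed within a single node.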
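The table A, B, and C example above can be restated as a small check. This is only a sketch of the rule as the example presents it (a table must be redistributed when its join key differs from its single distribution key); the function name `tables_to_redistribute` and the dictionary layout are hypothetical.

```python
def tables_to_redistribute(distribution_key: dict, join_key: dict) -> list:
    """Return the tables whose join key differs from their distribution key.

    distribution_key maps table name -> its single distribution key (prior art).
    join_key maps table name -> the column it is joined on in one equi-join.
    """
    return sorted(
        table for table, key in join_key.items() if distribution_key[table] != key
    )


distribution_key = {"A": "aa", "B": "aa", "C": "bb"}

# A.aa = B.aa: both tables are joined on their distribution key.
local_join = tables_to_redistribute(distribution_key, {"A": "aa", "B": "aa"})

# B.bb = C.aa: neither table is joined on its distribution key.
remote_join = tables_to_redistribute(distribution_key, {"B": "bb", "C": "aa"})
```

An empty result means the join can run inside each node; a non-empty result lists the tables that would have to be moved across the network first.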
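As the foregoing shows, because the prior art stores data according to only one distribution key, the probability that the keyword a user queries with is the distribution key of the data is small, and the probability that the user performs joins or similar query operations on a non-distribution key is high. The probability of data redistribution among the nodes is therefore also high. Storing data by a single distribution key thus leads to high network overhead, high resource consumption, and low query efficiency when the data is later queried.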
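Based on the foregoing description and the system architecture shown in FIG. 1, FIG. 2 shows a schematic flowchart of a data storage method implemented on the CN side according to an embodiment of the present invention. The method includes the following steps:
Step 201: The coordinator node (CN) determines, according to the data table identifier corresponding to an obtained piece of data, the multiple distribution keys corresponding to the data table identifier.
Step 202: The CN sends, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the multiple distribution keys, the storage area identifier of each data node represents the at least one distribution key, among the multiple distribution keys, that corresponds to that data node, and each data node uses its storage area identifier to store the data in a storage area of that data node.
Because the CN determines, according to the data table identifier corresponding to an obtained piece of data, the multiple distribution keys corresponding to that data table identifier, and sends, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node, one piece of data can be stored according to multiple distribution keys. This storage method increases the number of distribution keys associated with the data, which raises the probability that the keyword a user queries with is a distribution key and lowers the probability that the user performs joins or similar query operations on a non-distribution key. This in turn lowers the probability of data redistribution among the nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.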
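Each of the at least one data node corresponds to at least one of the multiple distribution keys. For example, suppose there are a first, second, third, and fourth data node. A first distribution key corresponds to the first data node and the second data node, that is, data stored according to the first distribution key may be stored on the first data node or the second data node. A second distribution key may correspond to the second data node and the third data node, that is, data stored according to the second distribution key may be stored on the second data node or the third data node.
The storage area identifier of each of the at least one data node represents the at least one distribution key, among the multiple distribution keys, that corresponds to that data node. In one optional implementation, the storage area identifier of each data node is itself the at least one distribution key corresponding to that data node. In another optional implementation, the storage area identifier of each data node is the identifier of the storage area that corresponds to the at least one distribution key of that data node. For example, if the distribution key is "class", the storage of the data node can be divided into at least one area, one of which is the storage area corresponding to the distribution key "class". The identifier of that storage area may be "001" or any other identifier that distinguishes the storage area, or it may be the distribution key "class" itself.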
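Before step 201, a user creates a data table with a create table statement in Structured Query Language (SQL). The data table corresponds to one data table identifier. The user then specifies multiple distribution keys for the created data table, and the information about the distribution keys corresponding to the data table identifier is stored in the CN, or in another device connected to the CN from which the CN can obtain it. Optionally, a correspondence between data table identifiers and distribution keys is preset in the CN. In one optional implementation, the CN stores the correspondence in advance according to the SQL statement. One data table identifier may correspond to one or more distribution keys, and the multiple distribution keys corresponding to one data table identifier differ from one another. Any distribution key may be any one column, or a combination of any several columns, among the columns of the data corresponding to the data table identifier. The embodiments of the present invention impose no limitation on this. The following description uses a single piece of data as an example.
In a specific implementation, a user enters one or more pieces of data through an interface that the system provides, and each piece of data includes multiple columns. After the data is entered, the CN obtains it in real time or periodically. Data may also be inserted with an insert statement, or imported in batches with a copy command or other tools. Whether the data is entered, inserted, or imported, the CN processes the obtained data piece by piece and can obtain the identifier of the data table in which each piece should be stored, that is, the CN can determine the data table identifier corresponding to a piece of data. After obtaining a piece of data, the CN determines its data table identifier and the multiple distribution keys corresponding to that identifier, and stores the data according to those distribution keys.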
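In step 202, for each of the multiple distribution keys, the data node on which the data should be stored according to that key is determined. Optionally, for each distribution key corresponding to the data table identifier of the data, the data is hashed on that distribution key to obtain a hash value, and the data node on which the data should be stored is determined from the hash value. There are multiple possible implementations; the embodiments of the present invention provide the following optional ones.
Method 1: If the data nodes determined according to the multiple distribution keys are one and the same data node, the CN sends, according to the multiple distribution keys, the data and the storage area identifier of the data node to that one data node, and the data node uses the storage area identifier to store the data in its common storage area.
Optionally, if multiple data nodes are determined according to the multiple distribution keys, the CN sends, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of the multiple data nodes to the multiple data nodes, where each of the multiple data nodes uses its storage area identifier to store the data in at least one storage area of that data node, and the at least one storage area is the private storage area of each of the at least one distribution key corresponding to that data node.
Method 2: For each of the multiple distribution keys, the CN determines, according to that distribution key, the data node on which the data should be stored and the storage area identifier of the data on that data node, and sends the data and that storage area identifier to the data node. The storage area identifier is the identifier of the private storage area corresponding to the distribution key, and the data node stores the data in that private storage area according to the identifier. If, after all distribution keys have been processed in turn, the data has been stored on the same node for every key, the data is stored in the common storage area corresponding to all distribution keys on that data node, and the copies of the data in the other storage areas on that data node are deleted.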
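FIG. 2a shows an example structure for storing data inside a data node, applicable to an embodiment of the present invention. As shown in FIG. 2a, one coordinator node 2101 is connected to three data nodes: data node 1, data node 2, and data node 3. The data storage structure is similar in each of the three data nodes, so data node 1 is used as an example. When the data table identifier stored on data node 1 corresponds to N distribution keys, namely distribution key 1, distribution key 2, …, distribution key N, where N is an integer greater than 1, data node 1 contains (N+1) storage areas: private storage area 2102 for distribution key 1, private storage area 2103 for distribution key 2, …, private storage area 2104 for distribution key N, and one common storage area 2105 corresponding to all distribution keys.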
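As an example of Method 1, after obtaining a piece of data, the CN obtains the data table identifier corresponding to the data and determines the multiple distribution keys corresponding to that identifier from the preset correspondence between data table identifiers and distribution keys. For each of the multiple distribution keys, the CN determines the data node on which the data should be stored.
If the data nodes determined according to the multiple distribution keys are one and the same data node, the CN sends the data and the storage area identifier of the data node to that data node. In this case the storage area identifier is the identifier of the common storage area on the data node that corresponds to all of the multiple distribution keys. For example, the data is hashed on distribution key 1, and the resulting hash value indicates that the data should be stored on data node 1; the data is hashed on distribution key 2, and the resulting hash value also indicates data node 1; and so on for each of the N distribution keys, with every hash indicating that the data should be stored on data node 1. The CN then sends the data and the identifier of the common storage area of data node 1 to data node 1, so that data node 1 stores the data in its common storage area.
If multiple data nodes are determined according to the multiple distribution keys, the CN sends the data and the storage area identifier corresponding to each of the multiple data nodes to the multiple data nodes. For example, when N is 4, hashing the data on distribution key 1 indicates data node 1, hashing on distribution key 2 indicates data node 2, hashing on distribution key 3 indicates data node 3, and hashing on distribution key 4 indicates data node 1. The CN then sends to data node 1 the data together with the identifiers of the private storage areas for distribution key 1 and distribution key 4, so that data node 1 stores the data in its private storage area for distribution key 1 and also in its private storage area for distribution key 4. The CN sends to data node 2 the data and the identifier of the private storage area for distribution key 2, so that data node 2 stores the data in its private storage area for distribution key 2. The CN sends to data node 3 the data and the identifier of the private storage area for distribution key 3, so that data node 3 stores the data in its private storage area for distribution key 3.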
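The Method 1 decision on the CN side can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the names `plan_writes`, `COMMON_AREA`, and the `"private:<key>"` identifier format are hypothetical, nodes are numbered from 0, and a plain modulo stands in for the hash so that the example is predictable.

```python
from collections import defaultdict
import zlib

COMMON_AREA = "common"  # hypothetical identifier of the common storage area


def crc_node_of(value, num_nodes):
    """Default placement: CRC32 of the key value, modulo the node count."""
    return zlib.crc32(repr(value).encode("utf-8")) % num_nodes


def plan_writes(row, distribution_keys, num_nodes, node_of=crc_node_of):
    """Return {node index: [storage area identifiers]} for one row (Method 1)."""
    targets = defaultdict(list)
    for key in distribution_keys:
        targets[node_of(row[key], num_nodes)].append(f"private:{key}")
    if len(targets) == 1:
        # Every distribution key selects the same node: one copy in the common area.
        (node,) = targets
        return {node: [COMMON_AREA]}
    return dict(targets)


def modulo_node_of(value, num_nodes):
    """Stand-in for the hash, used only to make the example predictable."""
    return value % num_nodes


keys = ["k1", "k2", "k3", "k4"]
# Mirrors the N = 4 example: k1 and k4 pick one node, k2 and k3 pick two others.
spread = plan_writes({"k1": 0, "k2": 1, "k3": 2, "k4": 3}, keys, 3, modulo_node_of)
# All four keys pick the same node, so the row goes to its common storage area.
same_node = plan_writes({"k1": 3, "k2": 6, "k3": 9, "k4": 0}, keys, 3, modulo_node_of)
```

Grouping the area identifiers per node means the row is sent once to each target node, even when two keys select the same node.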
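Method 2 is illustrated with N equal to 4. For example, the data is hashed on distribution key 1, and the resulting hash value indicates that the data should be stored on data node 1. The CN sends the data and the identifier of the private storage area for distribution key 1 to data node 1, so that data node 1 stores the data in its private storage area for distribution key 1. Distribution key 2 also indicates data node 1, so the CN sends the data and the identifier of the private storage area for distribution key 2 to data node 1, and data node 1 stores the data in its private storage area for distribution key 2. Distribution key 3 also indicates data node 1, so the CN sends the data and the identifier of the private storage area for distribution key 3 to data node 1, and data node 1 stores the data in its private storage area for distribution key 3. Distribution key 4 also indicates data node 1, so the CN sends the data and the identifier of the private storage area for distribution key 4 to data node 1, and data node 1 stores the data in its private storage area for distribution key 4. At this point it is determined that all four distribution keys indicate the same data node, data node 1. The data is therefore stored in the common storage area of data node 1, and the copies of the data in the private storage areas for distribution key 1, distribution key 2, distribution key 3, and distribution key 4 on that data node are deleted.
In another example, the data is hashed on distribution key 1, and the resulting hash value indicates that the data should be stored on data node 1. The CN sends the data and the identifier of the private storage area for distribution key 1 to data node 1, so that data node 1 stores the data in its private storage area for distribution key 1. Distribution key 2 indicates data node 2, so the CN sends the data and the identifier of the private storage area for distribution key 2 to data node 2, and data node 2 stores the data in its private storage area for distribution key 2. Distribution key 3 indicates data node 3, so the CN sends the data and the identifier of the private storage area for distribution key 3 to data node 3, and data node 3 stores the data in its private storage area for distribution key 3. Distribution key 4 indicates data node 1, so the CN sends the data and the identifier of the private storage area for distribution key 4 to data node 1, and data node 1 stores the data in its private storage area for distribution key 4. In this case the four distribution keys indicate multiple data nodes, namely data node 1, data node 2, and data node 3.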
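The Method 2 behavior on a single data node, together with the (N+1)-area layout of FIG. 2a, can be sketched as below. The class `DataNode` and its members are hypothetical, row identifiers stand in for full rows, and the sketch assumes the data node itself detects that every distribution key has placed the row locally; the source does not state which component makes that determination.

```python
class DataNode:
    """One data node with a private area per distribution key plus a common area."""

    def __init__(self, distribution_keys):
        self.private = {key: set() for key in distribution_keys}  # N private areas
        self.common = set()                                       # 1 common area

    def store(self, row_id, key):
        """Store a row under one distribution key, consolidating when possible."""
        self.private[key].add(row_id)
        if all(row_id in area for area in self.private.values()):
            # Every distribution key placed the row on this node: keep a single
            # copy in the common area and delete the per-key copies.
            for area in self.private.values():
                area.discard(row_id)
            self.common.add(row_id)


keys = ["key1", "key2", "key3", "key4"]
node1 = DataNode(keys)
for key in keys:          # first example: all four keys select data node 1
    node1.store(7, key)
node1.store(8, "key1")    # second example: only key 1 and key 4 select data node 1
node1.store(8, "key4")
```

After these calls, row 7 exists once in the common area, while row 8 remains in the private areas for key 1 and key 4 only.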
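Optionally, when the distribution keys for a data table identifier are initially set, one distribution key or multiple distribution keys may be set for it; a person skilled in the art may choose based on experience. Optionally, after a preset duration, for example several weeks or several months, the CN obtains at least one history query record corresponding to the data table identifier, where each history query record includes a join key, and the join key is the keyword used to query data in that history query record. The CN determines, from the at least one history query record, at least one join key whose frequency of occurrence in the records is greater than a threshold, takes the determined at least one join key as at least one new distribution key corresponding to the data table identifier, and updates the correspondence between the data table identifier and the distribution keys according to the at least one new distribution key. Specifically, a history query record includes an equi-join condition, and the equi-join condition includes the join key.
For example, distribution key A is set for a data table identifier, and after the preset duration five history query records are obtained; a history query record may be an SQL statement. The equi-join conditions in the history query records are then searched to determine the join keys, and the join keys that appear are taken as new distribution keys. Optionally, if the frequency of occurrence of join key B is determined to be greater than the threshold, join key B is set as a new distribution key and the preset correspondence between the data table identifier and the distribution keys is updated; after the update, the data table identifier corresponds to distribution key B. Optionally, the CN redistributes the data on the data nodes by the updated distribution key B, either on receiving a command issued by the user or automatically. In another possible implementation, the data table identifier originally corresponds to distribution key A, and analysis of the history query records shows that the frequencies of occurrence of join key A, join key B, and join key C are all greater than the threshold. Join keys A, B, and C are then all set as new distribution keys, and after the update the data table identifier corresponds to distribution key A, distribution key B, and distribution key C. The CN redistributes the data on the data nodes by the updated distribution keys B and C. Specifically, the CN sends a redistribution command to each data node; each node uses the new distribution keys B and C to determine which of the data it stores belongs to itself and which belongs to other data nodes, and sends the latter to those data nodes. For example, if data node 1 determines that a piece of data it stores should be stored on data node 4, data node 1 sends the data to data node 4, and data node 4 receives and stores it.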
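An SQL statement that distributes the data table TB_SVC_SUBS_HIST by distribution key A, distribution key B, and distribution key C may be:
create table TB_SVC_SUBS_HIST(…) distribute by (distribution key A), (distribution key B), (distribution key C)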
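After data has been managed by the foregoing method, a user who wants to query logs in to a client, connects to the database, and issues a data query command. The global query optimizer of the CN receives the command, parses it, generates a data query plan, and distributes the plan to every data node. Each data node performs the data query after receiving the plan.
Any data node performs a hash operation for the received data query plan and determines from the result where the data is stored. For example, data node 1 may determine that the data is in the private storage area for distribution key 1, in the private storage area for distribution key 4, or in the common storage area corresponding to all distribution keys of the data node. The data node then scans the determined location to retrieve the data.
In the embodiments of the present invention, because the data table identifier of each piece of data can correspond to multiple distribution keys, the join key in a query plan is more likely to be a distribution key, and the data to be joined is more likely to reside on the same data node. If two pieces of data are determined to be stored on the same data node, the join is performed within each data node that stores the data; each data node sends its join result to the CN, and the CN merges the results and returns the queried data to the client for presentation to the user. If the two pieces of data are determined to be stored on different data nodes, data is transferred between the data nodes so that the two pieces of data reside on the same data node.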
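FIG. 2b shows an example of the query method used when data on the data nodes is queried in an embodiment of the present invention. As shown in FIG. 2b, data 1 is stored on data node 1 and data node 2 by distribution key A and distribution key B, data 2 is distributed by distribution key A, and data 3 is distributed by distribution key B. Suppose the user needs to join data 1 and data 2 on distribution key A and to join data 1 and data 3 on distribution key B. Taking data node 1 as an example, data node 1 queries its private storage area for distribution key A and its common storage area, and joins data 1 and data 2 within data node 1. Likewise, data node 1 queries its private storage area for distribution key B and its common storage area, and joins data 1 and data 3 within data node 1. Data node 2 and data node 3 perform a similar procedure, which is not repeated here.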
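The local lookup in FIG. 2b can be sketched as follows: for a join on a given distribution key, a data node scans that key's private storage area plus the common storage area, then joins locally. The dictionary layout of the node, the helper names `scan` and `local_join_count`, and the sample rows are hypothetical, and the sketch simply counts matching pairs, in the spirit of a `count(*)` join.

```python
from collections import Counter


def scan(node, table, key):
    """Rows of `table` that a node must read for a join on `key`:
    the private storage area of that key plus the common storage area."""
    candidates = node["private"].get(key, []) + node["common"]
    return [row for row in candidates if row["table"] == table]


def local_join_count(node, left, right, key):
    """Count matching row pairs of an equi-join on `key`, inside one node."""
    right_values = Counter(row[key] for row in scan(node, right, key))
    return sum(right_values[row[key]] for row in scan(node, left, key))


# Hypothetical contents of data node 1.
data_node_1 = {
    "private": {
        "A": [
            {"table": "data1", "A": 1, "B": 10},
            {"table": "data2", "A": 1},
            {"table": "data2", "A": 2},
        ],
        "B": [
            {"table": "data1", "A": 5, "B": 30},
            {"table": "data3", "B": 30},
            {"table": "data3", "B": 20},
        ],
    },
    # A data 1 row that both key A and key B placed on this node.
    "common": [{"table": "data1", "A": 2, "B": 20}],
}

pairs_on_a = local_join_count(data_node_1, "data1", "data2", "A")
pairs_on_b = local_join_count(data_node_1, "data1", "data3", "B")
```

The row in the common area takes part in both joins, which is why the common storage area is scanned regardless of the join key.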
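The SQL statement that joins data 1 and data 2 on distribution key A is as follows, where data 1 is TB_SVC_SUBS_HIST, data 2 is DIL_BILL, and distribution key A is SUBS_ID:
Select count(*) from TB_SVC_SUBS_HIST JOIN DIL_BILL on TB_SVC_SUBS_HIST.SUBS_ID = DIL_BILL.SUBS_ID
The SQL statement that joins data 1 and data 3 on distribution key B is as follows, where data 1 is TB_SVC_SUBS_HIST, data 3 is GPRS_CDR, and distribution key B is MSISDN:
Select count(*) from TB_SVC_SUBS_HIST JOIN GPRS_CDR on TB_SVC_SUBS_HIST.MSISDN = GPRS_CDR.MSISDN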
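As can be seen from the foregoing, in the embodiments of the present invention, the CN determines, according to the data table identifier corresponding to an obtained piece of data, the multiple distribution keys corresponding to the data table identifier. The CN then sends, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one distribution key, the storage area identifier of each data node represents the at least one distribution key corresponding to that data node, and each data node uses its storage area identifier to store the data in a storage area of that data node. Because the CN determines the multiple distribution keys from the data table identifier and sends the data and the storage area identifiers according to those keys, one piece of data can be stored according to multiple distribution keys. This storage method increases the number of distribution keys associated with the data, which raises the probability that the keyword a user queries with is a distribution key and lowers the probability that the user performs joins or similar query operations on a non-distribution key. This in turn lowers the probability of data redistribution among the nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.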
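FIG. 3 shows an example structure of a coordinator node according to an embodiment of the present invention.
Based on the same concept, an embodiment of the present invention provides a coordinator node 300 configured to perform the foregoing method. As shown in FIG. 3, it includes a determining unit 301 and a sending unit 302, and optionally further includes a processing unit 303:
the determining unit, configured to determine, according to the data table identifier corresponding to an obtained piece of data, the multiple distribution keys corresponding to the data table identifier;
the sending unit, configured to send, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the multiple distribution keys, the storage area identifier of each data node represents the at least one distribution key, among the multiple distribution keys, that corresponds to that data node, and each data node uses its storage area identifier to store the data in a storage area of that data node.
Optionally, the sending unit is specifically configured to:
send, according to the multiple distribution keys, the data and the storage area identifier of one data node to that data node, where the data node uses the storage area identifier to store the data in the common storage area of the data node.
Optionally, the sending unit is specifically configured to:
send, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of multiple data nodes to the multiple data nodes, where each of the multiple data nodes uses its storage area identifier to store the data in at least one storage area of that data node, and the at least one storage area is the private storage area of each of the at least one distribution key corresponding to that data node.
Optionally, the coordinator node further includes the processing unit, configured to:
obtain, after a preset duration, at least one history query record corresponding to the data table identifier, where each history query record includes a join key, and the join key is the keyword used to query data in that history query record;
determine, from the at least one history query record, at least one join key whose frequency of occurrence in the at least one history query record is greater than a threshold;
take the determined at least one join key as at least one new distribution key corresponding to the data table identifier;
update the correspondence between the data table identifier and the distribution keys according to the at least one new distribution key.
As can be seen from the foregoing, in the embodiments of the present invention, the CN determines, according to the data table identifier corresponding to an obtained piece of data, the multiple distribution keys corresponding to the data table identifier. The CN then sends, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one distribution key, the storage area identifier of each data node represents the at least one distribution key corresponding to that data node, and each data node uses its storage area identifier to store the data in a storage area of that data node. Because the CN determines the multiple distribution keys from the data table identifier and sends the data and the storage area identifiers according to those keys, one piece of data can be stored according to multiple distribution keys. This storage method increases the number of distribution keys associated with the data, which raises the probability that the keyword a user queries with is a distribution key and lowers the probability that the user performs joins or similar query operations on a non-distribution key. This in turn lowers the probability of data redistribution among the nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.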
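FIG. 4 shows an example structure of another coordinator node according to an embodiment of the present invention.
Based on the same concept, an embodiment of the present invention provides a coordinator node 400 configured to perform the foregoing method. As shown in FIG. 4, it includes a processor 401, a transceiver 403, and a memory 402:
the memory, configured to store the correspondence between a data table identifier and multiple distribution keys, instructions, and data that the processor uses while executing the instructions, and to provide the correspondence and the instructions to the processor;
the processor, configured to read the instructions in the memory and perform the following:
determine, through the memory and according to the data table identifier corresponding to an obtained piece of data, the multiple distribution keys corresponding to the data table identifier;
and send, through the transceiver and according to the multiple distribution keys, the data and the storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the multiple distribution keys, the storage area identifier of each data node represents the at least one distribution key, among the multiple distribution keys, that corresponds to that data node, and each data node uses its storage area identifier to store the data in a storage area of that data node;
the transceiver, configured to send the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node. Optionally, the transceiver is further configured to send signaling and data to another coordinator node, receive signaling and data sent by another coordinator node, send signaling to data nodes, receive data sent by data nodes, and so on.
Optionally, the processor is specifically configured to:
send, through the transceiver and according to the multiple distribution keys, the data and the storage area identifier of one data node to that data node, where the data node uses the storage area identifier to store the data in the common storage area of the data node;
correspondingly, the transceiver is specifically configured to send the data and the storage area identifier of the data node to the one data node.
Optionally, the processor is specifically configured to:
send, through the transceiver and according to the multiple distribution keys, the data and the storage area identifier corresponding to each of multiple data nodes to the multiple data nodes, where each of the multiple data nodes uses its storage area identifier to store the data in at least one storage area of that data node, and the at least one storage area is the private storage area of each of the at least one distribution key corresponding to that data node;
correspondingly, the transceiver is specifically configured to send the data and the storage area identifier corresponding to each of the multiple data nodes to the multiple data nodes.
Optionally, the processor is further configured to:
obtain, after a preset duration, at least one history query record corresponding to the data table identifier, where each history query record includes a join key, and the join key is the keyword used to query data in that history query record;
determine, from the at least one history query record, at least one join key whose frequency of occurrence in the at least one history query record is greater than a threshold;
take the determined at least one join key as at least one new distribution key corresponding to the data table identifier;
update, according to the at least one new distribution key, the correspondence between the data table identifier and the distribution keys stored in the memory.
The bus architecture may include any number of interconnected buses and bridges, which link together various circuits of one or more processors, represented by the processor, and of memory, represented by the memory. The bus architecture may also link various other circuits, such as peripherals, voltage regulators, and power management circuits; these are well known in the art and are not described further here. A bus interface provides the interface. The transceiver may be multiple elements, that is, it may include a transmitter and a receiver, and provides a unit for communicating with various other apparatuses over a transmission medium. The processor is responsible for managing the bus architecture and general processing, and the memory may store data that the processor uses when performing operations.
As can be seen from the foregoing, in the embodiments of the present invention, the CN determines, according to the data table identifier corresponding to an obtained piece of data, the multiple distribution keys corresponding to the data table identifier. The CN then sends, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one distribution key, the storage area identifier of each data node represents the at least one distribution key corresponding to that data node, and each data node uses its storage area identifier to store the data in a storage area of that data node. Because the CN determines the multiple distribution keys from the data table identifier and sends the data and the storage area identifiers according to those keys, one piece of data can be stored according to multiple distribution keys. This storage method increases the number of distribution keys associated with the data, which raises the probability that the keyword a user queries with is a distribution key and lowers the probability that the user performs joins or similar query operations on a non-distribution key. This in turn lowers the probability of data redistribution among the nodes, reduces overhead and the consumption of resources such as network and content, and improves data query efficiency.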
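A person skilled in the art should understand that the embodiments of the present invention may be provided as a method or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the present invention. It should be understood that computer program instructions can implement each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in them. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, a person skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Evidently, a person skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is intended to cover them as well.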

Claims (12)

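  1. A data storage method, characterized by including the following steps:
    a coordinator node (CN) determines, according to a data table identifier corresponding to an obtained piece of data, multiple distribution keys corresponding to the data table identifier;
    the CN sends, according to the multiple distribution keys, the data and a storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the multiple distribution keys, the storage area identifier of each of the at least one data node represents the at least one distribution key, among the multiple distribution keys, that corresponds to that data node, and the storage area identifier corresponding to each of the at least one data node is used by that data node to store the data in a storage area of that data node.
  2. The method according to claim 1, characterized in that the CN sending, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node includes:
    the CN sends, according to the multiple distribution keys, the data and the storage area identifier of one data node to the data node, where the storage area identifier is used by the data node to store the data in a common storage area of the data node.
  3. The method according to claim 1, characterized in that the CN sending, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node includes:
    the CN sends, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of multiple data nodes to the multiple data nodes, where the storage area identifier corresponding to each of the multiple data nodes is used by that data node to store the data in at least one storage area of that data node, and the at least one storage area is the private storage area of each of the at least one distribution key corresponding to that data node.
  4. The method according to any one of claims 1 to 3, characterized by further including:
    after a preset duration, the CN obtains at least one history query record corresponding to the data table identifier, where each history query record includes a join key, and the join key is the keyword used to query data in the history query record;
    the CN determines, from the at least one history query record, at least one join key whose frequency of occurrence in the at least one history query record is greater than a threshold;
    the CN takes the determined at least one join key as at least one new distribution key corresponding to the data table identifier;
    the CN updates the correspondence between the data table identifier and the distribution keys according to the at least one new distribution key.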
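  5. A coordinator node, characterized by including:
    a determining unit, configured to determine, according to a data table identifier corresponding to an obtained piece of data, multiple distribution keys corresponding to the data table identifier;
    a sending unit, configured to send, according to the multiple distribution keys, the data and a storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the multiple distribution keys, the storage area identifier of each of the at least one data node represents the at least one distribution key, among the multiple distribution keys, that corresponds to that data node, and the storage area identifier corresponding to each of the at least one data node is used by that data node to store the data in a storage area of that data node.
  6. The coordinator node according to claim 5, characterized in that the sending unit is specifically configured to:
    send, according to the multiple distribution keys, the data and the storage area identifier of one data node to the data node, where the storage area identifier is used by the data node to store the data in a common storage area of the data node.
  7. The coordinator node according to claim 5, characterized in that the sending unit is specifically configured to:
    send, according to the multiple distribution keys, the data and the storage area identifier corresponding to each of multiple data nodes to the multiple data nodes, where the storage area identifier corresponding to each of the multiple data nodes is used by that data node to store the data in at least one storage area of that data node, and the at least one storage area is the private storage area of each of the at least one distribution key corresponding to that data node.
  8. The coordinator node according to any one of claims 5 to 7, characterized by further including a processing unit, configured to:
    obtain, after a preset duration, at least one history query record corresponding to the data table identifier, where each history query record includes a join key, and the join key is the keyword used to query data in the history query record;
    determine, from the at least one history query record, at least one join key whose frequency of occurrence in the at least one history query record is greater than a threshold;
    take the determined at least one join key as at least one new distribution key corresponding to the data table identifier;
    update the correspondence between the data table identifier and the distribution keys according to the at least one new distribution key.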
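  9. A coordinator node, characterized by including:
    a memory, configured to store a correspondence between a data table identifier and multiple distribution keys, and to provide the correspondence between the data table identifier and the multiple distribution keys to a processor;
    the processor, configured to determine, through the memory and according to the data table identifier corresponding to an obtained piece of data, the multiple distribution keys corresponding to the data table identifier;
    and further configured to send, through a transceiver and according to the multiple distribution keys, the data and a storage area identifier corresponding to each of at least one data node to the at least one data node, where each of the at least one data node corresponds to at least one of the multiple distribution keys, the storage area identifier of each of the at least one data node represents the at least one distribution key, among the multiple distribution keys, that corresponds to that data node, and the storage area identifier corresponding to each of the at least one data node is used by that data node to store the data in a storage area of that data node;
    the transceiver, configured to send the data and the storage area identifier corresponding to each of the at least one data node to the at least one data node.
  10. The coordinator node according to claim 9, characterized in that the processor is specifically configured to:
    send, through the transceiver and according to the multiple distribution keys, the data and the storage area identifier of one data node to the data node, where the storage area identifier is used by the data node to store the data in a common storage area of the data node;
    correspondingly, the transceiver is specifically configured to send the data and the storage area identifier of the data node to the one data node.
  11. The coordinator node according to claim 9, characterized in that the processor is specifically configured to:
    send, through the transceiver and according to the multiple distribution keys, the data and the storage area identifier corresponding to each of multiple data nodes to the multiple data nodes, where the storage area identifier corresponding to each of the multiple data nodes is used by that data node to store the data in at least one storage area of that data node, and the at least one storage area is the private storage area of each of the at least one distribution key corresponding to that data node;
    correspondingly, the transceiver is specifically configured to send the data and the storage area identifier corresponding to each of the multiple data nodes to the multiple data nodes.
  12. The coordinator node according to any one of claims 9 to 11, characterized in that the processor is further configured to:
    obtain, after a preset duration, at least one history query record corresponding to the data table identifier, where each history query record includes a join key, and the join key is the keyword used to query data in the history query record;
    determine, from the at least one history query record, at least one join key whose frequency of occurrence in the at least one history query record is greater than a threshold;
    take the determined at least one join key as at least one new distribution key corresponding to the data table identifier;
    update, according to the at least one new distribution key, the correspondence between the data table identifier and the distribution keys stored in the memory.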
PCT/CN2016/105243 2015-11-27 2016-11-09 Data storage method and coordinator node WO2017088666A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP16867890.2A EP3373158B1 (en) 2015-11-27 2016-11-09 Data storage method and coordinator node
US15/989,315 US20180276252A1 (en) 2015-11-27 2018-05-25 Data Storage Method And Coordinator Node

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510867678.4A CN106815258B (zh) 2015-11-27 2015-11-27 Data storage method and coordinator node
CN201510867678.4 2015-11-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/989,315 Continuation US20180276252A1 (en) 2015-11-27 2018-05-25 Data Storage Method And Coordinator Node

Publications (1)

Publication Number Publication Date
WO2017088666A1 true WO2017088666A1 (zh) 2017-06-01

Family

ID=58763012

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/105243 WO2017088666A1 (zh) 2015-11-27 2016-11-09 Data storage method and coordinator node

Country Status (4)

Country Link
US (1) US20180276252A1 (zh)
EP (1) EP3373158B1 (zh)
CN (1) CN106815258B (zh)
WO (1) WO2017088666A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090100089A1 (en) * 2007-10-11 2009-04-16 Oracle International Corporation Reference partitioned tables
CN103383690A (zh) * 2012-05-04 2013-11-06 深圳市腾讯计算机系统有限公司 分布式数据存储方法及系统
CN103473334A (zh) * 2013-09-18 2013-12-25 浙江中控技术股份有限公司 数据存储、查询方法及系统
CN103577440A (zh) * 2012-07-27 2014-02-12 阿里巴巴集团控股有限公司 一种非关系型数据库中的数据处理方法和装置

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8504521B2 (en) * 2005-07-28 2013-08-06 Gopivotal, Inc. Distributed data management system
US8799272B2 (en) * 2007-07-20 2014-08-05 Hewlett-Packard Development Company, L.P. Data skew insensitive parallel join scheme
US9424287B2 (en) * 2008-12-16 2016-08-23 Hewlett Packard Enterprise Development Lp Continuous, automated database-table partitioning and database-schema evolution
JP5790270B2 (ja) * 2011-08-04 2015-10-07 富士通株式会社 構造解析システム,構造解析プログラムおよび構造解析方法
US9305074B2 (en) * 2013-06-19 2016-04-05 Microsoft Technology Licensing, Llc Skew-aware storage and query execution on distributed database systems
US9946750B2 (en) * 2013-12-01 2018-04-17 Actian Corporation Estimating statistics for generating execution plans for database queries
CN104809129B (zh) * 2014-01-26 2018-07-20 华为技术有限公司 一种分布式数据存储方法、装置和系统
US9465840B2 (en) * 2014-03-14 2016-10-11 International Business Machines Corporation Dynamically indentifying and preventing skewed partitions in a shared-nothing database
CN104462225B (zh) * 2014-11-12 2018-01-12 华为技术有限公司 一种数据读取的方法、装置及系统
US10303654B2 (en) * 2015-02-23 2019-05-28 Futurewei Technologies, Inc. Hybrid data distribution in a massively parallel processing architecture

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3373158A4 *

Also Published As

Publication number Publication date
EP3373158A1 (en) 2018-09-12
US20180276252A1 (en) 2018-09-27
CN106815258B (zh) 2020-01-17
CN106815258A (zh) 2017-06-09
EP3373158B1 (en) 2022-10-26
EP3373158A4 (en) 2018-10-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16867890

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2016867890

Country of ref document: EP