CN103748578B - The method of data distribution, apparatus and system - Google Patents


Info

Publication number: CN103748578B
Authority: CN (China)
Prior art keywords: distribution, data, data table, column, identification
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201280002465.XA
Other languages: Chinese (zh)
Other versions: CN103748578A (en)
Inventors: 吴向阳, 曹俊亮, 曹莉
Current Assignee: Huawei Cloud Computing Technologies Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Huawei Technologies Co Ltd
Application filed by: Huawei Technologies Co Ltd
Publication of CN103748578A
Application granted
Publication of CN103748578B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method, an apparatus, and a system for data distribution, which relate to the field of information technologies and are intended to save query time and improve query efficiency. The method includes: before data query, a control node sets a distribution table creation instruction according to a creation rule, where the distribution table creation instruction carries an identifier (ID) of a logical data table and a distribution column identifier of a selected distribution column, the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node; and the control node sends the distribution table creation instruction to a data node, so that the data node creates a distribution table of the logical data table according to the distribution table creation instruction. The invention is mainly applied to data distribution in a parallel database system.

Description

Data distribution method, device and system
Technical Field
The present invention relates to the field of information technologies, and in particular, to a method, an apparatus, and a system for data distribution.
Background
A parallel database system is a data storage technology that stores data content on a plurality of data nodes in a distributed manner, and can distribute a logical data table across the data nodes according to algorithms such as Hash, Range, and Round-robin. The parallel database system queries the data content required by the user on all data nodes in parallel; compared with a non-parallel database system, it offers higher query speed and easier management of the data content.
Generally, a logical data table contains a plurality of fields. The parallel database system uses the content of one (or more) of these fields as the argument of the above algorithms to store the logical data table on the data nodes in a distributed manner, and the field serving as the argument is referred to as a distribution column of the logical data table.
In the prior art, when a parallel database system performs a joint (Join) query on the distribution tables of multiple logical data tables, if the distribution columns of the distribution tables involved in the query differ, the multiple logical data tables need to be redistributed according to a distribution column shared by those distribution tables, which reduces query efficiency.
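The three distribution algorithms named above can be sketched as follows. This is a minimal illustration only: the function names, the node count of 3, and the range boundaries are assumptions introduced here, and the hash is simplified to a plain modulo of an integer key.

```python
from bisect import bisect_right

NUM_NODES = 3  # assumed cluster size for illustration


def hash_node(key: int, num_nodes: int = NUM_NODES) -> int:
    """Hash distribution, simplified here to the key modulo the node count."""
    return key % num_nodes


def range_node(key: int, upper_bounds: list) -> int:
    """Range distribution: node i holds keys below upper_bounds[i];
    keys at or above the last bound go to the final node."""
    return bisect_right(upper_bounds, key)


def round_robin_nodes(num_rows: int, num_nodes: int = NUM_NODES) -> list:
    """Round-robin distribution: rows are dealt to nodes in turn,
    independent of any field value."""
    return [i % num_nodes for i in range(num_rows)]
```

Only the hash and range variants take a field value as their argument, which is why the notion of a distribution column applies to them and not to round-robin placement.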
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, and a system for data distribution, which can save query time and improve query efficiency.
In one aspect, an embodiment of the present invention provides a data distribution method, including:
before data query, a control node sets a distribution table creation instruction according to a creation rule, wherein the distribution table creation instruction carries an identification ID of a logic data table and a distribution column identification of a selected distribution column, the selected distribution column is a distribution column in the logic data table, and the logic data table is a logic data table created in the control node;
and the control node sends the distribution table creation instruction to a data node so that the data node creates the distribution table of the logic data table according to the distribution table creation instruction.
On the other hand, an embodiment of the present invention further provides a control node, including:
a processing unit, configured to set a distribution table creation instruction according to a creation rule before data query, where the distribution table creation instruction carries an identifier ID of a logical data table and a distribution column identifier of a selected distribution column, where the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node;
and the sending unit is used for sending the distribution table creation instruction set by the processing unit to a data node so that the data node can create the distribution table of the logic data table according to the distribution table creation instruction.
Finally, an embodiment of the present invention further provides a data distribution system, including:
the control node is used for setting a distribution table creation instruction according to a creation rule before data query, and for sending the distribution table creation instruction to a data node, wherein the distribution table creation instruction carries an identification ID of a logical data table and a distribution column identification of a selected distribution column, the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node;
and the data node is used for receiving the distribution table creation instruction sent by the control node before data query, and creating the distribution table of the logic data table according to the distribution table creation instruction.
The method, apparatus and system for data distribution provided by the embodiments of the present invention can, before data query and according to the creation rule, distribute and store a plurality of logical data tables on the data nodes according to the same distribution column, and the distribution tables created on the data nodes after this distributed storage are used for subsequent data queries. This solves the problem that, during a distribution column joint query, a plurality of logical data tables must be redistributed and stored on the data nodes when the distribution columns of the distribution tables involved in the query differ. It avoids the query delay caused by migrating a large amount of data during the joint query, and thereby improves query efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of an architecture of a parallel database system;
FIG. 2 is a schematic diagram of a logical data table in a parallel database system;
FIG. 3 is a schematic diagram of a data node creating a distribution table;
FIG. 4 is a schematic diagram of another data node creating a distribution table;
FIG. 5 is a flow chart of a method of data distribution in an embodiment of the present invention;
FIG. 6 is a flow chart of another method of data distribution in an embodiment of the present invention;
FIG. 7 is a diagram illustrating an application scenario in an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a control node according to an embodiment of the present invention;
FIG. 9 is a system diagram illustrating data distribution according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows the architecture of a parallel database system, in which a control node is connected to three data nodes. The control node and each data node have an independent Central Processing Unit (CPU) and storage (internal memory, hard disk), and data communication between the nodes is realized through a high-speed network (e.g., Ethernet or a fiber-optic switching network). The control node is mainly used for: 1. distributing and storing the data to be managed on the data nodes according to the fields of a logical data table and a preset algorithm, where the logical data table is a data table with data structure attributes stored in the control node and is the logical basis on which the data nodes create distribution tables; 2. providing a query interface for a client, such as Structured Query Language (SQL), Java Database Connectivity (JDBC), Open Database Connectivity (ODBC), and the like; 3. processing the query results fed back by the data nodes according to the query request of the client. The data node is mainly used for: under the control of the control node, creating a distribution table of the logical data table according to the distribution column, performing data migration with other data nodes, and serving as an independent database node to store and query the distribution table.
Taking the storage of teaching data as an example: in fig. 2, there are three logical data tables. Logical data table A (hereinafter, table A) stores student numbers and names; logical data table B (hereinafter, table B) stores student numbers, course identifiers and scores; and logical data table C (hereinafter, table C) stores course identifiers and course names. Any column (also referred to as a field or attribute) in a logical data table can be used as a distribution column for the distributed storage of that table; for example, the student number, course identifier and score are the three fields of table B. The control node takes the student number field as the distribution column and distributes the data of table A to the three data nodes by a Hash modulo 3 algorithm. As shown in fig. 3, student number 1 divided by 3 leaves a remainder of 1, so student number 1 and the corresponding name (i.e., the first row of table A) are stored in data node 1; student number 2 divided by 3 leaves a remainder of 2, so the second row of table A is stored in data node 2; and student number 3 divided by 3 leaves a remainder of 0, so the third row of table A is stored in data node 0. The divisor 3 in the Hash modulo 3 algorithm is the number of data nodes in the parallel database system. Similarly, the control node takes the student number field and the course identifier field as the distribution columns of table B and table C respectively, and distributes their data to the three data nodes: table B is distributed by the Hash modulo 3 algorithm with the student number field as its distribution column (j), and table C is distributed by the Hash modulo 3 algorithm with the course identifier field as its distribution column (k).
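The placement of table A described above can be reproduced with a short sketch. The row contents (the name strings) and the helper name distribute are placeholders assumed for illustration; only the student numbers 1 to 3 and the modulo 3 rule come from the text.

```python
NUM_NODES = 3  # number of data nodes, which is also the divisor of the modulo

# Hypothetical rows of table A: (stu_id, stu_name); the names are placeholders.
table_a = [(1, "name_1"), (2, "name_2"), (3, "name_3")]


def distribute(rows, key_index, num_nodes=NUM_NODES):
    """Assign each row to node (distribution column value mod num_nodes)."""
    nodes = {n: [] for n in range(num_nodes)}
    for row in rows:
        nodes[row[key_index] % num_nodes].append(row)
    return nodes


# Distribute table A with the student number (index 0) as the distribution column.
placement = distribute(table_a, key_index=0)
```

With this data, the row for student number 1 lands on node 1, student number 2 on node 2, and student number 3 on node 0, matching the description of fig. 3.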
After the logical data table is distributed, the data table created on each data node is referred to as a distribution table of the logical data table. The name of a distribution table may be represented as logical data table identifier + data node identifier + distribution column identifier; for example, B + mdt1 + j in data node 1 indicates that data node 1 creates the distribution table of table B with the student number field (j) as the distribution column. Suppose the client executes the query statement select stu_name, course_id from A, B where A.stu_id = B.stu_id. This statement performs a distribution column joint query on table A and table B to query the name, course identifier and score data and the correspondence among them (select stu_name, course_id), where table A and table B are each stored on the data nodes with the student number field as the distribution column (A.stu_id = B.stu_id). As a result, data node 1 finds the three course identifiers and three scores corresponding to student number 1, data node 2 finds the three course identifiers and three scores corresponding to student number 2, and data node 0 finds the three course identifiers and three scores corresponding to student number 3. The three data nodes each report their query results to the control node, and the control node aggregates the received query results and reports them to the client, which completes the query.
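To illustrate why this query needs no data movement, the sketch below joins table A and table B locally on each node and lets a stand-in for the control node merge the partial results. All row values, and the helper names shard and join_a_b, are assumptions made for the example; the source only specifies three students with three courses each.

```python
NUM_NODES = 3

# Hypothetical data: 3 students, each with 3 courses and a placeholder score.
table_a = [(s, f"name_{s}") for s in (1, 2, 3)]                       # (stu_id, stu_name)
table_b = [(s, c, 60 + 10 * s + c) for s in (1, 2, 3) for c in (1, 2, 3)]  # (stu_id, course_id, score)


def shard(rows, key_index):
    """Split rows across nodes by (distribution column value mod NUM_NODES)."""
    nodes = {n: [] for n in range(NUM_NODES)}
    for row in rows:
        nodes[row[key_index] % NUM_NODES].append(row)
    return nodes


def join_a_b(a_rows, b_rows):
    """select stu_name, course_id from A, B where A.stu_id = B.stu_id"""
    names = {stu_id: name for stu_id, name in a_rows}
    return [(names[s], c) for s, c, _score in b_rows if s in names]


# Both tables use stu_id (index 0) as the distribution column.
a_shards, b_shards = shard(table_a, 0), shard(table_b, 0)

# Each data node joins only its own shards; the control node merges the results.
local_results = {n: join_a_b(a_shards[n], b_shards[n]) for n in range(NUM_NODES)}
merged = sorted(row for rows in local_results.values() for row in rows)
```

Because matching rows of both tables share the same student number, they always fall on the same node, so the merged local results equal the result of joining the undistributed tables.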
When the client executes the query statement select stu_id, course_name from B, C where B.course_id = C.course_id, the distribution column of table B is the student number field while the distribution column of table C is the course identifier field. Because the distribution columns of the two tables differ, data node 1 can only find the course name for course identifier 1 corresponding to student number 1; the course names for course identifiers 2 and 3 corresponding to student number 1 cannot be found on data node 1, nor on data node 2 or data node 0. Data node 2 and data node 0 have the same problem. In this case, the control node needs to redistribute table B with the course identifier field as the distribution column (since table C has no student number field, distributing table B and table C according to the same distribution column requires redistributing table B by the Hash modulo 3 algorithm with the course identifier field as the distribution column, and re-creating the distribution tables of table B). The distribution tables re-created for table B on the three data nodes are shown in fig. 4, where k in the distribution table name is the distribution column identifier indicating that the course identifier field is the distribution column.
After the distribution tables of table B are re-created, the control node controls the three data nodes to perform data migration, so that each data node acquires from the other data nodes the data corresponding to the newly added entries in its re-created distribution table. For example, two rows of entries, namely the course 1 and course 1 score entries corresponding to student number 2 and student number 3, are newly added to the distribution table B + mdt1 + k in data node 1. These two rows were not involved when data node 1 created the distribution table B + mdt1 + j, so data node 1 did not acquire their data at that time. Data node 1 therefore obtains the course 1 and course 1 score data corresponding to student number 2 (i.e., the data of the first row entry in B + mdt2 + j) from data node 2, and obtains the course 1 and course 1 score data corresponding to student number 3 (i.e., the data of the first row entry in B + mdt0 + j) from data node 0. After data node 2 and data node 0 re-create their distribution tables of table B, they perform the corresponding steps according to their own table entries, which is not described in detail here.
After acquiring the data required by the re-created distribution table of table B, the data node adds the data to the corresponding entries, thereby completing the re-creation of the distribution table of table B. At this point, the control node can control the data nodes to query the course names of the three courses corresponding to each student number according to the distribution table of table C and the re-created distribution table of table B.
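The cost of the mismatch walked through above can be made concrete with a small sketch. It counts how many join results are found locally before and after table B is redistributed by course identifier, and how many rows of table B must migrate. The score and course name values and the helper names are placeholders assumed for illustration.

```python
NUM_NODES = 3

# Hypothetical data: placeholder scores and course names.
table_b = [(s, c, 70 + s + c) for s in (1, 2, 3) for c in (1, 2, 3)]  # (stu_id, course_id, score)
table_c = [(c, f"course_{c}") for c in (1, 2, 3)]                      # (course_id, course_name)


def shard(rows, key_index):
    """Split rows across nodes by (distribution column value mod NUM_NODES)."""
    nodes = {n: [] for n in range(NUM_NODES)}
    for row in rows:
        nodes[row[key_index] % NUM_NODES].append(row)
    return nodes


def join_b_c(b_rows, c_rows):
    """select stu_id, course_name from B, C where B.course_id = C.course_id"""
    course_names = dict(c_rows)
    return [(s, course_names[c]) for s, c, _score in b_rows if c in course_names]


def local_join_count(b_shards, c_shards):
    """Total number of result rows found when every node joins only its own shards."""
    return sum(len(join_b_c(b_shards[n], c_shards[n])) for n in range(NUM_NODES))


c_by_course = shard(table_c, 0)
found_before = local_join_count(shard(table_b, 0), c_by_course)  # B distributed by stu_id (j)
found_after = local_join_count(shard(table_b, 1), c_by_course)   # B redistributed by course_id (k)

# Rows of B whose node changes when the distribution column switches from j to k.
rows_migrated = sum(1 for s, c, _ in table_b if s % NUM_NODES != c % NUM_NODES)
```

With this data, only 3 of the 9 result rows are found locally before redistribution, all 9 are found afterwards, and 6 of the 9 rows of table B have to move between nodes. That migration is the query-time cost the embodiments below aim to pay in advance.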
Taken together, the distribution tables corresponding to the same logical data table on all data nodes completely reflect the content of the logical data table.
An embodiment of the present invention provides a data distribution method, as shown in fig. 5, the method includes the following steps:
501. before data query, the control node sets a distribution table creation instruction according to a creation rule.
The distribution table creation instruction carries an identifier (Identity, abbreviated as ID) of a logical data table and a distribution column identifier of a selected distribution column, where the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node.
The ID of the logic data table is used for uniquely identifying the logic data table, and the distribution column identification is used for uniquely identifying the selected distribution column. The distribution table creation indication may carry one or more logical data table IDs and one or more distribution column identifiers, and when carrying a plurality of distribution column identifiers, the plurality of distribution column identifiers may be a plurality of distribution column identifiers in one logical data table, or a plurality of distribution column identifiers in a plurality of logical data tables.
Taking the distribution table B + mdt1 + j on data node 1 in fig. 4 as an example, B is the ID of table B, j is the distribution column identifier indicating that the student number field is the distribution column, and mdt1 is the ID of the data node.
502. The control node sends a distribution table creation indication to the data node.
The data nodes create the distribution table of the logical data table according to the logical data table ID and the distribution column identifier in the distribution table creation indication, and complete the data migration among the data nodes.
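Steps 501 and 502 can be pictured with the following sketch, which models a distribution table creation indication as a list of (logical data table ID, distribution column identifier) pairs and derives the distribution table names a data node would create. The list-of-pairs representation and both function names are assumptions for illustration; the source only states which identifiers the indication carries and how the tables are named.

```python
def distribution_table_name(table_id: str, node_id: str, column_id: str) -> str:
    """Naming scheme from the description: table ID + data node ID + column ID."""
    return f"{table_id}+{node_id}+{column_id}"


def tables_to_create(indication, node_id):
    """Names of the distribution tables one data node creates for an indication.

    `indication` is a hypothetical in-memory form of the creation indication:
    a list of (logical data table ID, distribution column identifier) pairs.
    """
    return [distribution_table_name(table_id, node_id, column_id)
            for table_id, column_id in indication]


# The control node asks for table B to be distributed by both j and k.
indication = [("B", "j"), ("B", "k")]
created_on_node_1 = tables_to_create(indication, "mdt1")
```

Here data node mdt1 would end up holding both B+mdt1+j and B+mdt1+k, so a later join on either column can run without redistribution.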
In the prior art, after a client inputs a query statement (i.e., after the query has started), if the distribution tables participating in the distribution column joint query are not distributed and stored according to the same distribution column, the corresponding logical data tables need to be redistributed on the data nodes according to the same distribution column, and data must be migrated among the data nodes, which consumes considerable query time and reduces query efficiency. The data distribution method provided by the embodiment of the present invention selects the distribution column according to the creation rule in the data storage stage, and the distribution tables that the data nodes create based on the selected distribution column are used for subsequent distribution column joint queries. When the client inputs a query statement, the data nodes directly perform the distribution column joint query on the distribution tables already created according to the preset distribution columns, which saves query time and improves query efficiency.
Further, as a further extension of the embodiment shown in fig. 5, an embodiment of the present invention further provides a data distribution method, as shown in fig. 6, where the method includes the following steps:
601. Before data query, the control node sets a distribution table creation indication according to a creation rule based on the statistical result of the logical data table in a preset period.
The control node takes at least one field of the logical data table as a selected distribution column according to the statistical result of the data of the logical data table, and adds the distribution column identifier of the selected distribution column and the ID of the logical data table to the distribution table creation indication.
The statistical objects of the control node are all logical data tables created in the parallel database system, namely table A, table B and table C in fig. 2. The statistical result includes, within a preset period: the number of times the logical data table is queried, the proportion of the queried table entry data of the logical data table to the total table entry data of the logical data table, and the number of times a distribution column in the logical data table is queried. The number of times the logical data table is queried is the number of times the logical data table participates in a distribution column joint query within the preset period. The proportion of the queried table entry data to the total table entry data is the cumulative amount of access, within the preset period, to any row of table entry data in the logical data table, relative to the total table entry data of the logical data table. For example, suppose a logical data table has three rows of table entry data in total and the second threshold is 120%. If the first row of table entry data is queried 4 times within 5 minutes, the queried amount of that row is 4 rows (a cumulative value), which is 133% of the total table entry data of the logical data table (4/3 ≈ 1.33) and exceeds the second threshold of 120%. Here, the queries of a row of table entry data are not limited to distribution column joint queries. The number of times a distribution column in the logical data table is queried is the number of times any field in the logical data table is queried as a distribution column within the preset period, and these queries are likewise not limited to distribution column joint queries.
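The three statistics can be kept in a small per-table counter structure, sketched below. The class and method names are hypothetical; the worked numbers (three rows, four accesses, a second threshold of 120%) are the ones from the example above.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TableStats:
    """Per-period statistics for one logical data table (hypothetical structure)."""
    total_rows: int
    join_times: int = 0        # times the table took part in a joint query
    accessed_rows: int = 0     # cumulative number of row accesses
    column_hits: Counter = field(default_factory=Counter)  # queries per column

    def record_join(self) -> None:
        self.join_times += 1

    def record_row_access(self, rows: int = 1) -> None:
        self.accessed_rows += rows

    def record_column_use(self, column_id: str) -> None:
        self.column_hits[column_id] += 1

    @property
    def accessed_ratio(self) -> float:
        """Cumulative accessed rows relative to the table's total rows."""
        return self.accessed_rows / self.total_rows


# Worked example from the text: 3 rows in total, first row queried 4 times.
stats = TableStats(total_rows=3)
for _ in range(4):
    stats.record_row_access()

SECOND_THRESHOLD = 1.20  # 120%
exceeds_second_threshold = stats.accessed_ratio > SECOND_THRESHOLD
```

Since the ratio is cumulative, it can exceed 100%, which is why thresholds such as 120% or 180% are meaningful here.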
Specifically, the method comprises the following steps:
A) In a preset period, when the number of times the logical data table is queried exceeds a first threshold, the control node adds the ID of the logical data table and the distribution column identifiers of all fields (as distribution columns) in the logical data table to the distribution table creation indication.
B) In a preset period, when the proportion of the queried table entry data in the logical data table to the total table entry data in the logical data table exceeds a second threshold, the control node adds the ID of the logical data table and the distribution column identifiers of all fields (as distribution columns) in the logical data table to the distribution table creation indication.
C) In a preset period, when the number of times a distribution column in the logical data table is queried exceeds a third threshold, the control node adds the ID of the logical data table and the distribution column identifier of each distribution column whose query count exceeds the third threshold to the distribution table creation indication.
In the embodiment of the present invention, the control node may set the distribution table creation indication according to any one of the three statistical results, for example:
When neither the number of times the logical data table is queried nor the proportion of the queried table entry data to the total table entry data of the logical data table reaches its threshold, the control node does not add the ID of the logical data table to the distribution table creation indication. When either of these two conditions reaches its corresponding threshold, the control node adds the ID of the logical data table and the distribution column identifiers of all fields (as distribution columns) in the logical data table to the distribution table creation indication. When the number of times a distribution column in the logical data table is queried reaches the third threshold, the control node adds the distribution column identifier of that distribution column and the ID of the logical data table to the distribution table creation indication.
Optionally, the control node may also combine the three statistical results as the basis for adding the logical data table ID and the distribution column identifier. For example, within a preset period of 5 minutes, if a logical data table satisfies the logical expression (Join Times > 6) and (Accessed Lines percentage > 180%) and (CFrenqence > 8), the ID of the logical data table and the corresponding distribution column identifier are added to the distribution table creation indication. Here, (Join Times > 6) indicates that the number of times the logical data table is queried is greater than 6; (Accessed Lines percentage > 180%) indicates that the ratio of the queried amount of a row of table entry data in the logical data table to the total table entry data of the logical data table is greater than 180%; (CFrenqence > 8) indicates that the number of times a distribution column in the logical data table is queried is greater than 8; and "and" indicates that the three decision conditions are in an AND relationship, that is, all three must be satisfied simultaneously. Alternatively, the logical expression of a logical data table may be (Accessed Lines percentage > 5 / Join Times > 50%) and (CFrenqence > 6), where "/" represents an OR relationship, that is, either one of the two decision conditions suffices.
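The selection rules A) to C) and the first combined expression can be sketched as two small functions. The function names and parameter names are assumptions; the thresholds for rules A) to C) are passed in because the source gives no fixed values for them, while the values 6, 180% and 8 in the combined rule are taken from the example expression.

```python
def columns_to_add(fields, join_times, accessed_ratio, column_hits,
                   first_threshold, second_threshold, third_threshold):
    """Rules A) and B): a table-level threshold selects every field.
    Rule C): otherwise only columns queried more than the third threshold."""
    if join_times > first_threshold or accessed_ratio > second_threshold:
        return list(fields)
    return [f for f in fields if column_hits.get(f, 0) > third_threshold]


def combined_rule(join_times, accessed_ratio, column_hits):
    """(Join Times > 6) and (Accessed Lines percentage > 180%) and (CFrenqence > 8).

    Returns the columns whose query count exceeds 8 when all three
    conditions hold, otherwise an empty list.
    """
    hot_columns = [c for c, hits in column_hits.items() if hits > 8]
    if join_times > 6 and accessed_ratio > 1.80 and hot_columns:
        return hot_columns
    return []
```

For instance, columns_to_add(["stu_id", "course_id", "score"], 2, 0.5, {"course_id": 9}, 6, 1.2, 8) selects only course_id, whereas raising join_times to 7 selects all three fields.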
The statement that the control node controls the data node to create the distribution table is as follows:
The emphasized text is a newly added definition relative to the current standard SQL language, and is explained as follows:
[Distribution on KEY column_name [, column_name, …] … allowMultipleDistribution, where column_name is a distribution column identifier, column_name [, column_name, …] indicates that a plurality of distribution columns may be used as arguments of the distribution algorithm, and allowMultipleDistribution indicates that creation of a distribution table is permitted.
In addition, the control node can add the ID of the logical data table and the distribution column identifier of the selected distribution column of the logical data table to the distribution table creation instruction according to the creation instruction of the client. For example, the creation instruction received by the control node includes a logical data table ID selected by a querier or a database administrator and a distribution column identifier of at least one distribution column in the logical data table. The control node adds the logical data table ID in the create indication and the distribution column identification of the selected distribution column to the distribution table create indication.
Furthermore, to save storage space on the data nodes, the control node may periodically delete distribution tables in the data nodes that are never or rarely queried. Specifically, the control node counts the number of times each distribution table in the data nodes is queried. If, within a preset period, there is a distribution table whose query count is smaller than a fourth threshold, the control node sends a distribution table deletion indication to the data node to which that distribution table belongs, where the distribution table deletion indication carries the distribution column identifier of the distribution column corresponding to that distribution table and the ID of the logical data table to which the distribution column belongs. After receiving the distribution table deletion indication, the data node deletes the corresponding distribution table according to the logical data table ID and the distribution column identifier carried in the distribution table deletion indication.
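The periodic clean-up can be sketched as follows. The dictionary layout, the function names, and the sample counts are assumptions made for illustration; the source only specifies that tables queried fewer times than the fourth threshold are deleted, identified by logical data table ID and distribution column identifier.

```python
def deletion_indications(query_counts, fourth_threshold):
    """Return the (table ID, column ID) pairs whose distribution tables were
    queried fewer than `fourth_threshold` times in the period."""
    return sorted(key for key, count in query_counts.items()
                  if count < fourth_threshold)


def apply_deletion(node_tables, node_id, indications):
    """Data node side: drop the named distribution tables if present.

    `node_tables` is a hypothetical dict mapping table name -> rows.
    """
    for table_id, column_id in indications:
        node_tables.pop(f"{table_id}+{node_id}+{column_id}", None)
    return node_tables


# Hypothetical per-period query counts kept by the control node.
counts = {("B", "j"): 12, ("B", "k"): 0, ("C", "k"): 1}
to_delete = deletion_indications(counts, fourth_threshold=2)

node_1 = {"B+mdt1+j": [], "B+mdt1+k": [], "C+mdt1+k": []}
remaining = apply_deletion(node_1, "mdt1", to_delete)
```

In a real system the control node would presumably keep at least one distribution table per logical data table so that no data is lost; the source does not spell this out, so the sketch does not enforce it.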
Further, to prevent the data node from creating the same distribution table repeatedly, when the control node sets the distribution table creation instruction, it may first determine whether the data node has already created a distribution table of the logical data table according to the ID of the logical data table and the distribution column identifier of the selected distribution column, that is, the ID and the distribution column identifier that the control node is about to add to the distribution table creation instruction. If the data node has already created such a distribution table, the control node does not add the ID of the logical data table and the distribution column identifier of the selected distribution column to the distribution table creation instruction.
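The duplicate check can be illustrated with a short sketch. It assumes that the control node keeps a record of the (logical data table ID, distribution column identifier) pairs for which a data node has already created a distribution table; the function name and the data layout are hypothetical.

```python
def build_creation_instruction(selected, already_created):
    """Keep only the (table_id, column_id) pairs for which the data node
    has not yet created a distribution table.

    selected        -- pairs chosen by statistics or by the client
    already_created -- set of pairs already created on the data node
    """
    return [pair for pair in selected if pair not in already_created]


already_created = {("B", "course_id")}
selected = [("B", "course_id"), ("B", "stu_id")]

instruction = build_creation_instruction(selected, already_created)
```

If every selected pair has already been created, the resulting instruction is empty and, under this sketch, nothing would need to be sent to the data node.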
602. The control node sends a distribution table creation indication to the data node.
The control node sends the distribution table creation instruction to the data node, so that the data node creates a distribution table for the logical data table according to the ID of the logical data table and the distribution column identifier carried in the distribution table creation instruction, together with a preset distribution algorithm.
Specifically, the distribution tables created for Table B are shown as tables B + mdt1+ k, B + mdt2+ k, and B + mdt0+ k in FIG. 4. The data node may create a distribution table with one field of the logical data table as the distribution column (as shown in fig. 3), or with two or more fields of the logical data table as distribution columns. For example, the sum of the school number field and the achievement field may be used as the argument of the modulo-3 algorithm. When a data node creates distribution tables based on a single field (e.g., the school number field) of a logical data table, the maximum number of distribution tables that the data node can create equals the number of fields of the logical data table. When distribution tables are created according to two or more fields of the logical data table, the maximum number of distribution tables that the data node can create is the number of permutations and combinations of at least two fields of the logical data table, and the number of distribution tables that can be created is larger than the number of fields in the logical data table.
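As a rough illustration of the modulo-3 example, the following sketch assigns rows to data nodes using either one field or the sum of two fields as the argument. The field names stu_id, course_id and score and the sample rows are assumptions chosen for the example, not values taken from the figures.

```python
from collections import defaultdict


def node_for_row(row, dist_columns, node_count=3):
    """Sum the values of the distribution columns and take the result
    modulo the number of data nodes."""
    return sum(row[col] for col in dist_columns) % node_count


def build_distribution_tables(rows, dist_columns, node_count=3):
    """Group rows by the data node that the distribution algorithm selects."""
    tables = defaultdict(list)
    for row in rows:
        tables[node_for_row(row, dist_columns, node_count)].append(row)
    return dict(tables)


rows = [
    {"stu_id": 1, "course_id": 1, "score": 90},
    {"stu_id": 2, "course_id": 2, "score": 85},
    {"stu_id": 3, "course_id": 3, "score": 70},
]

# One field as the distribution column.
by_course = build_distribution_tables(rows, ["course_id"])
# Two fields summed as the argument of the modulo-3 algorithm.
by_sum = build_distribution_tables(rows, ["stu_id", "score"])
```

With a single distribution column, rows for course 1 land on data node 1, course 2 on data node 2, and course 3 on data node 0, which matches the mdt1, mdt2 and mdt0 naming used for the distribution tables.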
In the embodiment of the present invention, the distribution algorithm used when the data nodes distribute or create the distribution table includes, but is not limited to, a hash algorithm, a range algorithm, and a round robin algorithm.
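The three families of distribution algorithm named above can be sketched in a few lines each. These are generic textbook forms written for illustration; the embodiment does not specify a particular hash function, range boundaries or rotation scheme, so CRC32, the boundary list and the counter are all assumptions.

```python
import zlib
from bisect import bisect_right
from itertools import count


def hash_node(value, node_count):
    """Hash distribution: a deterministic hash of the value, modulo the node count."""
    return zlib.crc32(str(value).encode("utf-8")) % node_count


def range_node(value, upper_bounds):
    """Range distribution: upper_bounds is a sorted list of exclusive upper
    bounds; values at or above the last bound go to the last node."""
    return bisect_right(upper_bounds, value)


def make_round_robin(node_count):
    """Round-robin distribution: each call returns the next node in turn."""
    counter = count()
    return lambda: next(counter) % node_count


next_node = make_round_robin(3)
assignments = [next_node() for _ in range(4)]
```

Hash and range distribution depend on the value of the distribution column, so two tables distributed on the same column place matching rows on the same node; round robin ignores the value and only balances row counts.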
After the data node completes creation of the distribution table according to the distribution table creation instruction, data included in the newly created distribution table but not stored in the data node needs to be acquired from other data nodes, so that data migration between the data nodes is completed.
For example, in fig. 4, after creating the distribution table B + mdt1+ k, data node 1 obtains all the entry data in the B + mdt2+ j table and the B + mdt0+ j table sent by data node 2 and data node 0.
Further, to reduce the data migration amount, the data node 1 may also obtain only the data of the first row entry in B + mdt2+ j sent by the data node 2, and the data of the first row entry in B + mdt0+ j sent by the data node 0.
After the data node has created the distribution table and performed data migration, when the client initiates a query request, the control node sends a query instruction to the data node so that the data node performs the distribution column joint query.
When performing the distribution column joint query, the distribution table with the smaller query cost may be selected for the query, where the query cost includes, but is not limited to, the data migration amount. For example, when the client executes the query statement Select stu_id, coarse_name from B, C where B.coarse_id = C.course_id, the data tables participating in the distribution column joint query on data node 1 are B + mdt1+ k and C + mdt1+ k, and the two tables are distributed according to the same distribution column (course identification k). Data node 1 can look up the achievements of course 1 corresponding to the three school numbers and report them to the control node. Similarly, data node 2 and data node 0 can look up the achievements of courses 2 and 3 corresponding to the three school numbers and report them to the control node. The control node aggregates the query results reported by the three data nodes and feeds them back to the client to complete the query.
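The benefit of distributing both tables on the same column can be shown with a small sketch: each data node joins only its local fragments, and the control node concatenates the partial results. The field names, sample rows and the modulo-3 placement are assumptions made for illustration.

```python
def distribute(rows, column, node_count=3):
    """Place each row on the node given by the column value modulo node_count."""
    shards = [[] for _ in range(node_count)]
    for row in rows:
        shards[row[column] % node_count].append(row)
    return shards


def local_join(b_rows, c_rows):
    """Join B and C on course_id using only the rows held by one node."""
    courses = {}
    for c in c_rows:
        courses.setdefault(c["course_id"], []).append(c)
    return [
        (b["stu_id"], c["course_name"])
        for b in b_rows
        for c in courses.get(b["course_id"], [])
    ]


B = [{"stu_id": s, "course_id": k} for s in (1, 2, 3) for k in (1, 2, 3)]
C = [
    {"course_id": 1, "course_name": "math"},
    {"course_id": 2, "course_name": "physics"},
    {"course_id": 3, "course_name": "history"},
]

b_shards = distribute(B, "course_id")
c_shards = distribute(C, "course_id")

# Each data node joins locally; the control node aggregates the results.
result = [pair for n in range(3) for pair in local_join(b_shards[n], c_shards[n])]
```

Because matching rows of B and C always share a node, no rows need to move between nodes at query time, and the aggregated result equals the join computed over the undistributed tables.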
The embodiment of the present invention is described by taking an example in which two data tables participate in the distribution column joint query. In practical applications, three or more data tables may participate in the distribution column joint query.
In the prior art, the distribution table is created and data migration is performed according to the query condition after the client inputs the query statement (i.e., after the query has started), and data migration during the query occupies a large amount of query time, thereby reducing query efficiency. With the data distribution method provided by the embodiment of the invention, the distribution column can be selected in the data storage stage according to the statistical result of preset data or the indication of the client, and the distribution table that the data node creates based on the selected distribution column is used for subsequent distribution column joint queries. When the client inputs the query statement, the data nodes directly perform the distribution column joint query using the distribution tables already created from the preset distribution columns, so that query time can be saved and query efficiency can be improved.
In an application scenario of the embodiment of the present invention, as shown in fig. 7, the control node may instruct the data node to create the distribution table of table B according to the client instruction or the statistical result. Specifically:
701. The client sends a distribution table creation instruction to the control node, instructing it to create a distribution table for table B in the parallel database system.
702. The control node sends a distribution table creation instruction to each of the three data nodes, where the distribution table creation instruction carries the ID of table B and the identification of a distribution column (the course identification field).
703. The three data nodes create a distribution table for table B according to the distribution table creation instruction.
704. The control node sends a data migration instruction to each of the three data nodes.
705. The data nodes perform data migration according to the data migration instruction.
706. The data nodes send a creation success message to the control node.
707. After receiving the creation success messages sent by the data nodes, the control node sends a creation success message to the client.
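The message flow of steps 701 to 707 can be sketched as a pair of small classes. This is a simplified, in-process illustration only: the class and method names are hypothetical, network messages are modelled as method calls, and modulo placement is assumed as the preset distribution algorithm.

```python
class DataNode:
    def __init__(self, node_id, node_count, local_rows):
        self.node_id = node_id
        self.node_count = node_count
        self.local_rows = local_rows   # rows originally stored on this node
        self.dist_tables = {}          # (table_id, column) -> rows

    def create_distribution_table(self, table_id, column):
        # Step 703: create an empty distribution table.
        self.dist_tables[(table_id, column)] = []

    def migrate(self, table_id, column, all_nodes):
        # Step 705: copy the rows that belong here from every node;
        # the source rows stay where they are.
        for node in all_nodes:
            for row in node.local_rows:
                if row[column] % self.node_count == self.node_id:
                    self.dist_tables[(table_id, column)].append(dict(row))
        return "creation success"      # Step 706


class ControlNode:
    def __init__(self, nodes):
        self.nodes = nodes

    def handle_create_request(self, table_id, column):
        # Step 701 has arrived from the client.
        for node in self.nodes:        # Step 702
            node.create_distribution_table(table_id, column)
        acks = [n.migrate(table_id, column, self.nodes) for n in self.nodes]  # Step 704
        # Step 707: report success only when every data node has acknowledged.
        return "creation success" if all(a == "creation success" for a in acks) else None


nodes = [
    DataNode(0, 3, [{"stu_id": 1, "course_id": 1}, {"stu_id": 1, "course_id": 2}]),
    DataNode(1, 3, [{"stu_id": 2, "course_id": 3}, {"stu_id": 2, "course_id": 1}]),
    DataNode(2, 3, [{"stu_id": 3, "course_id": 2}, {"stu_id": 3, "course_id": 3}]),
]
reply = ControlNode(nodes).handle_create_request("B", "course_id")
```

After the call, data node 1 holds copies of both course 1 rows even though they were originally stored on data nodes 0 and 1, while the original rows remain on their source nodes.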
In the application scenario shown in fig. 7, a distribution table can be created on each data node, and data migration between data nodes can be completed, before the client queries. Because the distribution table is created before the client queries, the distribution column used to create it cannot be determined from the keywords in a query statement. In the application scenario shown in fig. 7, the distribution table is therefore created according to the statistical result of the user query data, or according to the logical data table ID and the distribution column set in advance by the client, and the created distribution table is used for subsequent distribution column joint queries. Because the steps of creating the distribution table and migrating data are moved to before the client's query, the delay that these steps would otherwise add to the query is avoided, and query efficiency can be improved.
In the embodiment of the present invention and the application scenario shown in fig. 7, data migration specifically means that the source data node copies data that it stores and transmits the copied data to the destination data node. The data stored in the source data node still exists after data migration. In practical applications, data migration is a technical means known to those skilled in the art and is not described in more detail in the embodiments of the present invention.
Referring to implementation of the method embodiment shown in fig. 6, an embodiment of the present invention further provides a control node, which is used to implement the method embodiment shown in fig. 6. As shown in fig. 8, the control node includes: a processing unit 81, a transmitting unit 82, and a receiving unit 83, wherein,
the processing unit 81 is configured to set a distribution table creation instruction according to a creation rule before data query, where the distribution table creation instruction carries an ID of a logical data table and a distribution column identifier of a selected distribution column, where the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node;
the sending unit 82 is configured to send the distribution table creation instruction set by the processing unit 81 to a data node, so that the data node creates the distribution table of the logical data table according to the distribution table creation instruction.
Further, the processing unit 81 is specifically configured to: and counting the data of the logic data table in a preset period to obtain a statistical result, and adding the ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication according to the statistical result.
Further, the receiving unit 83 is configured to receive a creation instruction of a client, where the creation instruction carries an ID of the logical data table and a distribution column identifier of the selected distribution column;
the processing unit 81 is further specifically configured to:
adding the ID of the logical data table and the distribution column identifier of the selected distribution column carried in the creation indication received by the receiving unit 83 to the distribution table creation indication.
Further, the processing unit 81 is further specifically configured to: count at least one of the following data in a preset period: the number of times the logical data table is queried, the proportion of the queried table entry data in the logical data table to the total table entry data of the logical data table, and the number of times the distribution columns in the logical data table are queried;
when the number of times that the logical data table is queried exceeds a first threshold value, adding the ID of the logical data table and the distribution column identification of all distribution columns in the logical data table into the distribution table creation indication;
when the proportion of the inquired table entry data in the logical data table to the total table entry data in the logical data table exceeds a second threshold value, adding the ID of the logical data table and the distribution column identifiers of all distribution columns in the logical data table to the creation indication of the distribution table;
when the number of times of inquiring the distribution column in the logic data table exceeds a third threshold value, adding the ID of the logic data table and the distribution column identification of the distribution column, of which the number of times of inquiring exceeds the third threshold value, in the distribution table creation indication.
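The three threshold rules above can be combined in one small selection function. The sketch below is only one possible reading: it treats the rules as alternatives evaluated together, and the parameter names, sample statistics and threshold values are assumptions for illustration.

```python
def select_distribution_columns(table_id, all_columns, table_queries,
                                queried_ratio, column_queries,
                                first, second, third):
    """Return the (table_id, column_id) pairs to add to the distribution
    table creation indication.

    table_queries  -- times the logical data table was queried in the period
    queried_ratio  -- queried entries divided by total entries of the table
    column_queries -- mapping from column_id to times that column was queried
    """
    # First and second thresholds: select every distribution column.
    if table_queries > first or queried_ratio > second:
        return [(table_id, col) for col in all_columns]
    # Third threshold: select only the frequently queried columns.
    return [(table_id, col) for col in all_columns
            if column_queries.get(col, 0) > third]


columns = ["stu_id", "course_id", "score"]
stats = {"stu_id": 3, "course_id": 25, "score": 0}

hot_table = select_distribution_columns("B", columns, 500, 0.1, stats, 100, 0.5, 10)
hot_column = select_distribution_columns("B", columns, 20, 0.1, stats, 100, 0.5, 10)
```

In the first call the table-level count exceeds the first threshold, so all three columns are selected; in the second call only course_id exceeds the third threshold.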
Further, the sending unit 82 is further configured to: when the processing unit 81 determines from its statistics that the number of times a distribution table corresponding to a distribution column in the logical data table has been queried is smaller than a fourth threshold, send a distribution table deletion instruction to the data node, where the distribution table deletion instruction carries the distribution column identifier of the distribution column corresponding to the distribution table whose queried number of times is smaller than the fourth threshold and the ID of the logical data table, and the distribution table deletion instruction is used to instruct the data node to delete the distribution table whose queried number of times is smaller than the fourth threshold.
Further, the processing unit 81 is further configured to: before the control node adds the ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication, judging whether the data node creates the distribution table of the logic data table according to the ID of the logic data table and the distribution column identification of the selected distribution column;
when the data node does not create the distribution table of the logic data table according to the ID of the logic data table and the distribution column identification of the selected distribution column, adding the ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication;
when the data node has created the distribution table of the logical data table according to the ID of the logical data table and the distribution column identifier of the selected distribution column, not adding the ID of the logical data table and the distribution column identifier of the selected distribution column to the distribution table creation indication.
The control node provided by the embodiment of the invention can select the distribution column in the data storage stage according to the statistical result of preset data or the indication of the client, and the distribution table that the data node creates based on the selected distribution column is used for subsequent distribution column joint queries. When the client inputs the query statement, the data nodes directly perform the distribution column joint query using the distribution tables already created from the preset distribution columns, so that query time can be saved and query efficiency can be improved.
Further, the embodiment of the present invention also provides a data distribution system, as shown in fig. 9, the system includes a control node 91 and at least three data nodes 92, wherein,
the control node 91 is configured to set a distribution table creation instruction according to a creation rule before data query, and to send the distribution table creation instruction to the data node 92, where the distribution table creation instruction carries an identifier ID of a logical data table and a distribution column identifier of a selected distribution column, the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node.
The data node 92 is configured to receive the distribution table creation instruction sent by the control node 91 before data query, and create the distribution table of the logical data table according to the distribution table creation instruction.
The data distribution system provided by the embodiment of the present invention is described by taking three data nodes 92 as an example, and the number of the data nodes 92 is not limited in practical application.
In the data distribution system provided by the embodiment of the invention, the control node can select the distribution column in the data storage stage according to the statistical result of preset data or the indication of the client, and the distribution table that the data node creates based on the selected distribution column is used for subsequent distribution column joint queries. When the client inputs the query statement, the data nodes directly perform the distribution column joint query using the distribution tables already created from the preset distribution columns, so that query time can be saved and query efficiency can be improved.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example. In practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A method of data distribution, comprising:
before data query, a control node sets a distribution table creation instruction according to a creation rule, wherein the distribution table creation instruction carries an identification ID of a logic data table and a distribution column identification of a selected distribution column, the selected distribution column is a distribution column in the logic data table, and the logic data table is a logic data table created in the control node;
the control node sends the distribution table creation instruction to a data node so that the data node creates a distribution table of the logic data table according to the distribution table creation instruction;
the control node sets a distribution table creation instruction according to a creation rule, and specifically includes: the control node judges whether the data node creates a distribution table of the logic data table according to the identification ID of the logic data table and the distribution column identification of the selected distribution column; when the data node does not create the distribution table of the logic data table according to the identification ID of the logic data table and the distribution column identification of the selected distribution column, the control node adds the identification ID of the logic data table and the distribution column identification of the selected distribution column to the distribution table creation indication; when the data node has created the distribution table of the logical data table according to the identification ID of the logical data table and the distribution column identification of the selected distribution column, the control node does not add the identification ID of the logical data table and the distribution column identification of the selected distribution column to the distribution table creation indication.
2. The method according to claim 1, wherein the control node sets a distribution table creation instruction according to a creation rule, specifically comprising:
the control node counts the data of the logic data table in a preset period to obtain a statistical result, and adds the identification ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication according to the statistical result; or,
the control node receives a creation instruction of a client, the creation instruction carries the identification ID of the logic data table and the distribution column identification of the selected distribution column, and the control node adds the identification ID of the logic data table and the distribution column identification of the selected distribution column carried in the creation instruction to the distribution table creation instruction.
3. The method according to claim 2, wherein the counting of the data of the logical data table by the control node in a preset period specifically includes:
in a preset period, the control node counts at least one of the following data: the number of times the logic data table is queried, the proportion of the queried table entry data in the logic data table to the total table entry data of the logic data table, and the number of times the distribution columns in the logic data table are queried;
adding the identifier ID of the logical data table and the distribution column identifier of the selected distribution column to the distribution table creation instruction according to the statistical result specifically includes:
when the number of times that the logical data table is queried exceeds a first threshold value, the control node adds the identification ID of the logical data table and the distribution column identifications of all distribution columns in the logical data table to the distribution table creation indication; and/or,
when the proportion of the queried table entry data in the logical data table to the total table entry data in the logical data table exceeds a second threshold, the control node adds the identification ID of the logical data table and the distribution column identifications of all distribution columns in the logical data table to the distribution table creation indication; and/or,
when the number of times a distribution column in the logic data table is queried exceeds a third threshold, the control node adds the identification ID of the logic data table and the distribution column identification of the distribution column whose queried number of times exceeds the third threshold to the distribution table creation indication.
4. The method of claim 3, wherein when the number of times that the distribution table corresponding to the distribution column in the logical data table is queried is less than a fourth threshold, the method further comprises:
and the control node sends a distribution table deletion instruction to the data node, wherein the distribution table deletion instruction carries the distribution column identifier of the distribution column corresponding to the distribution table with the queried number of times smaller than the fourth threshold and the identifier ID of the logic data table, and the distribution table deletion instruction is used for indicating the data node to delete the distribution table with the queried number of times smaller than the fourth threshold.
5. A control node, comprising:
a processing unit, configured to set a distribution table creation instruction according to a creation rule before data query, where the distribution table creation instruction carries an identifier ID of a logical data table and a distribution column identifier of a selected distribution column, where the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node;
a sending unit, configured to send the distribution table creation instruction set by the processing unit to a data node, so that the data node creates a distribution table of the logical data table according to the distribution table creation instruction;
the processing unit is specifically configured to determine whether the data node has created a distribution table of the logical data table according to the identifier ID of the logical data table and the distribution column identifier of the selected distribution column; when the data node does not create the distribution table of the logic data table according to the identification ID of the logic data table and the distribution column identification of the selected distribution column, adding the identification ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication; when the data node has created the distribution table of the logical data table according to the identification ID of the logical data table and the distribution column identification of the selected distribution column, not adding the identification ID of the logical data table and the distribution column identification of the selected distribution column to the distribution table creation indication.
6. The control node according to claim 5, wherein the processing unit is specifically configured to:
and counting the data of the logic data table in a preset period to obtain a statistical result, and adding the identification ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication according to the statistical result.
7. The control node according to claim 5, further comprising a receiving unit, configured to receive a creation instruction of a client, where the creation instruction carries an identifier ID of the logical data table and a distribution column identifier of the selected distribution column;
the processing unit is further specifically configured to:
and adding the identification ID of the logic data table and the distribution column identification of the selected distribution column carried in the creation indication received by the receiving unit into the distribution table creation indication.
8. The control node of claim 6, wherein the processing unit is further specifically configured to:
count at least one of the following data in a preset period: the number of times the logic data table is queried, the proportion of the queried table entry data in the logic data table to the total table entry data of the logic data table, and the number of times the distribution columns in the logic data table are queried;
when the number of times that the logical data table is queried exceeds a first threshold value, adding the identification ID of the logical data table and the distribution column identifications of all distribution columns in the logical data table into a distribution table creation indication;
when the proportion of the inquired table entry data in the logic data table to the total table entry data in the logic data table exceeds a second threshold value, adding the identification ID of the logic data table and the distribution column identifications of all distribution columns in the logic data table to the creation indication of the distribution table;
when the number of times a distribution column in the logic data table is queried exceeds a third threshold value, adding the identification ID of the logic data table and the distribution column identification of the distribution column whose queried number of times exceeds the third threshold value to the distribution table creation indication.
9. The control node according to claim 8, wherein the sending unit is further configured to: when the processing unit determines from its statistics that the number of times a distribution table corresponding to a distribution column in the logical data table has been queried is smaller than a fourth threshold, send a distribution table deletion instruction to the data node, where the distribution table deletion instruction carries the distribution column identifier of the distribution column corresponding to the distribution table whose queried number of times is smaller than the fourth threshold and an identifier ID of the logical data table, and the distribution table deletion instruction is used to instruct the data node to delete the distribution table whose queried number of times is smaller than the fourth threshold.
10. A system for data distribution, the system comprising a control node and a data node, wherein:
the control node is configured to set a distribution table creation instruction according to a creation rule before data query, and to send the distribution table creation instruction to the data node, where the distribution table creation instruction carries an identifier ID of a logical data table and a distribution column identifier of a selected distribution column, the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node;
the data node is configured to receive the distribution table creation instruction sent by the control node before data query, and create the distribution table of the logical data table according to the distribution table creation instruction;
the control node is specifically configured to determine whether the data node has created a distribution table of the logical data table according to the identifier ID of the logical data table and the distribution column identifier of the selected distribution column; when the data node does not create the distribution table of the logic data table according to the identification ID of the logic data table and the distribution column identification of the selected distribution column, adding the identification ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication; when the data node has created the distribution table of the logical data table according to the identification ID of the logical data table and the distribution column identification of the selected distribution column, not adding the identification ID of the logical data table and the distribution column identification of the selected distribution column to the distribution table creation indication.
CN201280002465.XA 2012-07-26 2012-07-26 The method of data distribution, apparatus and system Active CN103748578B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/079173 WO2014015492A1 (en) 2012-07-26 2012-07-26 Data distribution method, device, and system

Publications (2)

Publication Number Publication Date
CN103748578A (en) 2014-04-23
CN103748578B (en) 2017-10-10

Family

ID=49996501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201280002465.XA Active CN103748578B (en) 2012-07-26 2012-07-26 The method of data distribution, apparatus and system

Country Status (2)

Country Link
CN (1) CN103748578B (en)
WO (1) WO2014015492A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022240906A1 (en) * 2021-05-11 2022-11-17 Strong Force Vcn Portfolio 2019, Llc Systems, methods, kits, and apparatuses for edge-distributed storage and querying in value chain networks
US20220261389A1 (en) * 2021-02-18 2022-08-18 International Business Machines Corporation Distributing rows of a table in a distributed database system
US12039559B2 (en) 2021-04-16 2024-07-16 Strong Force Vcn Portfolio 2019, Llc Control tower encoding of cross-product data structure

Citations (1)

Publication number Priority date Publication date Assignee Title
CN101916261A (en) * 2010-07-28 2010-12-15 北京播思软件技术有限公司 Data partitioning method for distributed parallel database system

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US7562090B2 (en) * 2002-12-19 2009-07-14 International Business Machines Corporation System and method for automating data partitioning in a parallel database
CA2556979A1 (en) * 2004-02-21 2005-10-20 Datallegro, Inc. Ultra-shared-nothing parallel database
CN102033889B (en) * 2009-09-29 2012-08-22 熊凡凡 Distributed database parallel processing system
CN102375853A (en) * 2010-08-24 2012-03-14 中国移动通信集团公司 Distributed database system, method for building index therein and query method
US8326825B2 (en) * 2010-11-05 2012-12-04 Microsoft Corporation Automated partitioning in parallel database systems
CN102122306A (en) * 2011-03-28 2011-07-13 中国人民解放军国防科学技术大学 Data processing method and distributed file system applying same
CN102323946B (en) * 2011-09-05 2013-03-27 天津神舟通用数据技术有限公司 Implementation method for operator reuse in parallel database

Non-Patent Citations (1)

Title
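Query-oriented database data distribution strategy for cloud computing; Wen Mingbo (文明波), Ding Zhiming (丁治明); Computer Science (《计算机科学》); 2010-09-30; Vol. 37, No. 9; pp. 170-172 *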

Also Published As

Publication number Publication date
WO2014015492A1 (en) 2014-01-30
CN103748578A (en) 2014-04-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220216

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Patentee after: Huawei Cloud Computing Technologies Co.,Ltd.

Address before: 518129 Huawei headquarters office building, Bantian, Longgang District, Shenzhen, Guangdong

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.