CN103748578B

CN103748578B - Data distribution method, apparatus, and system

Info

Publication number: CN103748578B
Application number: CN201280002465.XA
Authority: CN
Inventors: 吴向阳; 曹俊亮; 曹莉
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2012-07-26
Filing date: 2012-07-26
Publication date: 2017-10-10
Anticipated expiration: 2032-07-26
Also published as: WO2014015492A1; CN103748578A

Abstract

The invention discloses a data distribution method, apparatus, and system. It relates to the field of information technologies and is devised to save query time and improve query efficiency. The method includes: before data query, a control node sets a distribution table creation indication according to a creation rule, where the distribution table creation indication carries an identifier (ID) of a logical data table and a distribution column identifier of a selected distribution column, the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node; the control node sends the distribution table creation indication to a data node, so that the data node creates a distribution table of the logical data table according to the distribution table creation indication. The invention is used in the process of data distribution in a parallel database system.

Description

Data distribution method, device and system

Technical Field

The present invention relates to the field of information technologies, and in particular, to a method, an apparatus, and a system for data distribution.

Background

A parallel database system is a data storage technology that stores data content on a plurality of data nodes in a distributed manner, and can distribute a logical data table across the data nodes according to algorithms such as hash, range, and round-robin. A parallel database system queries the data content required by the user on all data nodes in parallel; compared with a non-parallel database system, it offers higher query speed and easier management of data content.

Generally, a logical data table contains a plurality of fields. The parallel database system uses the content of one (or more) of these fields as the argument of the above algorithm to distribute and store the logical data table on the data nodes; the field serving as the argument is referred to as the distribution column of the logical data table.

In the prior art, when a parallel database system performs a join query on the distribution tables of multiple logical data tables, if the distribution columns of the distribution tables that have a query relationship are different, the multiple logical data tables need to be redistributed according to a distribution column shared by the multiple distribution tables, which reduces query efficiency.

Disclosure of Invention

Embodiments of the present invention provide a method, an apparatus, and a system for data distribution, which can save query time and improve query efficiency.

In one aspect, an embodiment of the present invention provides a data distribution method, including:

before data query, a control node sets a distribution table creation indication according to a creation rule, where the distribution table creation indication carries an identifier (ID) of a logical data table and a distribution column identifier of a selected distribution column, the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node;

and the control node sends the distribution table creation indication to a data node, so that the data node creates the distribution table of the logical data table according to the distribution table creation indication.

In another aspect, an embodiment of the present invention further provides a control node, including:

a processing unit, configured to set a distribution table creation indication according to a creation rule before data query, where the distribution table creation indication carries an identifier (ID) of a logical data table and a distribution column identifier of a selected distribution column, the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node;

and a sending unit, configured to send the distribution table creation indication set by the processing unit to a data node, so that the data node creates the distribution table of the logical data table according to the distribution table creation indication.

Finally, an embodiment of the present invention further provides a data distribution system, including:

a control node, configured to set a distribution table creation indication according to a creation rule before data query, and to send the distribution table creation indication to a data node, where the distribution table creation indication carries an identifier (ID) of a logical data table and a distribution column identifier of a selected distribution column, the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node;

and a data node, configured to receive, before data query, the distribution table creation indication sent by the control node, and to create the distribution table of the logical data table according to the distribution table creation indication.

The data distribution method, apparatus, and system provided by the embodiments of the present invention can, before data query and according to the creation rule, distribute and store a plurality of logical data tables on the data nodes according to the same distribution column; the distribution tables created on the data nodes after this distributed storage are used for subsequent data queries. This solves the problem that, during a join query on distribution columns, a plurality of logical data tables must be redistributed and stored on the data nodes when the distribution columns of the distribution tables that have a query relationship differ. It avoids the query delay caused by migrating a large amount of data during the join query and improves query efficiency.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of an architecture of a parallel database system;

FIG. 2 is a schematic diagram of a logical data table in a parallel database system;

FIG. 3 is a schematic diagram of a data node creating a distribution table;

FIG. 4 is a schematic diagram of another data node creating a distribution table;

FIG. 5 is a flow chart of a method of data distribution in an embodiment of the present invention;

FIG. 6 is a flow chart of another method of data distribution in an embodiment of the present invention;

FIG. 7 is a diagram illustrating an application scenario in an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a control node according to an embodiment of the present invention;

FIG. 9 is a system diagram illustrating data distribution according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 1 shows the architecture of a parallel database system, in which one control node is connected to three data nodes. The control node and each data node have their own Central Processing Unit (CPU) and storage (internal memory and hard disk), and the nodes communicate with one another over a high-speed network (for example, Ethernet or a fiber-optic switching network). The control node is mainly used for: 1. distributing and storing the data to be managed on the data nodes according to the fields of a logical data table and a preset algorithm, where the logical data table is a data table that has data structure attributes and is stored in the control node, and is the logical basis on which the data nodes create distribution tables; 2. providing a query interface for a client, such as Structured Query Language (SQL), Java Database Connectivity (JDBC), or Open Database Connectivity (ODBC); 3. processing the query results fed back by the data nodes according to the query request of the client. The data node is mainly used for: creating, under the control of the control node, a distribution table of the logical data table according to the distribution column; performing data migration with other data nodes; and serving as an independent database node that stores and queries the distribution tables.

Take the storage of teaching data as an example. In FIG. 2 there are three logical data tables: logical data table A (hereinafter, table A) stores student numbers and names; logical data table B (hereinafter, table B) stores student numbers, course identifiers, and scores; and logical data table C (hereinafter, table C) stores course identifiers and course names. Any column (also called a field or attribute) of a logical data table can serve as the distribution column for distributed storage of that table; for example, the student number, course identifier, and score are the three fields of table B. The control node takes the student number field as the distribution column and distributes the data of table A to the three data nodes using a hash modulo 3 algorithm. As shown in FIG. 3, student number 1 divided by 3 leaves a remainder of 1, so student number 1 and the corresponding name (that is, the first row of table A) are stored on data node 1; student number 2 divided by 3 leaves a remainder of 2, so the second row of table A is stored on data node 2; and student number 3 divided by 3 leaves a remainder of 0, so the third row of table A is stored on data node 0. The divisor 3 in the hash modulo 3 algorithm is the number of data nodes in the parallel database system. Similarly, the control node distributes table B with the student number field as its distribution column (j) and table C with the course identifier field as its distribution column (k), in both cases using the hash modulo 3 algorithm.

After a logical data table is distributed, the data table created on each data node is referred to as a distribution table of the logical data table. The name of a distribution table may be expressed as logical data table identifier + data node identifier + distribution column identifier; for example, B+mdt1+j on data node 1 indicates that data node 1 created the distribution table of table B with the student number field (j) as the distribution column. When the client executes the query statement select stu_name, course_id from A, B where A.stu_id = B.stu_id, the statement indicates that the name, course identifier, and score data, and the correspondence among them (select stu_name, course_id), are queried through a join on the distribution columns of table A and table B, where table A and table B are each stored on the data nodes with the student number field as the distribution column (A.stu_id = B.stu_id). As a result, data node 1 finds the three course identifiers and three scores corresponding to student number 1, data node 2 finds the three course identifiers and three scores corresponding to student number 2, and data node 0 finds the three course identifiers and three scores corresponding to student number 3. The three data nodes each report their query results to the control node, and the control node aggregates the received results and reports them to the client, which completes the query.
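The hash modulo 3 placement and the node-local join described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the row contents, the score values, and the helper names `node_for`, `distribute`, and `local_join` are assumptions introduced here, and an integer student number (`stu_id`) is used directly as the hash value.

```python
from collections import defaultdict

NUM_DATA_NODES = 3  # the modulo divisor equals the number of data nodes


def node_for(value, num_nodes=NUM_DATA_NODES):
    """Map an integer distribution column value to a data node index."""
    return value % num_nodes


def distribute(rows, column, num_nodes=NUM_DATA_NODES):
    """Split rows (dicts) into per-node lists according to `column`."""
    placement = defaultdict(list)
    for row in rows:
        placement[node_for(row[column], num_nodes)].append(row)
    return dict(placement)


def local_join(rows_a, rows_b, column):
    """Join two row lists held on the same node; no other node is contacted."""
    index = defaultdict(list)
    for a in rows_a:
        index[a[column]].append(a)
    return [{**a, **b} for b in rows_b for a in index.get(b[column], [])]


# Hypothetical sample data for table A and table B.
table_a = [{"stu_id": s, "stu_name": f"Student {s}"} for s in (1, 2, 3)]
table_b = [
    {"stu_id": s, "course_id": c, "score": 60 + 10 * s + c}
    for s in (1, 2, 3)
    for c in (1, 2, 3)
]

# Both tables use stu_id as the distribution column, so matching rows
# always land on the same data node.
placement_a = distribute(table_a, "stu_id")
placement_b = distribute(table_b, "stu_id")
node1_result = local_join(placement_a[1], placement_b[1], "stu_id")
```

Because both tables share the distribution column, each node can answer its part of the join from its own rows; here `node1_result` holds the three course rows of student number 1.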

When the client executes the query statement select stu_id, course_name from B, C where B.course_id = C.course_id, the distribution column of table B is the student number field while the distribution column of table C is the course identifier field. Because the distribution columns of the two tables differ, data node 1 can find only the course name of course identifier 1 for student number 1; it cannot find the course names of course identifiers 2 and 3 for student number 1, and those course names cannot be found for student number 1 on data node 2 or data node 0 either. Data node 2 and data node 0 have the same problem. In this case, the control node needs to redistribute table B with the course identifier field as the distribution column. (Because table C has no student number field, distributing table B and table C by the same distribution column requires redistributing table B by the course identifier field using the hash modulo 3 algorithm and re-creating the distribution table of table B.) The distribution tables re-created for table B on the three data nodes are shown in FIG. 4, where k in the distribution table name is the distribution column identifier indicating that the course identifier field is the distribution column.

After the distribution table of table B is re-created, the control node controls the three data nodes to perform data migration, so that each data node obtains from the other data nodes the data corresponding to the entries newly added to its re-created distribution table. For example, two rows of entries (course 1 and the course 1 score for student number 2 and for student number 3) are newly added to the distribution table B+mdt1+k on data node 1. These two rows were not part of B+mdt1+j, so data node 1 did not obtain their data when it created B+mdt1+j. Data node 1 therefore obtains the course 1 data and course 1 score for student number 2 (that is, the data of the first row entry in B+mdt2+j) from data node 2, and the course 1 data and course 1 score for student number 3 (that is, the data of the first row entry in B+mdt0+j) from data node 0. After data node 2 and data node 0 re-create their distribution tables of table B, they perform the corresponding steps according to their own entries, which is not described in detail here.

After obtaining the data required by the re-created distribution table of table B, a data node adds the data to the corresponding entries of that table, which completes the re-creation. The control node can then control the data nodes to query the course names of the three courses corresponding to each student number according to the distribution table of table C and the re-created distribution table of table B.
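The migration triggered by redistributing table B can be made concrete with a small sketch. It assumes the same modulo 3 placement as the example above; the tuple layout, the score values, and the function name `migration_plan` are hypothetical and serve only to show which rows must move when the distribution column changes from the student number to the course identifier.

```python
NUM_DATA_NODES = 3

# Hypothetical contents of table B as (stu_id, course_id, score) tuples.
table_b = [(s, c, 60 + 10 * s + c) for s in (1, 2, 3) for c in (1, 2, 3)]

STU_ID, COURSE_ID = 0, 1  # positions of the two candidate distribution columns


def migration_plan(rows, old_col, new_col, num_nodes=NUM_DATA_NODES):
    """List (source_node, target_node, row) for every row that changes node."""
    moves = []
    for row in rows:
        src = row[old_col] % num_nodes  # node under the old distribution column
        dst = row[new_col] % num_nodes  # node under the new distribution column
        if src != dst:
            moves.append((src, dst, row))
    return moves


moves = migration_plan(table_b, STU_ID, COURSE_ID)

# Rows that data node 1 must fetch to complete its course-based table.
incoming_node1 = [(src, row) for src, dst, row in moves if dst == 1]
```

With this sample data, six of the nine rows change node, and data node 1 receives the course 1 rows of student numbers 2 and 3 from data node 2 and data node 0 respectively, matching the example in the text. Performing this work at query time is the delay the invention aims to avoid.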

The union of the distribution tables that correspond to the same logical data table on all data nodes completely reflects the content of that logical data table.

An embodiment of the present invention provides a data distribution method, as shown in fig. 5, the method includes the following steps:

501. Before data query, the control node sets a distribution table creation indication according to a creation rule.

The distribution table creation indication carries an identifier (Identity, ID for short) of a logical data table and a distribution column identifier of a selected distribution column, where the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node.

The ID of the logical data table uniquely identifies the logical data table, and the distribution column identifier uniquely identifies the selected distribution column. The distribution table creation indication may carry one or more logical data table IDs and one or more distribution column identifiers. When it carries a plurality of distribution column identifiers, these may identify a plurality of distribution columns in one logical data table or distribution columns in a plurality of logical data tables.

Taking the distribution table B+mdt1+j on data node 1 in FIG. 4 as an example, B is the ID of table B, j is the distribution column identifier indicating that the student number field is the distribution column, and mdt1 is the ID of the data node.

502. The control node sends a distribution table creation indication to the data node.

The data nodes create the distribution tables of the logical data table according to the logical data table ID and the distribution column identifier in the distribution table creation indication, and complete the data migration among the data nodes.
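Steps 501 and 502 can be pictured as a small message exchange. The sketch below is an assumption-laden illustration: the class names `CreationIndication` and `DataNode`, the method `handle`, and the in-memory dictionary are invented here, while the naming scheme (logical data table ID + data node ID + distribution column identifier) follows the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreationIndication:
    """Hypothetical message: pairs of (logical table ID, distribution column ID)."""
    entries: tuple


def distribution_table_name(table_id, node_id, column_id):
    """Compose a name such as 'B+mdt1+j' from its three identifiers."""
    return f"{table_id}+{node_id}+{column_id}"


class DataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.tables = {}  # distribution table name -> rows held by this node

    def handle(self, indication):
        """Create an empty distribution table for every entry not yet present."""
        for table_id, column_id in indication.entries:
            name = distribution_table_name(table_id, self.node_id, column_id)
            self.tables.setdefault(name, [])
        return sorted(self.tables)


# The control node asks data node mdt1 for two distribution tables of table B.
indication = CreationIndication(entries=(("B", "j"), ("B", "k")))
node1 = DataNode("mdt1")
created = node1.handle(indication)
```

After handling the indication, `created` lists `B+mdt1+j` and `B+mdt1+k`; filling them with rows is the data migration step described in the text.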

In the prior art, after a client inputs a query statement (that is, after the query has started), if the distribution tables participating in the join query on distribution columns are not distributed and stored according to the same distribution column, the corresponding logical data tables need to be redistributed on the data nodes according to the same distribution column and data must be migrated among the data nodes, which consumes a large amount of query time and reduces query efficiency. The data distribution method provided by this embodiment of the present invention can select the distribution column according to the creation rule during the data storage stage, and the distribution tables that the data nodes create based on the selected distribution column are used for subsequent join queries on distribution columns. When the client inputs a query statement, the data nodes perform the join query directly on the distribution tables already created according to the preset distribution columns, which saves query time and improves query efficiency.

Further, as a further extension of the embodiment shown in fig. 5, an embodiment of the present invention further provides a data distribution method, as shown in fig. 6, where the method includes the following steps:

601. Before data query, the control node sets a distribution table creation indication according to a creation rule based on statistical results for the logical data tables within a preset period.

According to the statistical results for a logical data table, the control node takes at least one field of the logical data table as the selected distribution column, and adds the distribution column identifier of the selected distribution column and the ID of the logical data table to the distribution table creation indication.

The control node collects statistics on all logical data tables created in the parallel database system, namely table A, table B, and table C in FIG. 2. The statistical results include, within a preset period: the number of times the logical data table is queried, the proportion of the queried entry data of the logical data table to the total entry data of the logical data table, and the number of times a distribution column in the logical data table is queried. The number of times the logical data table is queried is the number of times it participates in a join query on distribution columns within the preset period. The proportion of queried entry data is the ratio of the cumulative number of accesses to any row of entry data in the logical data table to the total entry data of the table within the preset period. For example, suppose a logical data table has three rows of entry data in total and the second threshold is 120%. If the first row is queried 4 times within 5 minutes, the queried amount for that row is 4 rows (a cumulative value), which is 133% of the total entry data of the table (4/3 ≈ 1.33) and exceeds the second threshold of 120%. A row of entry data need not be queried through a join query on distribution columns to be counted. The number of times a distribution column is queried is the number of times any field of the logical data table is queried as a distribution column within the preset period; this is likewise not limited to join queries on distribution columns.
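The 133% figure in the example above can be reproduced with a few lines. This is only a sketch: the function name `queried_ratio` and the access-log representation are assumptions, and it further assumes that accesses to all rows are summed, which the text leaves open.

```python
def queried_ratio(access_log, total_rows):
    """Cumulative number of row accesses divided by the table's row count.

    `access_log` holds one entry per row access within the preset period.
    """
    return len(access_log) / total_rows


SECOND_THRESHOLD = 1.20  # 120%, taken from the example in the text

# The first row of a three-row table is queried 4 times within the period.
ratio = queried_ratio(["row1", "row1", "row1", "row1"], total_rows=3)
exceeds_second_threshold = ratio > SECOND_THRESHOLD

print(f"{ratio:.0%}")  # prints "133%"
```

Because the count is cumulative, the ratio can exceed 100% even when only one row of the table is ever touched.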

Specifically, the method comprises the following steps:

A) Within a preset period, when the number of times the logical data table is queried exceeds a first threshold, the control node adds the ID of the logical data table and the distribution column identifiers of all fields (as distribution columns) in the logical data table to the distribution table creation indication.

B) Within a preset period, when the proportion of the queried entry data in the logical data table to the total entry data in the logical data table exceeds a second threshold, the control node adds the ID of the logical data table and the distribution column identifiers of all fields (as distribution columns) in the logical data table to the distribution table creation indication.

C) Within a preset period, when the number of times a distribution column in the logical data table is queried exceeds a third threshold, the control node adds the ID of the logical data table and the distribution column identifier of each distribution column whose query count exceeds the third threshold to the distribution table creation indication.

In the embodiment of the present invention, the control node may set the distribution table creation indication according to any one of the three statistical results, for example:

When neither the number of times the logical data table is queried nor the proportion of its queried entry data to its total entry data reaches the respective threshold, the control node does not add the ID of the logical data table to the distribution table creation indication. When either of these two conditions reaches its threshold, the control node adds the ID of the logical data table and the distribution column identifiers of all fields (as distribution columns) in the logical data table to the distribution table creation indication. When the number of times a distribution column in the logical data table is queried reaches the third threshold, the control node adds the distribution column identifier of each such distribution column and the ID of the logical data table to the distribution table creation indication.
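Rules A) to C) can be read as a small selection function. The sketch below is one possible reading, not the patented logic: the `TableStats` record, the function `select_columns`, and all threshold values in the usage lines are hypothetical, and "exceeds" is interpreted as a strict comparison.

```python
from dataclasses import dataclass, field


@dataclass
class TableStats:
    """Hypothetical per-table statistics gathered over one preset period."""
    table_id: str
    columns: list
    join_times: int = 0                 # times the table took part in a join query
    queried_ratio: float = 0.0          # queried entry data / total entry data
    column_query_counts: dict = field(default_factory=dict)


def select_columns(stats, first_threshold, second_threshold, third_threshold):
    """Return the distribution columns to add to the creation indication."""
    # Rules A) and B): a frequently queried table gets all fields as columns.
    if stats.join_times > first_threshold or stats.queried_ratio > second_threshold:
        return list(stats.columns)
    # Rule C): otherwise only the columns that are themselves queried often.
    return [
        col for col in stats.columns
        if stats.column_query_counts.get(col, 0) > third_threshold
    ]


hot_table = TableStats("B", ["j", "k", "s"], join_times=9)
warm_table = TableStats("C", ["k", "n"], column_query_counts={"k": 12, "n": 1})

hot_choice = select_columns(hot_table, 6, 1.2, 8)    # rule A) fires
warm_choice = select_columns(warm_table, 6, 1.2, 8)  # only rule C) fires
```

An empty result means that the table's ID is not added to the distribution table creation indication at all.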

Optionally, the control node may also combine the three statistical results as the basis for adding the logical data table ID and the distribution column identifier. For example, within a preset period of 5 minutes, if a logical data table satisfies the logical expression (JoinTimes > 6) and (PaccessedLinesPercentage > 180%) and (CFrenqence > 8), the ID of the logical data table and the corresponding distribution column identifier are added to the distribution table creation indication. Here, (JoinTimes > 6) indicates that the logical data table is queried more than 6 times; (PaccessedLinesPercentage > 180%) indicates that the queried amount of some entry data in the logical data table exceeds 180% of the total entry data of the table; (CFrenqence > 8) indicates that some distribution column in the logical data table is queried more than 8 times; and "and" indicates that the three decision conditions are in an AND relationship, that is, all three must be satisfied simultaneously. Alternatively, the logical expression for a logical data table may be (PaccessedLinesPercentage > 5 / JoinTimes > 50%) and (CFrenqence > 6), where "/" represents an OR relationship, that is, either one of the two decision conditions suffices.
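The first combined expression can be written directly as a predicate. The function and parameter names below are illustrative stand-ins for JoinTimes, the accessed-lines percentage, and CFrenqence; the percentage is passed as a fraction, so 180% becomes 1.80.

```python
def combined_rule(join_times, accessed_lines_ratio, column_frequency):
    """All three conditions must hold at once (AND relationship)."""
    return (
        join_times > 6
        and accessed_lines_ratio > 1.80  # 180% expressed as a fraction
        and column_frequency > 8
    )


# Statistics gathered over one 5-minute period (hypothetical values).
qualifies = combined_rule(join_times=7, accessed_lines_ratio=2.0, column_frequency=9)
too_few_joins = combined_rule(join_times=6, accessed_lines_ratio=2.0, column_frequency=9)
```

Here `qualifies` is true, while `too_few_joins` is false because the first condition requires strictly more than 6 queries.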

The statement by which the control node controls the data node to create the distribution table is as follows:

The emphasized text is a definition newly added to the current standard SQL language and is explained as follows:

[Distribution on KEY column_name [, column_name, …] … allowMultipleDistribution, where column_name is a distribution column identifier, column_name [, column_name, …] indicates that a plurality of distribution columns may be used as arguments of the distribution algorithm, and allowMultipleDistribution indicates that creating multiple distribution tables is permitted.

In addition, the control node can add the ID of the logical data table and the distribution column identifier of the selected distribution column of the logical data table to the distribution table creation indication according to a creation instruction from the client. For example, the creation instruction received by the control node includes a logical data table ID selected by a querier or a database administrator and the distribution column identifier of at least one distribution column in the logical data table. The control node adds the logical data table ID and the distribution column identifier of the selected distribution column carried in the creation instruction to the distribution table creation indication.

Furthermore, to save storage space on the data nodes, the control node may periodically delete distribution tables on the data nodes that are not queried or are rarely queried. Specifically, the control node counts the number of times each distribution table on the data nodes is queried. If, within a preset period, a distribution table has been queried fewer times than a fourth threshold, the control node sends a distribution table deletion indication to the data node to which that distribution table belongs. The distribution table deletion indication carries the distribution column identifier of the distribution column corresponding to that distribution table and the ID of the logical data table to which the distribution column belongs. After receiving the distribution table deletion indication, the data node deletes the corresponding distribution table according to the logical data table ID and the distribution column identifier carried in the distribution table deletion indication.
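The clean-up step can be sketched as a filter over per-table query counters. The counter layout, the function name `deletion_indications`, and the threshold value are assumptions made for illustration.

```python
def deletion_indications(query_counts, fourth_threshold):
    """Group rarely queried distribution tables by the data node that holds them.

    `query_counts` maps (node_id, table_id, column_id) to the number of times
    that distribution table was queried within the preset period.
    """
    indications = {}
    for (node_id, table_id, column_id), count in query_counts.items():
        if count < fourth_threshold:
            # The indication carries the logical table ID and the column ID.
            indications.setdefault(node_id, []).append((table_id, column_id))
    return indications


# Hypothetical counters collected by the control node.
counts = {
    ("mdt1", "B", "j"): 0,
    ("mdt1", "B", "k"): 14,
    ("mdt2", "B", "j"): 1,
}
to_delete = deletion_indications(counts, fourth_threshold=2)
```

With these counters, data nodes mdt1 and mdt2 are each asked to drop their student-number-based distribution table of table B, while the frequently queried `B+mdt1+k` is kept.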

Further, to prevent the data node from creating the same distribution table repeatedly, the control node may, when setting the distribution table creation indication, determine whether the data node has already created a distribution table of the logical data table for the logical data table ID and the distribution column identifier that the control node intends to add to the distribution table creation indication. If the data node has already created such a distribution table, the control node does not add that logical data table ID and distribution column identifier to the distribution table creation indication.
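The duplicate check amounts to filtering candidate entries against what already exists. In the sketch below, the function name `filter_new_entries` and the representation of existing distribution tables as (table ID, column ID) pairs are hypothetical.

```python
def filter_new_entries(candidates, existing):
    """Keep only (table_id, column_id) pairs that no distribution table covers yet."""
    seen = set(existing)
    result = []
    for pair in candidates:
        if pair not in seen:
            seen.add(pair)  # also drops duplicates within the candidate list
            result.append(pair)
    return result


# The data node already holds B distributed by j; the rules propose j and k.
existing_tables = [("B", "j")]
proposed = [("B", "j"), ("B", "k"), ("B", "k")]
indication_entries = filter_new_entries(proposed, existing_tables)
```

Only `("B", "k")` survives, so the creation indication requests a single new distribution table.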

602. The control node sends a distribution table creation indication to the data node.

The control node sends the distribution table creation indication to the data node, so that the data node creates a distribution table for the logical data table according to the logical data table ID and the distribution column identifier in the distribution table creation indication and a preset distribution algorithm.

Specifically, the distribution tables created for table B are shown as B+mdt1+k, B+mdt2+k, and B+mdt0+k in FIG. 4. A data node may create a distribution table with one field of the logical data table as the distribution column (as shown in FIG. 3), or with two or more fields of the logical data table as distribution columns; for example, the sum of the student number field and the score field may be used as the argument of the modulo 3 algorithm. When a data node creates distribution tables based on a single field (for example, the student number field), the maximum number of distribution tables it can create equals the number of fields of the logical data table. When distribution tables are created from two or more fields, the maximum number that the data node can create is the number of permutations and combinations of at least two fields of the logical data table, which is larger than the number of fields in the logical data table.
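The counting argument and the multi-field modulo key can be illustrated for the three fields of table B. This sketch counts unordered combinations only, which is an assumption (the text speaks of permutations and combinations, so the true upper bound may be higher); the field names and the helper `node_for_row` are illustrative.

```python
from itertools import combinations

fields = ["stu_id", "course_id", "score"]

# One distribution table per single field.
single_field_keys = [(f,) for f in fields]

# One distribution table per unordered set of two or more fields.
multi_field_keys = [
    combo
    for size in range(2, len(fields) + 1)
    for combo in combinations(fields, size)
]


def node_for_row(row, key_fields, num_nodes=3):
    """Use the sum of the chosen fields as the argument of the modulo algorithm."""
    return sum(row[f] for f in key_fields) % num_nodes


row = {"stu_id": 1, "course_id": 2, "score": 85}
target = node_for_row(row, ("stu_id", "score"))  # (1 + 85) % 3
```

For three fields this yields 3 single-field keys and 4 multi-field keys, consistent with the statement that the multi-field count exceeds the number of fields; the sample row is placed on data node 2.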

In the embodiment of the present invention, the distribution algorithm used when the data nodes distribute or create the distribution table includes, but is not limited to, a hash algorithm, a range algorithm, and a round robin algorithm.

After the data node completes creation of the distribution table according to the distribution table creation instruction, data included in the newly created distribution table but not stored in the data node needs to be acquired from other data nodes, so that data migration between the data nodes is completed.

For example, in fig. 4, after creating the distribution table B + mdt1+ k, data node 1 obtains all the entry data in the B + mdt2+ j table and the B + mdt0+ j table sent by data node 2 and data node 0.

Further, to reduce the data migration amount, the data node 1 may also obtain only the data of the first row entry in B + mdt2+ j sent by the data node 2, and the data of the first row entry in B + mdt0+ j sent by the data node 0.
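The migration step can be pictured with the following sketch, which assumes a modulo placement on a single distribution column and node IDs numbered from 0. The function name `migrate_by_column` and the row layout are hypothetical. Rows are copied rather than moved, so the source node keeps its data, and only rows whose destination differs from their source count as transferred.

```python
def migrate_by_column(original_placement, column):
    """Build the new distribution tables for one distribution column.

    original_placement maps node ID (0..n-1) to the rows that node
    already stores. Returns the new per-node tables and the number of
    rows that had to cross between nodes.
    """
    num_nodes = len(original_placement)
    new_tables = {node_id: [] for node_id in original_placement}
    transferred = 0
    for source_id, rows in original_placement.items():
        for row in rows:
            dest_id = row[column] % num_nodes
            new_tables[dest_id].append(dict(row))  # copy; the source row stays
            if dest_id != source_id:
                transferred += 1
    return new_tables, transferred


# Hypothetical table B, originally distributed by stu_id modulo 3.
placement = {
    0: [{"stu_id": 3, "course_id": c} for c in (1, 2, 3)],
    1: [{"stu_id": 1, "course_id": c} for c in (1, 2, 3)],
    2: [{"stu_id": 2, "course_id": c} for c in (1, 2, 3)],
}
by_course, moved = migrate_by_column(placement, "course_id")
```

In this example each node already holds one of the three rows it needs, so six of the nine rows cross between nodes.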

After the data node creates the distribution table and performs data migration, when the client initiates a query request, the control node sends a query instruction to the data node so that the data node performs a distribution column joint query.

When performing the distribution column joint query, the distribution table with the smaller query cost may be selected for the query, where the query cost includes but is not limited to the data migration amount. For example, when the client executes the query statement Select stu_id, course_name from B, C where B.course_id = C.course_id, the data tables participating in the distribution column joint query on data node 1 are B + mdt1+ k and C + mdt1+ k, and the two tables are distributed according to the same distribution column (course identification k). Data node 1 can search for and report to the control node the achievements of course 1 corresponding to the three school numbers. Similarly, data node 2 and data node 0 can search for and report to the control node the achievements of courses 2 and 3 corresponding to the three school numbers. The control node summarizes the query results reported by the three data nodes and feeds them back to the client to complete the query.
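A minimal sketch of why co-distributed tables allow a purely local join is given below. The row layout, the course names, and the function names `local_join` and `control_node_query` are assumptions for illustration; the sketch assumes tables B and C have already been distributed on `course_id`, so matching rows sit on the same data node.

```python
def local_join(b_rows, c_rows):
    """Join B and C on course_id using only rows stored on one data node."""
    name_by_course = {c["course_id"]: c["course_name"] for c in c_rows}
    return [
        (b["stu_id"], name_by_course[b["course_id"]])
        for b in b_rows
        if b["course_id"] in name_by_course
    ]


def control_node_query(nodes):
    """Collect the local join results of every data node, as the control node would."""
    results = []
    for node_id in sorted(nodes):
        b_rows, c_rows = nodes[node_id]
        results.extend(local_join(b_rows, c_rows))
    return results


def b_part(course_id):
    # Hypothetical slice of table B held by one node: three students, one course.
    return [{"stu_id": s, "course_id": course_id} for s in (1, 2, 3)]


nodes = {
    0: (b_part(3), [{"course_id": 3, "course_name": "physics"}]),
    1: (b_part(1), [{"course_id": 1, "course_name": "math"}]),
    2: (b_part(2), [{"course_id": 2, "course_name": "history"}]),
}
answer = control_node_query(nodes)
```

No row leaves its node during the join; only the per-node results travel to the control node.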

The embodiment of the present invention is described by taking an example in which two data tables participate in the distribution column joint query; in practical applications, the number of data tables participating in the distribution column joint query may be three or more.

In the prior art, the distribution table is established and data migration is performed according to the query condition after the client inputs the query statement (i.e., after the query is started), and data migration during the query occupies a large amount of query time, thereby reducing the query efficiency. The data distribution method provided by the embodiment of the invention can select the distribution column in the data storage stage according to the statistical result of preset data or the indication of the client, and the distribution table that the data node creates based on the selected distribution column is used for subsequent distribution column joint queries. When the client inputs the query statement, the data nodes directly carry out the distribution column joint query using the distribution tables already created according to the preset distribution columns, so that the query time can be saved and the query efficiency can be improved.

In an application scenario of the embodiment of the present invention, as shown in fig. 7, the control node may again instruct the data nodes to create a distribution table for table B according to a client instruction or a statistical result. Specifically:

701. The client sends a distribution table creation instruction to the control node, instructing it to create a distribution table for table B in the parallel database system.

702. The control node sends a distribution table creation instruction to each of the three data nodes, where the distribution table creation instruction carries the ID of table B and the identifier of a distribution column (the course identification field).

703. The three data nodes create a distribution table for table B according to the distribution table creation instruction.

704. The control node sends a data migration instruction to each of the three data nodes.

705. The data nodes perform data migration according to the data migration instruction.

706. The data nodes send a creation success message to the control node.

707. After receiving the creation success messages sent by the data nodes, the control node sends a creation success message to the client.
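The message flow of steps 701 to 707 can be sketched as the following toy simulation. The class names, method names, and the "creation failed" reply are assumptions made for this sketch; real nodes would exchange network messages and move actual rows, which is only marked here.

```python
class DataNode:
    """Toy data node that records which distribution tables it holds."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.distribution_tables = set()
        self.migrated = set()

    def create_distribution_table(self, table_id, column_id):
        # Step 703: create the distribution table named in the instruction.
        self.distribution_tables.add((table_id, column_id))

    def migrate(self, table_id, column_id):
        # Step 705: data migration (only recorded here, no rows are moved).
        self.migrated.add((table_id, column_id))
        # Step 706: report success to the control node.
        return "creation success"


class ControlNode:
    """Toy control node that relays the client's request to all data nodes."""

    def __init__(self, data_nodes):
        self.data_nodes = data_nodes

    def handle_client_request(self, table_id, column_id):
        # Step 701 has happened: the client's creation instruction arrived.
        for node in self.data_nodes:
            # Step 702: send the creation instruction (table ID + column ID).
            node.create_distribution_table(table_id, column_id)
        # Step 704: send the data migration instruction to every data node.
        replies = [node.migrate(table_id, column_id) for node in self.data_nodes]
        # Step 707: answer the client once every data node has replied.
        if all(reply == "creation success" for reply in replies):
            return "creation success"
        return "creation failed"  # hypothetical; not described in the source


data_nodes = [DataNode(i) for i in range(3)]
control = ControlNode(data_nodes)
reply = control.handle_client_request("B", "course_id")
```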

In the application scenario shown in fig. 7, a distribution table can be created on each data node, and data migration between the data nodes completed, before the client queries. Because the distribution table is created before the client queries, the distribution column used to create it cannot be determined from the keywords in a query statement. The distribution table is therefore created according to the statistical result of user query data, or according to the logical data table ID and the distribution column set in advance by the client, and the created distribution table is used for subsequent distribution column joint queries. Because the steps of creating the distribution table and migrating data are moved to before the client queries, the delay that these steps would otherwise add to the query is avoided, and the query efficiency can be improved.

In the embodiment of the present invention and the application scenario shown in fig. 7, data migration specifically means that the source data node copies data stored by itself and transmits the copied data to the destination data node. The data stored in the source data node still exists after data migration. In practical applications, data migration is a technical means known to those skilled in the art, and is not described in more detail in the embodiments of the present invention.

Referring to implementation of the method embodiment shown in fig. 6, an embodiment of the present invention further provides a control node, which is used to implement the method embodiment shown in fig. 6. As shown in fig. 8, the control node includes: a processing unit 81, a transmitting unit 82, and a receiving unit 83, wherein,

the processing unit 81 is configured to set a distribution table creation instruction according to a creation rule before data query, where the distribution table creation instruction carries an ID of a logical data table and a distribution column identifier of a selected distribution column, where the selected distribution column is a distribution column in the logical data table, and the logical data table is a logical data table created in the control node;

the sending unit 82 is configured to send the distribution table creation instruction set by the processing unit 81 to a data node, so that the data node creates the distribution table of the logical data table according to the distribution table creation instruction.

Further, the processing unit 81 is specifically configured to: count the data of the logical data table in a preset period to obtain a statistical result, and add the ID of the logical data table and the distribution column identifier of the selected distribution column to the distribution table creation indication according to the statistical result.

Further, the receiving unit 83 is configured to receive a creation instruction of a client, where the creation instruction carries an ID of the logical data table and a distribution column identifier of the selected distribution column;

the processing unit 81 is further specifically configured to:

adding the ID of the logical data table and the distribution column identifier of the selected distribution column carried in the creation indication received by the receiving unit 83 to the distribution table creation indication.

Further, the processing unit 81 is further specifically configured to: count at least one of the following data in a preset period: the number of times the logical data table is queried, the proportion of the queried entry data in the logical data table to the total entry data of the logical data table, and the number of times each distribution column in the logical data table is queried;

when the number of times that the logical data table is queried exceeds a first threshold, add the ID of the logical data table and the distribution column identifiers of all distribution columns in the logical data table to the distribution table creation indication;

when the proportion of the queried entry data in the logical data table to the total entry data of the logical data table exceeds a second threshold, add the ID of the logical data table and the distribution column identifiers of all distribution columns in the logical data table to the distribution table creation indication;

when the number of times a distribution column in the logical data table is queried exceeds a third threshold, add the ID of the logical data table and the distribution column identifier of each distribution column whose queried count exceeds the third threshold to the distribution table creation indication.
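The three threshold rules above can be sketched as a single selection function. The statistics dictionary layout, the key names, and the function name `select_distribution_columns` are hypothetical; statistics that were not collected are treated as zero, since the text only requires at least one of them to be counted.

```python
def select_distribution_columns(stats, all_columns, first, second, third):
    """Choose which distribution columns go into the creation indication.

    stats may contain:
      "table_queries"  - times the logical data table was queried
      "queried_ratio"  - queried entries / total entries of the table
      "column_queries" - dict mapping column name to its query count
    """
    # Rules 1 and 2: a hot table gets a distribution table for every column.
    if stats.get("table_queries", 0) > first:
        return list(all_columns)
    if stats.get("queried_ratio", 0.0) > second:
        return list(all_columns)
    # Rule 3: otherwise only the individually hot columns are selected.
    column_queries = stats.get("column_queries", {})
    return [col for col in all_columns if column_queries.get(col, 0) > third]


columns = ["stu_id", "course_id", "score"]
stats = {
    "table_queries": 5,
    "queried_ratio": 0.1,
    "column_queries": {"stu_id": 2, "course_id": 40},
}
hot_columns = select_distribution_columns(stats, columns, 100, 0.5, 10)
```

With the sample statistics only `course_id` exceeds the third threshold, so only that column is selected.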

Further, the sending unit 82 is further configured to: when the processing unit 81 determines from its statistics that the number of times a distribution table corresponding to a distribution column in the logical data table is queried is smaller than a fourth threshold, send a distribution table deletion instruction to the data node, where the distribution table deletion instruction carries the ID of the logical data table and the distribution column identifier of the distribution column corresponding to the distribution table whose queried count is smaller than the fourth threshold, and the distribution table deletion instruction is used to instruct the data node to delete the distribution table whose queried count is smaller than the fourth threshold.
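As a rough sketch of the deletion rule, the function below builds one deletion instruction per rarely queried distribution table. The instruction layout (a dictionary with `table_id` and `distribution_column`) and the function name are assumptions for illustration.

```python
def build_deletion_instructions(table_id, query_counts, fourth_threshold):
    """Return a deletion instruction for each distribution table whose
    query count in the statistics period is below the fourth threshold.

    query_counts maps a distribution column to the number of times the
    distribution table built on that column was queried.
    """
    return [
        {"table_id": table_id, "distribution_column": column}
        for column, count in sorted(query_counts.items())
        if count < fourth_threshold
    ]


deletions = build_deletion_instructions(
    "B", {"stu_id": 1, "course_id": 40, "score": 0}, fourth_threshold=5
)
```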

Further, the processing unit 81 is further configured to: before the control node adds the ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication, judging whether the data node creates the distribution table of the logic data table according to the ID of the logic data table and the distribution column identification of the selected distribution column;

when the data node does not create the distribution table of the logic data table according to the ID of the logic data table and the distribution column identification of the selected distribution column, adding the ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication;

when the data node has created the distribution table of the logical data table according to the ID of the logical data table and the distribution column identifier of the selected distribution column, not adding the ID of the logical data table and the distribution column identifier of the selected distribution column to the distribution table creation indication.
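The existence check above amounts to filtering out pairs that already have a distribution table. The sketch below assumes the control node keeps a set of (table ID, distribution column) pairs already created on the data nodes; that bookkeeping structure and the function name are hypothetical.

```python
def build_creation_indication(table_id, selected_columns, already_created):
    """Keep only the (table ID, distribution column) pairs for which
    no distribution table has been created yet."""
    return [
        (table_id, column)
        for column in selected_columns
        if (table_id, column) not in already_created
    ]


existing = {("B", "stu_id")}
indication = build_creation_indication("B", ["stu_id", "course_id"], existing)
```

Skipping pairs that already exist avoids creating the same distribution table twice and repeating its data migration.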

The control node provided by the embodiment of the invention can select the distribution column in the data storage stage according to the statistical result of preset data or the indication of the client, and the distribution table that the data node creates based on the selected distribution column is used for subsequent distribution column joint queries. When the client inputs the query statement, the data nodes directly carry out the distribution column joint query using the distribution tables already created according to the preset distribution columns, so that the query time can be saved and the query efficiency can be improved.

Further, the embodiment of the present invention also provides a data distribution system, as shown in fig. 9, the system includes a control node 91 and at least three data nodes 92, wherein,

the control node 91 is configured to set a distribution table creation instruction according to a creation rule before data query, where the distribution table creation instruction carries an identifier ID of a logical data table and a distribution column identifier of a selected distribution column, where the selected distribution column is a distribution column in the logical data table, and the logical data table is a created logical data table in the control node, and send the distribution table creation instruction to the data node 92.

The data node 92 is configured to receive the distribution table creation instruction sent by the control node 91 before data query, and create the distribution table of the logical data table according to the distribution table creation instruction.

The data distribution system provided by the embodiment of the present invention is described by taking three data nodes 92 as an example, and the number of the data nodes 92 is not limited in practical application.

In the data distribution system provided by the embodiment of the invention, the control node can select the distribution column in the data storage stage according to the statistical result of preset data or the indication of the client, and the distribution table that the data node creates based on the selected distribution column is used for subsequent distribution column joint queries. When the client inputs the query statement, the data nodes directly carry out the distribution column joint query using the distribution tables already created according to the preset distribution columns, so that the query time can be saved and the query efficiency can be improved.

It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A method of data distribution, comprising:

the control node sends the distribution table creation instruction to a data node so that the data node creates a distribution table of the logic data table according to the distribution table creation instruction;

the control node sets a distribution table creation instruction according to a creation rule, and specifically includes: the control node judges whether the data node creates a distribution table of the logic data table according to the identification ID of the logic data table and the distribution column identification of the selected distribution column; when the data node does not create the distribution table of the logic data table according to the identification ID of the logic data table and the distribution column identification of the selected distribution column, the control node adds the identification ID of the logic data table and the distribution column identification of the selected distribution column to the distribution table creation indication; when the data node has created the distribution table of the logical data table according to the identification ID of the logical data table and the distribution column identification of the selected distribution column, the control node does not add the identification ID of the logical data table and the distribution column identification of the selected distribution column to the distribution table creation indication.

2. The method according to claim 1, wherein the control node sets a distribution table creation instruction according to a creation rule, specifically comprising:

the control node counts the data of the logic data table in a preset period to obtain a statistical result, and adds the identification ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication according to the statistical result; or,

the control node receives a creation instruction of a client, the creation instruction carries the identification ID of the logic data table and the distribution column identification of the selected distribution column, and the control node adds the identification ID of the logic data table and the distribution column identification of the selected distribution column carried in the creation instruction to the distribution table creation instruction.

3. The method according to claim 2, wherein the counting of the data of the logical data table by the control node in a preset period specifically includes:

in a preset period, the control node counts at least one of the following data: the number of times the logical data table is queried, the proportion of the queried entry data in the logical data table to the total entry data of the logical data table, and the number of times each distribution column in the logical data table is queried;

adding the identifier ID of the logical data table and the distribution column identifier of the selected distribution column to the distribution table creation instruction according to the statistical result specifically includes:

when the number of times that the logical data table is queried exceeds a first threshold, the control node adds the identification ID of the logical data table and the distribution column identifications of all distribution columns in the logical data table to the distribution table creation indication; and/or,

when the proportion of the queried entry data in the logical data table to the total entry data of the logical data table exceeds a second threshold, the control node adds the identification ID of the logical data table and the distribution column identifications of all distribution columns in the logical data table to the distribution table creation indication; and/or,

when the number of times a distribution column in the logical data table is queried exceeds a third threshold, the control node adds the identification ID of the logical data table and the distribution column identification of each distribution column whose queried count exceeds the third threshold to the distribution table creation indication.

4. The method of claim 3, wherein when the number of times that the distribution table corresponding to the distribution column in the logical data table is queried is less than a fourth threshold, the method further comprises:

and the control node sends a distribution table deletion instruction to the data node, wherein the distribution table deletion instruction carries the distribution column identifier of the distribution column corresponding to the distribution table with the queried number of times smaller than the fourth threshold and the identifier ID of the logic data table, and the distribution table deletion instruction is used for indicating the data node to delete the distribution table with the queried number of times smaller than the fourth threshold.

5. A control node, comprising:

a sending unit, configured to send the distribution table creation instruction set by the processing unit to a data node, so that the data node creates a distribution table of the logical data table according to the distribution table creation instruction;

the processing unit is specifically configured to determine whether the data node has created a distribution table of the logical data table according to the identifier ID of the logical data table and the distribution column identifier of the selected distribution column; when the data node does not create the distribution table of the logic data table according to the identification ID of the logic data table and the distribution column identification of the selected distribution column, adding the identification ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication; when the data node has created the distribution table of the logical data table according to the identification ID of the logical data table and the distribution column identification of the selected distribution column, not adding the identification ID of the logical data table and the distribution column identification of the selected distribution column to the distribution table creation indication.

6. The control node according to claim 5, wherein the processing unit is specifically configured to:

and counting the data of the logic data table in a preset period to obtain a statistical result, and adding the identification ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication according to the statistical result.

7. The control node according to claim 5, further comprising a receiving unit, configured to receive a creation instruction of a client, where the creation instruction carries an identifier ID of the logical data table and a distribution column identifier of the selected distribution column;

the processing unit is further specifically configured to:

and adding the identification ID of the logic data table and the distribution column identification of the selected distribution column carried in the creation indication received by the receiving unit into the distribution table creation indication.

8. The control node of claim 6, wherein the processing unit is further specifically configured to:

count at least one of the following data in a preset period: the number of times the logical data table is queried, the proportion of the queried entry data in the logical data table to the total entry data of the logical data table, and the number of times each distribution column in the logical data table is queried;

when the number of times that the logical data table is queried exceeds a first threshold value, adding the identification ID of the logical data table and the distribution column identifications of all distribution columns in the logical data table into a distribution table creation indication;

when the proportion of the inquired table entry data in the logic data table to the total table entry data in the logic data table exceeds a second threshold value, adding the identification ID of the logic data table and the distribution column identifications of all distribution columns in the logic data table to the creation indication of the distribution table;

when the number of times a distribution column in the logical data table is queried exceeds a third threshold, adding the identification ID of the logical data table and the distribution column identification of each distribution column whose queried count exceeds the third threshold to the distribution table creation indication.

9. The control node according to claim 8, wherein the sending unit is further configured to: when the processing unit determines from its statistics that the number of times a distribution table corresponding to a distribution column in the logical data table is queried is smaller than a fourth threshold, send a distribution table deletion instruction to the data node, where the distribution table deletion instruction carries the identifier ID of the logical data table and the distribution column identifier of the distribution column corresponding to the distribution table whose queried count is smaller than the fourth threshold, and the distribution table deletion instruction is used to instruct the data node to delete the distribution table whose queried count is smaller than the fourth threshold.

10. A system for data distribution, the system comprising a control node and a data node, wherein:

the control node is configured to set a distribution table creation instruction according to a creation rule before data query, where the distribution table creation instruction carries an identifier ID of a logical data table and a distribution column identifier of a selected distribution column, where the selected distribution column is a distribution column in the logical data table, and the logical data table is a created logical data table in the control node and sends the distribution table creation instruction to the data node;

the data node is configured to receive the distribution table creation instruction sent by the control node before data query, and create the distribution table of the logical data table according to the distribution table creation instruction;

the control node is specifically configured to determine whether the data node has created a distribution table of the logical data table according to the identifier ID of the logical data table and the distribution column identifier of the selected distribution column; when the data node does not create the distribution table of the logic data table according to the identification ID of the logic data table and the distribution column identification of the selected distribution column, adding the identification ID of the logic data table and the distribution column identification of the selected distribution column into the distribution table creation indication; when the data node has created the distribution table of the logical data table according to the identification ID of the logical data table and the distribution column identification of the selected distribution column, not adding the identification ID of the logical data table and the distribution column identification of the selected distribution column to the distribution table creation indication.