CN111782654A

CN111782654A - Method for storing data in distributed database in partition mode

Info

Publication number: CN111782654A
Application number: CN202010617993.2A
Authority: CN
Inventors: 张豪; 季业; 刘阳; 刘壮; 王世航; 陈明松
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2020-07-01
Filing date: 2020-07-01
Publication date: 2020-10-16

Abstract

The invention provides a method for partitioned storage of data in a distributed database, and belongs to the field of distributed databases. The method introduces a structure, TABLE_RELATION, to record the user's query history, which is convenient and effective. Taking connection frequency as the basis for table dumping, the method further analyzes the connection graph between tables and stores all tables of a strongly connected subgraph in the same partition or node; a few key intermediate nodes in the graph are stored redundantly, which preserves query efficiency and also protects the safety and reliability of key information. Through partition dumping of tables, many queries that originally had to be performed across partitions or nodes can be completed within a single partition or node, which improves query efficiency and reduces query time. The method requires no substantial change to the existing database system and is convenient to implement and deploy.

Description

Method for storing data in distributed database in partition mode

Technical Field

The invention belongs to the technical field of distributed databases, and particularly relates to a method for storing data in a distributed database in a partitioned mode.

Background

Insert, delete, update and query are the most common database operations. Carrying them out inevitably requires accessing table data, and in many cases more than one table is accessed. For example, querying information about a student and the school the student attends requires connecting the two tables student and unity and returning the rows that satisfy the condition.

For a traditional database, perhaps the most important factor affecting the efficiency of a connection between tables is the size of the Cartesian product of the two tables; for a distributed database, the communication time between partitions that are far apart also plays an important role.

A distributed database typically includes multiple storage nodes, each storing different data (leaving redundancy aside). Queries often touch several tables at once rather than a single table. In that case it is not known in advance which partitions (nodes) hold the tables involved. If the tables are stored in the same partition (node), the query is served locally; if they are stored in different partitions (nodes), the query must run across partitions (nodes), which makes it slow. This problem therefore needs to be optimized to improve the query efficiency of distributed databases.

Disclosure of Invention

The technical task of the invention is to overcome the shortcomings of the prior art and provide a method for partitioned storage of data in a distributed database, so that the distributed database executes queries more efficiently and query time is reduced.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A method for partitioned storage of data in a distributed database, characterized in that the method measures the strength of the relation between tables based on the frequency of table-to-table connection, and then introduces an undirected-graph processing method to further partition the tables.

Preferably, the scheme introduces a structure, and records the connection and relation between tables and partitions.

Preferably, the connection between tables and partitions is captured in the following two aspects:

In one aspect, the frequency of table connection is used to represent the strength of the relation between a table and each storage partition (node), specifically:

a TABLE_RELATION structure is introduced, and the relation strength between each table and each partition is recorded by maintaining, in TABLE_RELATION, the connection relation between a table and the partitions (nodes);

each operation that involves a table connection changes the records in TABLE_RELATION of the tables taking part in that connection; that is, starting from its initial state, TABLE_RELATION changes as certain DML operations are executed;

the tables are then stored in different partitions (nodes) according to the relation strength, expressed by connection frequency, recorded in TABLE_RELATION.

In the other aspect, a connection relation graph between tables is generated from the connection relations between the tables, and all nodes (each representing a table) of a strongly connected subgraph of that graph are stored in the same partition.

Preferably, in the connection relation graph, a node represents a table, and an edge indicates whether two tables are connected.

Preferably, for a table to which both aspects apply, the latter (the graph-based placement) takes priority.

Preferably,

taking the strength of the connection between tables and partitions as the index, if the partition with which a table is most strongly connected changes, a partition dump of that table is considered;

from the perspective of the connection relation graph, when a strongly connected subgraph appears in the graph, all tables that are nodes of the same strongly connected subgraph should be stored in the same partition (storage node).

Preferably, for the few tables that act as intermediate nodes between several strongly connected subgraphs, redundant storage on different partitions should be considered if storage capacity allows.

Preferably, when a table is found to need to be subjected to partition dumping, a proper time needs to be selected for the partition dumping, so that the normal production activity is not influenced or the influence on the production activity is reduced as much as possible.
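The first rule above (consider a dump when a table's most strongly connected partition is no longer the one it is stored in) can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout, the function name dump_candidates and the sample tables are hypothetical and do not come from the patent, and the choice to move a table only when another partition is strictly stronger than the current one is an assumption of this sketch.

```python
def dump_candidates(table_relation):
    """Return (table, current_partition, strongest_partition) for every table
    whose strongest partition is strictly stronger than its current one."""
    candidates = []
    for table, row in table_relation.items():
        weights = row["weights"]
        if not weights:
            continue  # no connection history yet, nothing to decide
        # Sorting first makes ties resolve to the alphabetically first name.
        best = max(sorted(weights), key=weights.get)
        current = row["current_partition"]
        if weights[best] > weights.get(current, 0):
            candidates.append((table, current, best))
    return candidates


# Hypothetical snapshot of TABLE_RELATION (values are illustrative only).
snapshot = {
    "orders":    {"current_partition": "A", "weights": {"A": 2, "B": 9}},
    "customers": {"current_partition": "B", "weights": {"A": 1, "B": 7}},
}
to_dump = dump_candidates(snapshot)
```

Here only "orders" is reported, because its weight toward partition B exceeds its weight toward its current partition A. The function only lists candidates; when the move is actually carried out is a separate decision.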

Compared with the prior art, the method for storing the data in the distributed database in the partitioned mode has the following beneficial effects that:

1. The method introduces a structure, TABLE_RELATION, to record the user's query history, which is convenient and effective.

2. Taking connection frequency as the basis for table dumping, the method further analyzes the connection graph between tables and stores all tables of a strongly connected subgraph in the same partition or node; a few key intermediate nodes in the graph are stored redundantly, which preserves query efficiency and also protects the safety and reliability of key information.

3. Through partition dumping of tables, many queries that originally had to be performed across partitions or nodes can be completed within a single partition or node, which improves query efficiency and reduces query time.

4. The method does not need to substantially change the existing database system, and is convenient to implement and deploy.

Drawings

In order to more clearly describe the working principle of the method for partitioned storage of data in a distributed database according to the present invention, a schematic diagram is attached for further explanation.

FIG. 1 is a connection relation diagram of tables according to a first embodiment of the present invention;

FIG. 2 is a connection diagram of 7 tables in a database according to an embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to fig. 1 and 2 in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention relates to a method for partitioned storage of data in a distributed database. The method measures the strength of the relation between tables based on the frequency of table-to-table connection, and then introduces an undirected-graph processing method to further partition the tables.

With reference to fig. 1, a structure is introduced to record the connection between tables and partitions. This connection has two main aspects: the connection frequency between a table and each partition, recorded in TABLE_RELATION, and the connection relation graph between tables, in which all tables of a strongly connected subgraph are stored in the same partition.

In the connection relation graph, a node represents a table, and an edge indicates whether two tables are connected.

Wherein for a table that satisfies both aspects, the latter is prioritized.

The connection strength between tables and partitions is used as the index: if the partition with which a table is most strongly connected changes, a partition dump of that table is considered.

For the few tables that act as intermediate nodes between several strongly connected subgraphs, redundant storage on different partitions should be considered if storage capacity allows.
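A minimal Python sketch of how such tables might be identified is given below. The patent does not define "intermediate node" precisely; this sketch assumes it means a table that belongs to more than one tightly connected group. The function name redundancy_candidates and the sample table names are hypothetical.

```python
from collections import Counter


def redundancy_candidates(groups):
    """Return, sorted by name, the tables that appear in more than one group."""
    membership = Counter(table for group in groups for table in set(group))
    return sorted(table for table, count in membership.items() if count > 1)


# Hypothetical groups: "dim_date" is shared by two tightly connected groups.
groups = [
    {"orders", "order_items", "dim_date"},
    {"shipments", "carriers", "dim_date"},
]
shared = redundancy_candidates(groups)
```

Under this reading, "dim_date" would be the table to copy onto both partitions, so that each group can be queried locally. Whether the extra copies fit within the available storage still has to be checked separately.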

Using the data in the TABLE_RELATION table and the graph formed by the connection relations between tables, the data stored in the database can be moved between partitions. However, the database has to keep serving external requests, so the partitions (nodes) of tables cannot be updated at arbitrary times. A period of low traffic, such as 12 a.m. every day, can therefore be chosen for dumping tables between partitions (nodes).
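One simple way to pick such a period, sketched here in Python, is to take the hour of the day with the fewest requests. The function name quietest_hour and the hourly figures are hypothetical; the patent only states that a low-traffic period should be chosen and does not prescribe how to find it.

```python
def quietest_hour(requests_per_hour):
    """Return the hour (0-23) with the fewest requests; the earliest hour wins ties."""
    if len(requests_per_hour) != 24:
        raise ValueError("expected 24 hourly counts")
    return min(range(24), key=lambda hour: requests_per_hour[hour])


# Hypothetical request counts for hours 0..23, lowest at midnight.
load = [40, 55, 60, 80, 120, 300, 700, 900, 1200, 1500, 1600, 1550,
        1500, 1450, 1400, 1300, 1200, 1100, 900, 700, 500, 300, 150, 80]
dump_hour = quietest_hour(load)
```

With these illustrative figures the selected hour is 0, which corresponds to the 12 a.m. example in the text.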

Example one

Assume that the database has three partitions: A, B and C. The current database contains 8 tables: t1, t2, t3, t4, t5, t6, t7 and t8. Tables t1 and t2 are located in partition A; t3, t4 and t8 are located in partition B; and t5, t6 and t7 are located in partition C.

The general structure of TABLE_RELATION is as follows:

CREATE TABLE table_relation (
    table_name TEXT,
    current_partition TEXT,
    weight_with_partA BIGINT,
    weight_with_partB BIGINT,
    weight_with_partC BIGINT
);

Consider the following statements:

SELECT * FROM t1,t2;

SELECT * FROM t1,t3;

SELECT * FROM t2,t3;

SELECT * FROM t3,t4;

SELECT * FROM t4,t5;

SELECT * FROM t4,t6;

SELECT * FROM t5,t6;

SELECT * FROM t4,t7;

SELECT * FROM t8,t7;

Each time a table is connected with another table, its weight for the partition holding that other table is increased by 1. After these statements have run, the data in TABLE_RELATION is updated accordingly.

[Table: updated contents of TABLE_RELATION; not reproduced in this text]

The connection diagram of the corresponding tables is shown in fig. 2.
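The +1 update rule can be sketched in Python as follows. The partition assignment and the nine table pairs are taken from this example; the names CURRENT_PARTITION, JOINS and record_joins are hypothetical. The counts produced follow only from the nine statements listed and the rule as worded here. The patent's own TABLE_RELATION data may reflect further history, so these counts should not be read as the patent's figures.

```python
from collections import defaultdict

# Current partition of each table, as stated in Example one.
CURRENT_PARTITION = {
    "t1": "A", "t2": "A",
    "t3": "B", "t4": "B", "t8": "B",
    "t5": "C", "t6": "C", "t7": "C",
}

# Table pairs connected by the nine SELECT statements of the example.
JOINS = [
    ("t1", "t2"), ("t1", "t3"), ("t2", "t3"),
    ("t3", "t4"), ("t4", "t5"), ("t4", "t6"),
    ("t5", "t6"), ("t4", "t7"), ("t8", "t7"),
]


def record_joins(joins, current_partition):
    """Return {table: {partition: weight}} after applying the +1 rule."""
    weights = {table: defaultdict(int) for table in current_partition}
    for left, right in joins:
        # Each table gains +1 toward the partition holding the other table.
        weights[left][current_partition[right]] += 1
        weights[right][current_partition[left]] += 1
    return {table: dict(row) for table, row in weights.items()}


table_relation = record_joins(JOINS, CURRENT_PARTITION)
```

For instance, t4 is connected once with t3 (partition B) and once each with t5, t6 and t7 (partition C), so its row under this sketch is {"B": 1, "C": 3}.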

According to the data in TABLE_RELATION, the partitioning obtained by the frequency method alone would be A (t1, t2), B (t3, t7, t8), C (t4, t5, t6). According to the graph, however, t1, t2 and t3 should be placed in one partition, and t4, t5 and t6 should be placed in one partition. Combining the two approaches, the final table partitioning is A (t1, t2, t3), B (t7, t8), C (t4, t5, t6).
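The combination step can be sketched in Python under two assumptions that the patent does not state. First, "strongly connected subgraph" is read here as a maximal clique of at least three tables in the undirected connection graph, found with a basic Bron-Kerbosch enumeration. Second, each group is placed in the partition that most of its members already receive from the frequency method. The names EDGES, BY_FREQUENCY, maximal_cliques and combine are hypothetical; the edges and the frequency-only placement are the ones given in this example.

```python
from collections import Counter

# Undirected connection graph of Example one: one edge per connected pair.
EDGES = [
    ("t1", "t2"), ("t1", "t3"), ("t2", "t3"),
    ("t3", "t4"), ("t4", "t5"), ("t4", "t6"),
    ("t5", "t6"), ("t4", "t7"), ("t8", "t7"),
]

# Frequency-only placement quoted in the text above.
BY_FREQUENCY = {
    "t1": "A", "t2": "A", "t3": "B", "t7": "B",
    "t8": "B", "t4": "C", "t5": "C", "t6": "C",
}


def maximal_cliques(edges, min_size=3):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    found = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def combine(by_frequency, groups):
    """Graph groups take priority; other tables keep their frequency result."""
    placement = dict(by_frequency)
    for group in groups:
        # Assumption: the group goes where most of its members already point.
        target = Counter(by_frequency[t] for t in group).most_common(1)[0][0]
        for table in group:
            placement[table] = target
    return placement


final = combine(BY_FREQUENCY, maximal_cliques(EDGES))
```

Under these assumptions the sketch finds the groups {t1, t2, t3} and {t4, t5, t6} and yields A (t1, t2, t3), B (t7, t8), C (t4, t5, t6), the final partitioning stated above. Overlapping groups are not handled; a later group would simply override an earlier one.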

Once the partitioning result no longer changes, the data of the database is dumped to the new partitions at a suitable time. In subsequent queries, cross-partition lookups then occur less often.

It is to be understood that the phraseology and terminology employed herein are for the purpose of description, and that the present method is not to be regarded as limited to such phraseology and terminology. The use of such terms and expressions is not intended to exclude any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications may be made within the scope of the claims. Other modifications, variations, and alternatives are also possible. Accordingly, the claims should be regarded as covering all such equivalents.

Also, it should be noted that while the method has been described with reference to the specific embodiments, those skilled in the art will recognize that the above embodiments are merely illustrative of the method and that various changes or substitutions of equivalents may be made without departing from the spirit of the method, and therefore, it is intended that all changes and modifications to the above embodiments be within the scope of the appended claims.

Claims

1. A method for storing data in a distributed database in a partition mode is characterized in that the method measures the strength of the relation between tables based on the frequency of table-to-table connection; then introducing a processing method of the undirected graph, and further partitioning the table.

2. The method according to claim 1, wherein a structure is introduced to record the connection and association between tables and partitions.

3. The method of claim 2, wherein the data is stored in the distributed database in a partitioned manner,

4. The method according to claim 3, wherein in the connection relation graph, a node represents a table, and an edge indicates whether two tables are connected.

5. The method according to claim 3, wherein for a table satisfying both aspects, the latter aspect takes priority.

6. The method of claim 3, wherein the data is stored in the distributed database in a partitioned manner,

7. The method according to claim 3, characterized in that, for the few tables acting as intermediate nodes between several strongly connected subgraphs, redundant storage on different partitions should be considered if storage allows.

8. The method as claimed in claim 4, wherein when it is found that there is a table to be subjected to partition dumping, it is required to select an appropriate time for the table to be subjected to partition dumping, so as to avoid affecting normal production activities or minimize the impact on the production activities.