CN107451229B - Database query method and device

Info

Publication number: CN107451229B
Application number: CN201710605395.1A
Authority: CN
Inventors: 孙乔; 付兰梅; 邓卜侨; 孙雷; 王志强; 马慧远; 刘炜; 崔伟
Original assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; State Grid Beijing Electric Power Co Ltd; Beijing China Power Information Technology Co Ltd; Beijing Fibrlink Communications Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; State Grid Beijing Electric Power Co Ltd; Beijing China Power Information Technology Co Ltd; Beijing Zhongdian Feihua Communication Co Ltd
Priority date: 2017-07-24
Filing date: 2017-07-24
Publication date: 2020-04-14
Anticipated expiration: 2037-07-24
Also published as: CN107451229A

Abstract

The invention discloses a database query method and a database query device, which comprise the following steps: constructing an entity cluster according to entity information in the data, establishing a mapping relation between the entity cluster and a storage node, and storing the data to the storage node according to the mapping relation; reading data which belong to the same entity cluster but are positioned on different storage nodes into a memory of one storage node according to the query instruction; migrating the dimension data among the storage nodes according to the entity information in the entity cluster; the query instruction is executed. The optimized database construction and database query method is realized, and the database query performance is improved.

Description

Database query method and device

Technical Field

The present invention relates to database technologies, and in particular, to a database query method and apparatus.

Background

A log is a record file, or a collection of files, that records system operation events; logs can be divided into event logs and message logs. Logs play an important role in processing historical data, tracing and diagnosing problems, and understanding system activity. Log data contains valuable information, and analyzing it promptly and effectively can bring considerable commercial value. For example, analyzing server log data can reveal the cause of a failure. Analyzing the log data of an e-commerce website can reveal recent changes in a user's browsing and purchasing behavior, which in turn supports personalized recommendations for that user.

In existing query optimization schemes, data storage is partitioned only by a simple hash sharding or range sharding strategy, without considering the entity information contained in the data. As a result, queries directed at entity information perform poorly and are difficult to optimize, so user requirements cannot be met. In addition, query optimization is generic, based on rules or cost models, and is not thorough, which limits the query performance of the database.

Therefore, how to provide a database query method and apparatus that optimize and improve query performance becomes a technical problem to be solved urgently.

Disclosure of Invention

In view of this, the present invention provides a database query method and apparatus, which implement an optimized database construction and database query method and improve database query performance.

Based on the above object, the present invention provides a database query method, which includes the steps of:

constructing an entity cluster according to entity information in the data, establishing a mapping relation between the entity cluster and a storage node, and storing the data to the storage node according to the mapping relation;

reading data which belong to the same entity cluster but are positioned on different storage nodes into a memory of one storage node according to the query instruction;

migrating the dimension data among the storage nodes according to the entity information in the entity cluster;

the query instruction is executed.

The method of the present invention, wherein,

the entity information includes one or more items of entity information.

The method of the present invention, wherein,

the step of establishing a mapping relation between the entity cluster and the storage node, and storing the data to the storage node according to the mapping relation, further comprises: storing the data of one entity cluster in a plurality of storage nodes, or storing the data of a plurality of entity clusters in one storage node.

The method of the present invention, wherein,

the step of reading data belonging to the same entity cluster but located on different storage nodes into the memory of one storage node further comprises: reading the data which belongs to the same entity cluster but is located on different storage nodes into the memory of the storage node that stores the most data of the entity cluster.

The method of the present invention, wherein,

the dimension data is one item of entity information in the entity cluster.

The present invention further provides a database query device based on the above object, wherein the database query device comprises:

the database construction module is used for constructing an entity cluster according to entity information in the data, establishing a mapping relation between the entity cluster and the storage node, and storing the data to the storage node according to the mapping relation;

the storage node selection module is used for reading data which belong to the same entity cluster but are positioned on different storage nodes into a memory of one storage node according to the query instruction;

and the optimization query module is used for migrating the dimension data among the storage nodes according to the entity information in the entity cluster and executing a query instruction.

The apparatus of the present invention, wherein,

the database construction module is further configured to construct an entity cluster according to one or more items of entity information in the data, establish a mapping relationship between the entity cluster and the storage node, and store the data to the storage node according to the mapping relationship.

The apparatus of the present invention, wherein,

the database construction module is further configured to construct entity clusters according to entity information in the data, and store data of one entity cluster in a plurality of storage nodes, or store data of a plurality of entity clusters in one storage node.

The apparatus of the present invention, wherein,

the storage node selection module is further configured to read data belonging to the same entity cluster but located on different storage nodes into a memory of one storage node storing the most data of the entity cluster according to the query instruction.

The apparatus of the present invention, wherein,

the optimization query module is further configured to migrate one item of entity information in the entity cluster among the storage nodes according to the entity information in the entity cluster, and to execute the query instruction.

As can be seen from the above, in the database query method and apparatus provided by the present invention, an entity cluster is constructed according to entity information in the data, a mapping relation is established between the entity cluster and a storage node, and the data is stored to the storage node according to the mapping relation; data that belongs to the same entity cluster but is located on different storage nodes is read into the memory of one storage node according to the query instruction; the dimension data is then migrated among the storage nodes according to the entity information in the entity cluster; and the query instruction is executed. Database construction and query optimization based on entity clusters are thereby realized, and query performance is improved. Further, when data is loaded into the memory of the storage nodes for subsequent processing, the characteristics of entity-oriented queries are taken into account: data is loaded from each storage node into its local memory as much as possible, which reduces network data exchange during loading. In addition, when a join is performed, data is distributed based on entity cluster information, in order to reduce data exchange over the network and speed up the join.

Drawings

FIG. 1 is a flow chart of a method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two entities, or two parameters, that share a name but are not the same. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

The basic technical principle of the invention is as follows. First, entities are divided into a series of entity clusters according to a certain mapping relation, and data is then distributed to the entity clusters according to entity id. Each entity cluster is mapped to a storage node, so data of different entity clusters is loaded onto different storage nodes; after a certain time, the mapping from entity clusters to storage nodes can be adjusted to balance the load. To process data at high speed, the solution operates on in-memory data sets. Data on disk must first be loaded into an in-memory data set; by combining the entity-cluster-based partitioning mechanism with the characteristics of entity-oriented queries, data is loaded into the in-memory data set from the local node as much as possible, which reduces network data exchange overhead. Joins between tables are costly operations. When a join between tables is performed, data is distributed based on entity cluster information in preparation for the downstream join operation, which reduces data exchange over the network and speeds up the join. In particular, when a large table is joined with a small table, the table containing the large amount of data is kept in place on each storage node as much as possible, and the table with the small amount of data is exchanged as needed.

Example one

Referring to FIG. 1, a flowchart of a database query method according to an embodiment of the present invention, the method specifically includes the following steps:

Step 101: constructing an entity cluster according to entity information in the data, establishing a mapping relation between the entity cluster and a storage node, and storing the data to the storage node according to the mapping relation. In this step, the entity information includes one or more items of entity information. In addition, when the mapping relation is established and the data is stored according to it, the data of one entity cluster may be stored in a plurality of storage nodes, or the data of a plurality of entity clusters may be stored in one storage node.

For example:

Log data records event information about entities. For example, in an e-commerce web log, the entities described by a log record are users and goods, where the user is the master entity and the goods are the slave entities. The following discussion focuses on the master entity; slave entities are processed similarly. An important principle in big data processing applications is to trade space for time, that is, data can be stored in multiple copies. One possible policy is to keep 2 copies of the log partitioned by master entity and 1 copy of the log partitioned by slave entity, for a total of 3 copies. A query directed at a master entity is routed to a replica partitioned by master entity, while a query directed at a slave entity is routed to the replica partitioned by slave entity.
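The replica routing rule described above can be sketched as follows. This is a minimal illustration only: the replica names and the round-robin choice between the two master-entity copies are assumptions made for the example and are not specified in the source.

```python
import itertools

# Hypothetical replica names: two copies partitioned by the master entity
# (user) and one copy partitioned by the slave entity (goods).
REPLICAS = {
    "user": ["by_user_copy1", "by_user_copy2"],
    "goods": ["by_goods_copy1"],
}

# Assumption: alternate between copies of the same kind (round-robin).
_round_robin = {kind: itertools.cycle(copies) for kind, copies in REPLICAS.items()}


def route_query(entity_kind):
    """Pick a replica that is partitioned on the entity kind the query targets."""
    return next(_round_robin[entity_kind])
```

A query about a user is always served from a copy partitioned by user, so only the entity cluster of that user has to be scanned.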

Entities are organized into entity clusters (Entity Fibers, or fibers for short). The mapping from an entity to an entity cluster may carry some practical meaning, or it may be computed by a hash function or a range function. For example, in mobile communication applications, call records may be divided according to the call density of users in different geographic areas. Where users in an area communicate frequently, the users in that area can be divided into several entity clusters. Where users in an area communicate very little, the users in that area can be combined with users of other similar areas into a single entity cluster. By taking into account the skew in how log data is generated, the entity cluster division tries to keep the volume of log data that the loader (Loader) receives for each entity cluster balanced.
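As a rough sketch of the two generic mappings mentioned above, the following shows a hash-based and a range-based assignment of entity ids to fibers. The number of fibers, the use of CRC32 as the hash, and the range boundaries are all hypothetical values chosen for illustration.

```python
import bisect
import zlib

NUM_FIBERS = 4  # hypothetical number of entity clusters


def hash_fiber(entity_id, num_fibers=NUM_FIBERS):
    """Map an entity id (string) to a fiber index using a stable hash (CRC32)."""
    return zlib.crc32(entity_id.encode("utf-8")) % num_fibers


# Hypothetical range boundaries over numeric user ids: ids below 1000 go to
# fiber 0, ids in [1000, 5000) to fiber 1, ids in [5000, 20000) to fiber 2,
# and everything else to fiber 3.
RANGE_BOUNDS = [1000, 5000, 20000]


def range_fiber(numeric_id, bounds=RANGE_BOUNDS):
    """Map a numeric entity id to a fiber index by range lookup."""
    return bisect.bisect_right(bounds, numeric_id)
```

A range mapping makes it easy to give a densely active id range several narrow fibers and a quiet range one wide fiber, which is one way to approach the balancing goal described above.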

After the entities are divided into entity clusters, log information about the entities can be partitioned according to the entity cluster information. For example, user 1 and user 2 belong to fiber1, user 3 and user 4 belong to fiber2, and so on. Based on this information, the log information of user 1 and user 2 is placed in Partition1, and the log information of user 3 and user 4 is placed in Partition2. The log data of each partition is organized in data blocks. For example, data blocks block11 and block12 of Partition1 both contain log information about user 1 and user 2, but cover different time periods. Where no confusion arises, the term entity cluster is used to denote three concepts: the entity cluster itself, the data partition corresponding to the entity cluster, and the series of data blocks of that data partition. As shown in Table 1:

Table 1: Mapping from log data to entity clusters to storage nodes

Mapping entities to entity clusters speeds up queries about an entity (Entity Centric Query), because only the entity cluster containing that entity needs to be scanned. Queries on log data generally carry a time range condition, so the scan can be narrowed further to a few data blocks. Of course, when a query specifies only a time range and no entity, more data blocks must be scanned.
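The narrowing of the scan range can be illustrated with a small block-pruning sketch. The metadata layout (BlockMeta), the time windows, and the block names block21 and block22 are assumptions for the example; only block11, block12, and the user-to-fiber assignment come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockMeta:
    block_id: str
    fiber: str
    start: int  # inclusive start of the covered time window
    end: int    # exclusive end of the covered time window


ENTITY_TO_FIBER = {
    "user1": "fiber1", "user2": "fiber1",
    "user3": "fiber2", "user4": "fiber2",
}

BLOCKS = [
    BlockMeta("block11", "fiber1", 0, 100),
    BlockMeta("block12", "fiber1", 100, 200),
    BlockMeta("block21", "fiber2", 0, 100),
    BlockMeta("block22", "fiber2", 100, 200),
]


def blocks_to_scan(blocks, t_from, t_to, entity=None):
    """Return ids of blocks overlapping [t_from, t_to); filter by fiber if an entity is given."""
    fiber = ENTITY_TO_FIBER[entity] if entity is not None else None
    return [
        b.block_id
        for b in blocks
        if b.start < t_to and t_from < b.end
        and (fiber is None or b.fiber == fiber)
    ]
```

With these sample values, a query for user1 in the window [120, 150) touches only block12, while the same window without an entity touches one block per fiber.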

In this step, the mapping from entity clusters to storage nodes is adjusted once every interval, so that during one interval the primary copy of each Fiber is stored on a certain Data Node, and after the interval it is stored on another Data Node. This mapping adjustment is called Mapping Shuffle. Adjusting the mapping regularly avoids Data Nodes that are especially busy and balances the load of data loading.
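One possible Mapping Shuffle policy is a simple rotation, sketched below. The source does not say how the new mapping is chosen, so the rotation rule, the node names, and the grouping of two fibers per node are assumptions for illustration.

```python
NODES = ["n1", "n2", "n3"]
FIBERS = ["fiber1", "fiber2", "fiber3", "fiber4", "fiber5", "fiber6"]


def mapping_for_period(period, fibers=FIBERS, nodes=NODES):
    """Return the fiber -> primary node mapping for a given period number.

    Fibers are grouped evenly across nodes, and each group moves to the
    next node whenever the period number increases (hypothetical policy).
    """
    per_node = -(-len(fibers) // len(nodes))  # ceiling division
    return {
        fiber: nodes[(index // per_node + period) % len(nodes)]
        for index, fiber in enumerate(fibers)
    }
```

In period 0 this places fiber1 and fiber2 on n1; in period 1 their primary copy moves to n2, so blocks of the same fiber end up spread over more than one node over time.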

Step 102: reading data which belong to the same entity cluster but are positioned on different storage nodes into a memory of one storage node according to the query instruction; in this step, data belonging to the same entity cluster but located on different storage nodes is read into the memory of one storage node storing the most data of the entity cluster.

For example:

To query data on the hard disk, the data must first be brought into memory. The query processing engine performs subsequent operations on the in-memory data set. The in-memory data set is composed of partitions, each of which resides in the memory of one storage node. If data on the hard disk is mapped to the in-memory data set simply by a hash operation, relatively large network transmission overhead may result. As shown in Table 2:

Table 2: Distribution of the data blocks relevant to the query (data blocks on the three storage nodes n1, n2, and n3)

For example, according to the mapping from entity clusters to storage nodes, during a certain time period the data loader (Loader) loads a number of data blocks onto the three storage nodes n1, n2, and n3. On storage node n1, there are 10 data blocks containing data of fiber1 and fiber2, and 2 data blocks containing data of fiber3 and fiber4. On storage node n2, there are 13 data blocks containing data of fiber3 and fiber4, and 2 data blocks containing data of fiber5 and fiber6. On storage node n3, there are 11 data blocks containing data of fiber5 and fiber6, and 2 data blocks containing data of fiber1 and fiber2. This distribution arises because the mapping from entity clusters to storage nodes is adjusted regularly; otherwise, storage node n1 would hold only the data blocks of fiber1 and fiber2.

A partitioner (Partitioner) based on entity clusters partitions the data as it is loaded from the hard disk into memory. The Partitioner first uses meta information to find the list of data blocks containing data relevant to the query, and then partitions the data according to statistics, similar to Table 2, computed from the meta information. For example, most data blocks of fiber1 are on storage node n1 and only a few are on storage node n3, so the Partitioner assigns fiber1 to the in-memory data set of storage node n1; the only data that must be exchanged over the network is the small part of fiber1 that resides on storage node n3. By the same principle, it also assigns the data of fiber2 to the in-memory data set of storage node n1. If the data were partitioned without this information, fiber1 might be assigned to the in-memory data set partition on storage node n2; in that case, the fiber1 data on both storage nodes n1 and n3 would have to be exchanged over the network and gathered on storage node n2, which would clearly cause a large network data exchange overhead. In this way, the characteristics of entity-oriented queries are taken into account, data is loaded from each storage node into local memory as much as possible, and network data exchange during loading is reduced.
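The placement rule of the Partitioner can be sketched with the block counts from the example above. The function names and the dictionary layout are hypothetical; the counts (10, 2, 13, 2, 11, 2) are taken from the text, with fibers grouped in pairs as the text does.

```python
from collections import defaultdict

# (storage node, fiber group) -> number of data blocks, from the example.
BLOCK_COUNTS = {
    ("n1", "fiber1+fiber2"): 10,
    ("n1", "fiber3+fiber4"): 2,
    ("n2", "fiber3+fiber4"): 13,
    ("n2", "fiber5+fiber6"): 2,
    ("n3", "fiber5+fiber6"): 11,
    ("n3", "fiber1+fiber2"): 2,
}


def choose_targets(block_counts):
    """For each fiber group, pick the node that already holds most of its blocks."""
    per_group = defaultdict(dict)
    for (node, group), count in block_counts.items():
        per_group[group][node] = count
    return {group: max(nodes, key=nodes.get) for group, nodes in per_group.items()}


def blocks_moved(block_counts, targets):
    """Count the blocks that must cross the network for a given placement."""
    return sum(
        count
        for (node, group), count in block_counts.items()
        if targets[group] != node
    )
```

With these counts, choose_targets keeps fiber1+fiber2 on n1, fiber3+fiber4 on n2, and fiber5+fiber6 on n3, so only 6 of the 40 blocks move. Sending fiber1+fiber2 to n2 instead would move all 12 of its blocks.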

Step 103: migrating the dimension data among the storage nodes according to the entity information in the entity cluster; the dimension data in this step is one item of entity information in the entity cluster.

There are three basic strategies for joining tables: Broadcast Join, Hash Join, and Sort Merge Join. Broadcast Join is suitable for joining a large table with a small table; when the data volumes of the tables on the two sides of the join are comparable, Hash Join or Sort Merge Join is generally used.
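A common way to pick between these strategies is a size threshold on the smaller table. The sketch below is illustrative only: the threshold value, the function names, and the row-dictionary representation are assumptions, not part of the described system.

```python
BROADCAST_THRESHOLD_ROWS = 10_000  # hypothetical cutoff for a "small" table


def choose_join_strategy(left_rows, right_rows, threshold=BROADCAST_THRESHOLD_ROWS):
    """Pick a strategy from the row counts of the two join inputs."""
    if min(left_rows, right_rows) <= threshold:
        return "broadcast"
    return "hash_or_sort_merge"


def broadcast_join(small_rows, large_partitions, key_field):
    """Join a small table with a large table that stays partitioned by node.

    The small table is copied to every node (modelled here by sharing one
    lookup index), so no rows of the large table cross the network.
    """
    index = {}
    for row in small_rows:
        index.setdefault(row[key_field], []).append(row)
    return {
        node: [
            {**small, **large}
            for large in partition
            for small in index.get(large[key_field], [])
        ]
        for node, partition in large_partitions.items()
    }
```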

During a join, data must be partitioned on its way from the upstream operators, such as scan and filter, to the join operator. Taking Hash Join as an example: in general, if no information about the distribution of the data in memory is available, a simple hash algorithm is used to distribute the data to the downstream join operation. The tuples (that is, records) of both tables participating in the join are hashed, tuples with the same hash value arrive at the same target node, and the join is then performed locally on that node.
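The generic hash distribution described above can be sketched as follows. This models nodes as integer bucket numbers inside one process; the CRC32 hash, the function names, and the row-dictionary representation are assumptions for the example.

```python
import zlib
from collections import defaultdict


def target_node(key, num_nodes):
    """Hash a join key to a node index with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def hash_distribute(rows, key_field, num_nodes):
    """Send every row to the node chosen by the hash of its join key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[target_node(row[key_field], num_nodes)].append(row)
    return buckets


def local_hash_join(left_rows, right_rows, key_field):
    """Build a hash index on the left rows and probe it with the right rows."""
    index = defaultdict(list)
    for left in left_rows:
        index[left[key_field]].append(left)
    return [
        {**left, **right}
        for right in right_rows
        for left in index.get(right[key_field], [])
    ]


def distributed_hash_join(left, right, key_field, num_nodes):
    """Distribute both inputs by key hash, then join locally on each node."""
    left_buckets = hash_distribute(left, key_field, num_nodes)
    right_buckets = hash_distribute(right, key_field, num_nodes)
    result = []
    for node in range(num_nodes):
        result.extend(
            local_hash_join(left_buckets.get(node, []), right_buckets.get(node, []), key_field)
        )
    return result
```

Because the target node depends only on the hash of the key, this scheme ignores where the rows already reside, so most rows of both tables typically have to be moved.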

In this system, the meta information gives the data volume of each entity cluster on the hard disk of each node, as well as the data volume occupied once the data is loaded into memory, which makes it possible to avoid unnecessary data exchange over the network. Because the dimension table (for example, the user table) and the log data table are both clustered by entity id into entity clusters, the join can be performed locally for each entity cluster without network data exchange. As shown in Table 3:

Table 3: Entity clusters contained in the in-memory data set of each storage node (the user table and the log data table are joined locally, with no data exchange between storage nodes over the network)

Based on the number of data blocks of each entity cluster previously loaded from the hard disk into memory, there are 12 data blocks on storage node n1, containing the data of fiber1 and fiber2; 15 data blocks on storage node n2, containing the data of fiber3 and fiber4; and 13 data blocks on storage node n3, containing the data of fiber5 and fiber6. Thus, when data is distributed to the join operator, fiber1 and fiber2 are kept on node n1, fiber3 and fiber4 on node n2, and fiber5 and fiber6 on node n3, which minimizes the amount of data that must be exchanged over the network. The dimension table (the user table) is handled similarly when it has already been partitioned by entity id. If the user table has not been partitioned by entity cluster in advance, as the log table has, it can be repartitioned with the entity-cluster-based partitioning mechanism, which requires some data exchange. Because dimension tables are typically small and the bulk of the data is log data, the approach remains effective.
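The repartitioning of a dimension table toward the nodes that already hold the log data can be sketched as below. The fiber-to-node placement follows the example in the text; the function name, the row layout, and the sample user ids are hypothetical.

```python
# Node on which each fiber's log data is kept for the join (from the example).
FIBER_HOME = {
    "fiber1": "n1", "fiber2": "n1",
    "fiber3": "n2", "fiber4": "n2",
    "fiber5": "n3", "fiber6": "n3",
}


def place_dimension_rows(rows, entity_to_fiber, fiber_home=FIBER_HOME):
    """Move dimension rows to the node that holds their entity cluster.

    rows is a list of (current_node, row) pairs. Returns the per-node
    placement and the number of rows that had to cross the network.
    """
    placement = {}
    moved = 0
    for current_node, row in rows:
        home = fiber_home[entity_to_fiber[row["uid"]]]
        placement.setdefault(home, []).append(row)
        if home != current_node:
            moved += 1
    return placement, moved
```

Only the small dimension rows travel; the log data blocks stay where they were loaded, and each node can then join its fibers locally.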

Step 104: the query instruction is executed.

By the database query method, database optimization construction and optimization query based on the entity cluster are realized, and query performance is improved.

Example two

Referring to FIG. 2, a schematic structural diagram of an apparatus according to an embodiment of the present invention, the apparatus includes a database construction module, a storage node selection module, and an optimization query module.

The database construction module is used for constructing an entity cluster according to the entity information in the data, establishing a mapping relation between the entity cluster and the storage node, and storing the data to the storage node according to the mapping relation. For example, the database construction module constructs an entity cluster according to one or more items of entity information in the data, establishes a mapping relationship between the entity cluster and the storage node, and stores the data to the storage node according to the mapping relationship. Data of one entity cluster can be stored in a plurality of storage nodes, or data of a plurality of entity clusters can be stored in one storage node.

The storage node selection module is used for reading the data which belongs to the same entity cluster but is located on different storage nodes into the memory of one storage node according to the query instruction. For example, data belonging to the same entity cluster but located on different storage nodes may be read into the memory of the storage node storing the most data of the entity cluster.

The optimization query module is used for migrating the dimension data among the storage nodes according to the entity information in the entity cluster and executing a query instruction. The optimization query module can migrate one item of entity information in the entity cluster among the storage nodes according to the entity information in the entity cluster, and execute the query instruction.

The system described in this embodiment can bring the same technical effects as the method described in the first embodiment, and details are not described here.

Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, also features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.

In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.

While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic ram (dram)) may use the discussed embodiments.

The embodiments of the invention are intended to embrace all such alternatives, modifications and variances that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A database query method, comprising the steps of:

constructing an entity cluster according to entity information in the data, establishing a mapping relation between the entity cluster and a storage node, and storing the data to the storage node according to the mapping relation; the mapping relation from the entity cluster to the storage nodes is adjusted once at intervals, so that the main copy of each entity cluster is mainly stored on one storage node during the interval, and is stored on the other storage node after the interval;

reading data which belong to the same entity cluster but are positioned on different storage nodes into a memory of one storage node according to the query instruction; reading data into memory includes: searching a data block list containing data relevant to query by using the meta information, dividing the data according to the relation data among the storage nodes, the data blocks and the entity clusters calculated from the meta information, and loading the data from each storage node to a local memory as much as possible;

migrating the dimension data among the storage nodes according to the entity information in the entity cluster; acquiring the data volume of the entity cluster on each node hard disk and the data volume occupied by loading the data into a memory according to the meta information, clustering the data according to the id of the entity according to the dimensional data, and performing connection operation on each entity cluster locally;

the query instruction is executed.

2. The method of claim 1, wherein:

the entity information includes: one or more entity information.

3. The method of claim 1, wherein:

4. The method of claim 1, wherein:

5. The method of claim 1, wherein:

the dimension data is one item of entity information in the entity cluster.

6. A database query device, comprising:

the database construction module is used for constructing an entity cluster according to entity information in the data, establishing a mapping relation between the entity cluster and the storage node, and storing the data to the storage node according to the mapping relation; the mapping relation from the entity cluster to the storage nodes is adjusted once at intervals, so that the main copy of each entity cluster is mainly stored on one storage node during the interval, and is stored on the other storage node after the interval;

the storage node selection module is used for reading data which belong to the same entity cluster but are positioned on different storage nodes into a memory of one storage node according to the query instruction; reading data into memory includes: searching a data block list containing data relevant to query by using the meta information, dividing the data according to the relation data among the storage nodes, the data blocks and the entity clusters calculated from the meta information, and loading the data from each storage node to a local memory as much as possible;

the optimization query module is used for migrating the dimension data among the storage nodes according to the entity information in the entity cluster, acquiring the data volume of the entity cluster on the hard disk of each node and the data volume occupied by loading the data into the memory according to the meta information, clustering the data according to the id of the entity according to the dimension data, and performing connection operation on each entity cluster locally; the query instruction is executed.

7. The apparatus of claim 6, wherein:

8. The apparatus of claim 6, wherein:

9. The apparatus of claim 6, wherein:

10. The apparatus of claim 6, wherein:

and the optimization query module is further used for migrating one item of entity information in the entity cluster among the storage nodes according to the entity information in the entity cluster and executing a query instruction.