CN111813761A - Database management method and device and computer storage medium - Google Patents
- Publication number
- CN111813761A (application CN202010584234.0A)
- Authority
- CN
- China
- Prior art keywords
- tuple
- database management
- computing
- node
- computing node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F — Electric digital data processing
- G06F16/214 — Database migration support
- G06F16/2471 — Distributed queries
- G06F16/27 — Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Abstract
The application discloses a database management method, an apparatus, and a computer storage medium. The database management method comprises the following steps: acquiring query request information and generating an execution plan according to the query request information; distributing the execution plan to a plurality of computing nodes so that the computing nodes execute the query request according to the execution plan and record the number of scanned tuples; acquiring the query request results and tuple quantities of the plurality of computing nodes; comparing the tuple quantities of the plurality of computing nodes to obtain a first computing node corresponding to the maximum tuple quantity and a second computing node corresponding to the minimum tuple quantity; and when the difference between the maximum and minimum tuple quantities reaches a preset condition, migrating the tuple data of the first computing node to the second computing node. By this method, the data skew problem can be discovered dynamically, data migration carried out, and optimal data distribution achieved.
Description
Technical Field
The present application relates to the field of database management technologies, and in particular, to a database management method, apparatus, and computer storage medium.
Background
At present, as large volumes of business data and the varied data accumulated by social networks keep growing, how efficiently data is stored has a great influence on how quickly records meeting given conditions can be retrieved from mass data storage.
In a conventional distributed database, common data distribution modes are hash distribution, range distribution and random distribution. In the presence of hot-spot data, both hash distribution and range distribution can lead to a data skew problem. The effects of data skew include: some computing nodes hold a larger data volume while others hold less, so queries take longer; and because queries finish quickly on some nodes, their resources are released early and remain unutilized.
Disclosure of Invention
The application provides a database management method, a database management apparatus and a computer storage medium, which aim to solve the problem that data skew easily occurs in the prior art.
In order to solve the technical problem, the application adopts a technical scheme that: there is provided a database management method, including:
acquiring query request information and generating an execution plan according to the query request information;
distributing the execution plan to a plurality of computing nodes so that the computing nodes execute query requests according to the execution plan and record the number of scanned tuples;
obtaining the query request results and the tuple quantity of the plurality of computing nodes;
comparing the tuple quantity of the plurality of computing nodes to obtain a first computing node corresponding to the maximum tuple quantity and a second computing node corresponding to the minimum tuple quantity;
and when the difference between the maximum tuple quantity and the minimum tuple quantity reaches a preset condition, transferring the tuple data of the first computing node to the second computing node.
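The comparison and migration-trigger steps above can be sketched in Python as follows. This is a minimal illustration operating on per-node scanned-tuple counts; the node names and the ratio threshold of 2 are assumptions for illustration, not part of the claims:

```python
def find_skew(tuple_counts, ratio=2.0):
    """Compare per-node scanned-tuple counts (dict: node -> count) and return
    (first_node, second_node) when the imbalance reaches the preset condition,
    else None. first_node holds the maximum count, second_node the minimum."""
    first = max(tuple_counts, key=tuple_counts.get)
    second = min(tuple_counts, key=tuple_counts.get)
    if tuple_counts[first] >= ratio * tuple_counts[second]:
        return first, second   # tuple data should migrate from first to second
    return None

# With the counts used later in the description, segment-n holds 3x the tuples
# of segment-1, so migration from segment-n to segment-1 is triggered:
print(find_skew({"segment-1": 100000, "segment-2": 200000, "segment-n": 300000}))
```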
Wherein the step of obtaining the query request results and the tuple number of the plurality of computing nodes comprises:
acquiring each tuple record in the plurality of computing nodes and the corresponding record quantity;
sorting the tuple records of each computing node in descending order of record quantity;
and recording the tuple quantity of the plurality of computing nodes, the top M tuple records of each computing node and the corresponding record quantities into a tuple statistics table.
When the difference between the maximum tuple quantity and the minimum tuple quantity reaches a preset condition, the step of migrating the tuple data of the first computing node to the second computing node comprises the following steps:
migrating the tuple data of the first compute node to the second compute node when the number of tuples of the first compute node is at least twice the number of tuples of the second compute node.
Wherein the step of migrating the tuple data of the first computing node to the second computing node comprises:
and migrating a first tuple record of the first computing node to the second computing node, wherein the first tuple record is the tuple record with the largest number of records in the first computing node.
Wherein after the step of migrating the first tuple record of the first computing node to the second computing node, the database management method further comprises:
storing data migration information in a migration information table, wherein the data migration information includes the first tuple record and a second compute node.
Wherein, the step of executing the query request by the plurality of computing nodes according to the execution plan respectively comprises:
the coordination node searches whether a relevant record exists in the migration information table or not according to the execution plan;
and if so, directly acquiring the tuple data of the corresponding position according to the migration information table.
Wherein, the step of distributing the execution plan to a plurality of computing nodes to make the plurality of computing nodes execute the query request according to the execution plan and record the number of scanned tuples comprises:
determining whether the execution plan includes a sequential table scan operator;
if yes, the plurality of computing nodes execute the query request according to the execution plan, record the number of scanned tuples and return the query request result and the number of tuples;
if not, the plurality of computing nodes respectively execute the query request according to the execution plan and return the query request result.
In order to solve the above technical problem, another technical solution adopted by the present application is: providing another database management method, wherein the database management method is applied to a database management system, and the database management system comprises a coordination node and a plurality of computing nodes; the database management method comprises the following steps:
the coordination node acquires query request information of a client and generates an execution plan according to the query request information;
the coordinating node distributes the execution plan to the plurality of computing nodes;
the plurality of computing nodes execute the query request according to the execution plan, record the number of scanned tuples and return the query request result and the tuple number to the coordination node;
the coordination node acquires the query request result and the tuple quantity, compares the tuple quantity of the plurality of computing nodes, and acquires a first computing node corresponding to the maximum tuple quantity and a second computing node corresponding to the minimum tuple quantity;
when the difference between the maximum tuple quantity and the minimum tuple quantity reaches a preset condition, the coordinating node migrates the tuple data of the first computing node to the second computing node.
In order to solve the above technical problem, another technical solution adopted by the present application is: providing a database management apparatus comprising a processor and a memory; the memory has stored therein a computer program for execution by the processor to implement the steps of the database management method as described above.
In order to solve the above technical problem, another technical solution adopted by the present application is: there is provided a computer storage medium having a computer program stored thereon, the computer program when executed implementing the steps of the database management method described above.
In contrast to the prior art, the beneficial effects of this application are as follows: the coordinating node acquires the query request information and generates an execution plan according to the query request information; distributes the execution plan to a plurality of computing nodes so that the computing nodes execute the query request according to the execution plan and record the number of scanned tuples; acquires the query request results and tuple quantities of the plurality of computing nodes; compares the tuple quantities of the plurality of computing nodes to obtain a first computing node corresponding to the maximum tuple quantity and a second computing node corresponding to the minimum tuple quantity; and, when the difference between the maximum and minimum tuple quantities reaches a preset condition, migrates the tuple data of the first computing node to the second computing node. By this method, the data skew problem can be discovered dynamically, data migration carried out, and optimal data distribution achieved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present invention; other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a schematic flow chart diagram of a first embodiment of a database management method provided by the present application;
FIG. 2 is a schematic structural diagram of an embodiment of a database system provided herein;
FIG. 3 is a schematic flow chart diagram of a second embodiment of a database management method provided by the present application;
FIG. 4 is a schematic flow chart diagram of a third embodiment of a database management method provided by the present application;
FIG. 5 is a schematic flow chart diagram of a fourth embodiment of a database management method provided by the present application;
FIG. 6 is a schematic diagram illustrating an embodiment of a database management apparatus;
FIG. 7 is a schematic structural diagram of an embodiment of a computer storage medium provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the problem that data skew easily occurs in the prior art, the application provides a database management method. Referring to fig. 1 and fig. 2, fig. 1 is a schematic flowchart of a first embodiment of a database management method provided by the present application, and fig. 2 is a schematic structural diagram of an embodiment of a database system provided by the present application.
The database management method of the present application is applied to the database system of fig. 2, wherein the database system 100 includes one coordinating node 11 and several computing nodes 12. The coordinating node 11 is responsible for distributing data and tasks to the computing nodes 12, summarizing the computing results of the computing nodes 12, and finally returning the results to the user client. The computing nodes 12 are responsible for data storage and for actually performing the computational tasks. The database of the present application specifically refers to a massively parallel processing (MPP) database.
The primary goal of data distribution design in an MPP database is the uniform distribution of data among the various nodes of the system. Specifically, multiple processors are coordinated to process programs in parallel, where each processor has independent operating-system and memory resources. Such a system may be referred to as "shared-nothing": the tables of the database are partitioned into segments and distributed among different processing nodes, with no data shared among the processing nodes. The data is partitioned so that each processing node holds a subset of the rows of the database's tables, and each processing node processes only the rows on its own disk.
As shown in fig. 1, the database management method of the present embodiment specifically includes the following steps:
S101: Acquiring query request information, and generating an execution plan according to the query request information.
When a user client inputs a query statement, the coordination node acquires query request information about the query statement and generates an execution plan according to the query request information. The execution plan includes tasks and data distributed to the various compute nodes.
S102: distributing the execution plan to a plurality of computing nodes, so that the plurality of computing nodes execute the query request according to the execution plan respectively, and recording the number of the scanned tuples.
And the coordination node distributes the execution plan to a plurality of computing nodes according to the division condition of the execution plan. And each computing node executes the query request according to the execution task acquired by the computing node and records the number of the scanned tuples.
Specifically, each time the plan is executed, the computing node needs to return the result of executing the query request to the coordinating node for summarization. In addition, each computing node needs to determine whether a sequential table scan (tablescan) operator is included in the execution plan. If the sequential table scan operator exists, the computing node needs to further record the number of scanned tuples and return both the tuple quantity and the query request result to the coordinating node for summarization.
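The per-node behavior can be sketched as follows; the operator list and row format are hypothetical stand-ins for a real execution plan, and the filter is a placeholder for actual query execution:

```python
def execute_on_node(plan_operators, rows):
    """Hypothetical compute-node sketch: execute the plan, and only when it
    contains a sequential table scan ('tablescan') operator also return the
    number of tuples scanned, for summarization by the coordinating node."""
    result = [r for r in rows if r.get("match", True)]   # stand-in for execution
    if "tablescan" in plan_operators:
        return result, len(rows)   # query result plus scanned-tuple count
    return result, None            # no count reported for other plans

rows = [{"id": 1}, {"id": 2, "match": False}, {"id": 3}]
print(execute_on_node(["tablescan", "filter"], rows))
```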
S103: and acquiring the query request results and the tuple quantity of the plurality of computing nodes.
The coordinating node acquires the tuple quantities of the plurality of computing nodes and records them uniformly in the tuple statistics table. The tuple statistics table is mainly used for counting the tuple records and tuple quantity of each computing node, for comparing tuple quantities and for allocating migration data.
S104: and comparing the tuple quantity of the plurality of computing nodes to obtain a first computing node corresponding to the maximum tuple quantity and a second computing node corresponding to the minimum tuple quantity.
The coordinating node compares the tuple quantities of the plurality of computing nodes, and extracts a first computing node corresponding to the maximum tuple quantity and a second computing node corresponding to the minimum tuple quantity. By comparing the difference between the first computing node and the second computing node, the coordinating node can judge whether a data skew problem has occurred in the database.
It should be noted that the first computing node and the second computing node in this embodiment do not refer to fixed, specific computing nodes; they refer to the result of each comparison, i.e. the nodes holding the maximum and minimum tuple quantities at the time of counting. Because the storage locations of tuples and the stored data change dynamically, this approach allows real-time, dynamic detection of data skew.
S105: and when the difference between the maximum tuple quantity and the minimum tuple quantity reaches a preset condition, transferring the tuple data of the first computing node to the second computing node.
When the coordinating node finds that the difference in tuple quantity between the first computing node and the second computing node reaches a preset condition, it migrates part of the tuple data of the first computing node to the second computing node.
Specifically, the preset condition may be: the difference between the tuple quantity of the first computing node and the tuple quantity of the second computing node is greater than a preset fixed value; or the number of tuples of the first computing node is N times the number of tuples of the second computing node, wherein N > 1.
When the first computing node and the second computing node meet the preset condition, data skew has occurred, and the coordinating node needs to execute data migration to resolve it.
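Both forms of the preset condition described above can be expressed as a small check; the parameter names here are illustrative, not drawn from the patent:

```python
def reaches_preset_condition(max_count, min_count, fixed_gap=None, n=None):
    """Either form of the preset condition: the difference exceeds a preset
    fixed value, or the maximum is at least N times the minimum (N > 1)."""
    if fixed_gap is not None and max_count - min_count > fixed_gap:
        return True
    if n is not None and max_count >= n * min_count:
        return True
    return False

print(reaches_preset_condition(300000, 100000, n=2))              # 3x >= 2x
print(reaches_preset_condition(120000, 100000, fixed_gap=50000))  # gap too small
```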
Further, after the coordinating node completes one data migration task, steps S102 to S105 may be executed again, re-comparing the tuple quantities of the plurality of computing nodes. If the difference between the computing node with the maximum tuple quantity and the computing node with the minimum tuple quantity still meets the preset condition, data migration is executed again until the data skew problem is completely resolved.
In this embodiment, the coordinating node obtains query request information and generates an execution plan according to the query request information; distributes the execution plan to a plurality of computing nodes so that the computing nodes execute the query request according to the execution plan and record the number of scanned tuples; acquires the query request results and tuple quantities of the plurality of computing nodes; compares the tuple quantities of the plurality of computing nodes to obtain a first computing node corresponding to the maximum tuple quantity and a second computing node corresponding to the minimum tuple quantity; and, when the difference between the maximum and minimum tuple quantities reaches a preset condition, migrates the tuple data of the first computing node to the second computing node. By this method, the data skew problem can be discovered dynamically, data migration carried out, and optimal data distribution achieved.
On the basis of step 103 of the above database management method, the present application also proposes another specific database management method. Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of a database management method according to the present application.
As shown in fig. 3, the database management method of the present embodiment specifically includes the following steps:
S201: Acquiring each tuple record in the plurality of computing nodes and the corresponding record quantity.
The coordinating node acquires each tuple record in the plurality of computing nodes and the corresponding record quantity, and forms a tuple statistics table based on the record quantities. The tuple statistics table is specifically as follows:
| segment-1 | segment-2 | segment-n |
| --------- | --------- | --------- |
| 100000    | 200000    | 300000    |
At this time, if the threshold N is set to 1.5, then since the current maximum tuple quantity is 3 times the minimum, the preset condition of the above embodiment is met and data skew has occurred in the tuple statistics table.
S202: the number of records for each compute node is ordered the tuple records from big to small.
S203: and recording the tuple quantity of a plurality of computing nodes, the previous M tuple records of each computing node and the corresponding record quantity into a tuple statistical table.
The coordinating node can further obtain the tuple record of each computing node, and count the tuple quantity of each tuple record.
In order to reduce the calculation overhead, the coordinating node records only the top M tuple records by record quantity of each computing node into the tuple statistics table. At this time, the tuple statistics table is specifically as follows:
|        | segment-1        | segment-2        | segment-n        |
| ------ | ---------------- | ---------------- | ---------------- |
| Total  | 100000           | 200000           | 300000           |
| 1      | Zhe A1111: 50000 | Zhe A2222: 60000 | Zhe A3333: 80000 |
| 2      | Zhe A4444: 40000 | Zhe A5555: 50000 | Zhe A6666: 70000 |
| M      | Zhe A7777: 30000 | Zhe A8888: 40000 | Zhe A9999: 60000 |
For example, record "Zhe A1111: 50000" in the tuple statistics table indicates that there are 50000 records for tuple "Zhe A1111".
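A sketch of how such a tuple statistics table might be assembled: per node, the total tuple count plus the top M tuple records sorted in descending order of record quantity. The data structures are simplified assumptions, not the patent's actual implementation:

```python
from collections import Counter

def build_tuple_stats(node_tuples, m=3):
    """node_tuples maps node -> list of tuple keys scanned on that node.
    Returns, per node, the total count and the top-M records by count."""
    stats = {}
    for node, tuples in node_tuples.items():
        counts = Counter(tuples)
        stats[node] = {"total": len(tuples),
                       "top": counts.most_common(m)}  # sorted descending
    return stats

stats = build_tuple_stats({"segment-1": ["Zhe A1111"] * 5 + ["Zhe A4444"] * 4}, m=2)
print(stats["segment-1"])
```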
S204: and comparing the tuple quantity of the plurality of computing nodes to obtain a first computing node corresponding to the maximum tuple quantity and a second computing node corresponding to the minimum tuple quantity.
S205: and when the number of the tuples of the first computing node is at least twice of the number of the tuples of the second computing node, migrating the first tuple record of the first computing node to the second computing node, wherein the first tuple record is the tuple record with the largest record number in the first computing node.
The data migration method mainly carries out data migration through a greedy algorithm.
Specifically, the coordinating node first obtains a first computing node with the largest tuple quantity and a second computing node with the smallest tuple quantity, and then judges whether the difference between the first computing node and the second computing node meets a preset condition.
In this embodiment, when the number of tuples of the first computing node is at least twice the number of tuples of the second computing node, it is stated that the difference between the first computing node and the second computing node satisfies the preset condition. At this time, the coordinating node migrates the tuple record with the largest number in the first computing node, i.e., the first tuple record, to the second computing node.
Further, the coordinating node continues to judge whether the data skew problem persists in the updated tuple statistics table. If not, data migration is complete and the process proceeds to step S206; if so, the cycle of skew judgment and data migration continues until the migration-complete condition is met.
For example, the coordinating node migrates the tuple record "Zhe A3333", which has the largest record quantity in segment-n of the tuple statistics table, to segment-1. In the updated tuple statistics table, the total for segment-n is 300000 − 80000 = 220000, and the total for segment-1 is 100000 + 80000 = 180000. At this time, segment-n has the largest tuple quantity and segment-1 the smallest, and n = 220000/180000 ≈ 1.2 < 2, so the migration-complete condition is met, no further migration is needed, and the process proceeds to step S206.
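The greedy migration loop of this embodiment can be sketched as follows, reproducing the worked example above. The in-memory structures are simplified stand-ins for the tuple statistics table:

```python
def greedy_rebalance(totals, top_records, ratio=2.0):
    """Greedy migration sketch: while the node with the most tuples holds at
    least `ratio` times the tuples of the node with the fewest, move the
    biggest tuple record of the largest node to the smallest node."""
    moves = []
    while True:
        first = max(totals, key=totals.get)
        second = min(totals, key=totals.get)
        if totals[first] < ratio * totals[second] or not top_records[first]:
            return moves   # migration-complete condition met
        record, count = top_records[first].pop(0)  # largest record on `first`
        totals[first] -= count
        totals[second] += count
        moves.append((record, first, second))

totals = {"segment-1": 100000, "segment-2": 200000, "segment-n": 300000}
tops = {"segment-1": [("Zhe A1111", 50000)],
        "segment-2": [("Zhe A2222", 60000)],
        "segment-n": [("Zhe A3333", 80000)]}
print(greedy_rebalance(totals, tops))  # one move: Zhe A3333 to segment-1
print(totals)
```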
S206: and storing data migration information in a migration information table, wherein the data migration information comprises the first tuple record and the second computing node.
The coordinating node may further store the data migration information in step 205 in a migration information table, where the migration information table describes a change condition of the tuple statistics table, and the data migration condition in the above example is stored in the migration information table:
data of | Position of |
Zhejiang A3333 | segment-1 |
Specifically, when the data in the tuple statistics table changes again, the migration information table must be maintained. On a data update, for example if the record is subsequently migrated to segment-2, the corresponding location information must be changed to segment-2; on a data deletion, for example if the record is migrated back to its original location segment-n, the entry must be deleted.
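A minimal sketch of maintaining the migration information table under these update and deletion rules, using a hypothetical in-memory dictionary as a stand-in for the actual table:

```python
migration_table = {}   # migrated tuple record -> current location

def record_migration(record, dest):
    """Store or update (record, destination); re-migration overwrites the
    previous location, matching the update rule described above."""
    migration_table[record] = dest

def remove_migration(record):
    """Data migrated back to its original node: drop the entry."""
    migration_table.pop(record, None)

record_migration("Zhe A3333", "segment-1")   # initial migration
record_migration("Zhe A3333", "segment-2")   # record moved again: update
print(migration_table)
remove_migration("Zhe A3333")                # migrated back to segment-n
print(migration_table)
```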
On the basis of step 102 of the above database management method, the present application also proposes another specific database management method. Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a database management method according to a third embodiment of the present application.
As shown in fig. 4, the database management method of the present embodiment specifically includes the following steps:
s301: and the coordination node searches whether a related record exists in the migration information table according to the execution plan.
Before distributing the execution plan to each computing node, the coordinating node may first search whether a relevant record exists in the migration information table according to the execution plan. If yes, go to step 302; if not, go to step 303.
S302: and directly acquiring the tuple data of the corresponding position according to the migration information table.
The coordination node directly acquires data from the position recorded in the migration information table.
S303: and searching the position of the tuple data according to the original data distribution strategy.
And the computing node searches the position of the tuple data according to the execution plan and the original data distribution strategy. The data distribution policy may be: hash distribution, random (e.g., round robin) distribution, range distribution, list distribution, or the like.
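The listed distribution strategies can be sketched as follows; `hash_node`, `round_robin`, and `range_node` are hypothetical helper names chosen for illustration:

```python
import hashlib
from itertools import count

def hash_node(key, n_nodes):
    """Hash distribution: node chosen by a hash of the distribution key."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_nodes

def round_robin(n_nodes):
    """Random (round-robin) distribution: rows dealt out in rotation."""
    for i in count():
        yield i % n_nodes

def range_node(key, boundaries):
    """Range distribution: node chosen by which key range the value falls in.
    `boundaries` lists the exclusive upper bound of each node's range."""
    for node, upper in enumerate(boundaries):
        if key < upper:
            return node
    return len(boundaries)   # overflow range on the last node

rr = round_robin(3)
print([next(rr) for _ in range(5)])       # rows cycle over the 3 nodes
print(range_node(250, [100, 300, 500]))   # 250 falls in the second range
```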
On the basis of the above embodiments, the present application also provides another specific database management method. Referring to fig. 5, fig. 5 is a schematic flowchart illustrating a database management method according to a fourth embodiment of the present application.
As shown in fig. 5, the database management method of the present embodiment specifically includes the following steps:
s401: and the coordination node acquires the query request information of the client and generates an execution plan according to the query request information.
S402: the co-regulation point distributes the execution plan to several compute nodes.
S403: and the plurality of computing nodes execute the query request according to the execution plan, record the number of the scanned tuples and return the query request result and the tuple number to the coordination node.
S404: and the coordination node acquires the query request result and the tuple quantity, compares the tuple quantity of the plurality of computing nodes, and acquires a first computing node corresponding to the maximum tuple quantity and a second computing node corresponding to the minimum tuple quantity.
S405: and when the difference between the maximum value of the tuple quantity and the minimum value of the tuple quantity reaches a preset condition, the coordination node migrates the tuple data of the first computing node to the second computing node.
In order to implement the database management method of the foregoing embodiment, the present application further provides a database management apparatus, and please refer to fig. 6 specifically, where fig. 6 is a schematic structural diagram of an embodiment of the database management apparatus provided in the present application.
As shown in fig. 6, the database management apparatus 600 of the present embodiment includes a processor 61, a memory 62, an input-output device 63, and a bus 64.
The processor 61, the memory 62, and the input/output device 63 are respectively connected to the bus 64, the memory 62 stores a computer program, and the processor 61 is configured to execute the computer program to implement the database management method according to the above embodiment.
It should be noted that the database management apparatus 600 according to the first to third embodiments of the database management method may be a server on which the function of the coordination node is deployed, and the database management apparatus 600 according to the fourth embodiment of the database management method may be a server cluster or a distributed server on which the database system of fig. 2 is deployed.
In the present embodiment, the processor 61 may also be referred to as a CPU (Central Processing Unit). The processor 61 may be an integrated circuit chip having signal processing capabilities. The processor 61 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The processor 61 may also be a GPU (Graphics Processing Unit), also called a display core, visual processor, or display chip, which is a microprocessor dedicated to graphics operations on personal computers, workstations, game consoles, and some mobile devices (such as tablet computers and smartphones). The GPU converts and drives the display information required by a computer system, provides line-scanning signals to the display, and controls its correct operation; it is an important element connecting the display to the mainboard and one of the important devices for human-computer interaction. The graphics card, an important component of a computer host, is responsible for outputting display graphics and is especially important for professional graphic design work. A general-purpose processor may be a microprocessor, or the processor 61 may be any conventional processor or the like.
The present application also provides a computer storage medium, as shown in fig. 7, the computer storage medium 700 is used for storing a computer program 71, and the computer program 71, when executed by a processor, is used for implementing the method as described in the database management method embodiment of the present application.
The method involved in the embodiments of the database management method of the present application, when implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a device such as a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing over the prior art, or all or part of the technical solution, may be embodied in a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only an embodiment of the present application and is not intended to limit its scope; all equivalent structural or process modifications made using the contents of this specification and the drawings, whether applied directly or indirectly in other related technical fields, are likewise included within the scope of patent protection of the present application.
Claims (10)
1. A database management method, wherein the database is a distributed database, the database management method comprising:
acquiring query request information and generating an execution plan according to the query request information;
distributing the execution plan to a plurality of computing nodes so that the computing nodes execute query requests according to the execution plan and record the number of scanned tuples;
obtaining the query request results and the tuple quantity of the plurality of computing nodes;
comparing the tuple quantity of the plurality of computing nodes to obtain a first computing node corresponding to the maximum tuple quantity and a second computing node corresponding to the minimum tuple quantity;
and when the difference between the maximum tuple quantity and the minimum tuple quantity reaches a preset condition, transferring the tuple data of the first computing node to the second computing node.
2. The database management method according to claim 1,
the step of obtaining the query request results and the tuple number of the plurality of computing nodes comprises:
acquiring each tuple record in the plurality of computing nodes and the corresponding record quantity;
sorting the tuple records of each computing node in descending order of record quantity;
and recording the tuple quantity of the plurality of computing nodes, the top M tuple records of each computing node, and the corresponding record quantities in a tuple statistics table.
3. The database management method according to claim 2,
when the difference between the maximum tuple quantity and the minimum tuple quantity reaches a preset condition, the step of migrating the tuple data of the first computing node to the second computing node comprises the following steps:
migrating the tuple data of the first compute node to the second compute node when the number of tuples of the first compute node is at least twice the number of tuples of the second compute node.
4. The database management method according to claim 2,
the step of migrating the tuple data of the first computing node to the second computing node comprises:
and migrating a first tuple record of the first computing node to the second computing node, wherein the first tuple record is the tuple record with the largest number of records in the first computing node.
5. The database management method according to claim 4,
after the step of migrating the first tuple record of the first computing node to the second computing node, the database management method further comprises:
storing data migration information in a migration information table, wherein the data migration information includes the first tuple record and a second compute node.
6. The database management method according to claim 5,
the step of executing the query request by the plurality of computing nodes according to the execution plan respectively comprises the following steps:
searching whether a related record exists in the migration information table according to the execution plan;
and if so, directly acquiring the tuple data of the corresponding position according to the migration information table.
7. The database management method according to claim 1,
the step of distributing the execution plan to a plurality of computing nodes so that the plurality of computing nodes execute the query request according to the execution plan and record the number of scanned tuples includes:
determining whether the execution plan includes a sequential table scan operator;
if yes, the plurality of computing nodes execute the query request according to the execution plan, record the number of scanned tuples and return the query request result and the number of tuples;
if not, the plurality of computing nodes respectively execute the query request according to the execution plan and return the query request result.
8. A database management method is applied to a database management system, wherein the database management system comprises a coordination node and a plurality of computing nodes; the database management method comprises the following steps:
the coordination node acquires query request information of a client and generates an execution plan according to the query request information;
the coordination node distributes the execution plan to the plurality of computing nodes;
the plurality of computing nodes execute the query request according to the execution plan, record the number of scanned tuples and return the query request result and the tuple number to the coordination node;
the coordination node acquires the query request result and the tuple quantity, compares the tuple quantity of the plurality of computing nodes, and acquires a first computing node corresponding to the maximum tuple quantity and a second computing node corresponding to the minimum tuple quantity;
when the difference between the maximum tuple quantity and the minimum tuple quantity reaches a preset condition, the coordination node migrates the tuple data of the first computing node to the second computing node.
9. A database management apparatus, characterized in that the database management apparatus comprises a processor and a memory; the memory has stored therein a computer program for execution by the processor to perform the steps of the database management method according to any of claims 1 to 8.
10. A computer storage medium storing a computer program which, when executed, performs the steps of a database management method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010584234.0A CN111813761B (en) | 2020-06-23 | 2020-06-23 | Database management method, device and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111813761A true CN111813761A (en) | 2020-10-23 |
CN111813761B CN111813761B (en) | 2024-07-12 |
Family
ID=72845949
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010584234.0A Active CN111813761B (en) | 2020-06-23 | 2020-06-23 | Database management method, device and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111813761B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113392092A (en) * | 2021-07-06 | 2021-09-14 | 山东电力工程咨询院有限公司 | Database management method and system of data center |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101013986A (en) * | 2007-02-02 | 2007-08-08 | 南京邮电大学 | Method for realizing data inquiring system of sensor network based on middleware of mobile agent |
US20170046386A1 (en) * | 2015-08-11 | 2017-02-16 | Sybase, Inc. | Accelerating database queries using equivalence union enumeration |
CN107885780A (en) * | 2017-10-12 | 2018-04-06 | 北京人大金仓信息技术股份有限公司 | A kind of performance data collection method performed for distributed query |
CN110213172A (en) * | 2019-05-17 | 2019-09-06 | 华中科技大学 | Stream based on dynamic load monitoring connects system load balancing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US12032573B2 (en) | Structured cluster execution for data streams | |
US9015441B2 (en) | Memory usage scanning | |
US10769126B1 (en) | Data entropy reduction across stream shard | |
CN107251023B (en) | Mixed data distribution in large-scale parallel processing architecture | |
CN108959510B (en) | Partition level connection method and device for distributed database | |
US10127281B2 (en) | Dynamic hash table size estimation during database aggregation processing | |
US20240126817A1 (en) | Graph data query | |
US20120290615A1 (en) | Switching algorithms during a run time computation | |
EP3992808A1 (en) | Data processing method and related apparatus | |
WO2021258512A1 (en) | Data aggregation processing apparatus and method, and storage medium | |
US11048680B2 (en) | Hive table scanning method, device, computer apparatus and storage medium | |
US20230325375A1 (en) | Measuring and improving index quality in a distrubuted data system | |
US11068484B2 (en) | Accelerating queries with complex conditions using zone map enhancements | |
US20240220334A1 (en) | Data processing method in distributed system, and related system | |
CN111813761B (en) | Database management method, device and computer storage medium | |
CN112764935B (en) | Big data processing method and device, electronic equipment and storage medium | |
CN113806354B (en) | Method and device for realizing time sequence feature extraction | |
CN115794806A (en) | Gridding processing system, method and device for financial data and computing equipment | |
CN110990394B (en) | Method, device and storage medium for counting number of rows of distributed column database table | |
US11550793B1 (en) | Systems and methods for spilling data for hash joins | |
US11822582B2 (en) | Metadata clustering | |
US20180137173A1 (en) | Fast Aggregation on Compressed Data | |
CN115544321B (en) | Method and device for realizing graph database storage and storage medium | |
US11816088B2 (en) | Method and system for managing cross data source data access requests | |
CN113297333A (en) | Data processing method, device, server and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||