CN114817311A

CN114817311A - Parallel computing method applied to GaussDB database stored procedures

Info

Publication number: CN114817311A
Application number: CN202210391361.8A
Authority: CN
Inventors: 邹昌根; 刘建; 高星; 龚丽丽
Original assignee: Shanghai Zhangshu Technology Co ltd
Current assignee: Shanghai Zhangshu Technology Co ltd
Priority date: 2022-04-14
Filing date: 2022-04-14
Publication date: 2022-07-29
Anticipated expiration: 2042-04-14
Also published as: CN114817311B

Abstract

The invention discloses a parallel computing method applied to GaussDB database stored procedures, which comprises the following steps: adding processing code for the parallel feature to the GaussDB database engine, and compiling a GaussDB database kernel that supports the parallel feature; adding control logic for the number of parallel paths to the GaussDB stored procedure, so that the stored procedure can dynamically enable and disable parallelism; and designating a distribution field for the GaussDB database and enabling the parallel feature. By applying the method, the multi-core parallel capability of the underlying resources can be used to the fullest extent, and the performance degradation caused by data skew is avoided.

Description

Parallel computing method applied to GaussDB database stored procedures

Technical Field

The invention belongs to the field of computer technology, and in particular relates to a parallel computing method applied to GaussDB database stored procedures.

Background

Mainstream computer systems on the market can perform parallel computing using multi-CPU or multi-core CPU technology. This is called symmetric multiprocessing (SMP): a group of processors is assembled in one computer, and the CPUs share the memory subsystem and the bus structure. SMP is more widely used than asymmetric multiprocessing. GaussDB (for openGauss) is a commercial database product released by Huawei; it supports this important feature, and Huawei has opened the feature to the openGauss community.

The SMP feature improves performance through operator parallelism, at the cost of consuming more system resources, including CPU, memory and I/O. In essence, SMP trades resources for time, and it delivers a good performance improvement in suitable scenarios when resources are sufficient. In general, SMP suits analytical query scenarios, which are characterized by long individual queries and low service concurrency. SMP parallelism reduces query latency and improves system throughput.

GaussDB is a commercial database product that supports the SMP feature, but its SMP parallelism covers only a small number of operators, such as Scan, Join, Agg and Stream. It is not supported in the following scenarios: index scans, MergeJoin, cursors, queries inside stored procedures and functions, subqueries, queries on global temporary tables, materialized view updates, and the like. However, because security core business systems make heavy use of stored procedures, there is an urgent need for a method that supports the SMP feature in GaussDB stored procedures and applies it sensibly; such a method can greatly improve the running performance of stored procedures.

In view of the above, the present invention is particularly proposed.

Disclosure of Invention

The first purpose of the invention is to provide a parallel computing method applied to GaussDB database stored procedures. The method optimizes part of the logic of the GaussDB kernel so that a stored procedure can enable the SMP parallel feature by itself within its session, which can greatly improve the efficiency of batch processing in GaussDB.

The second purpose of the invention is to provide a parallel computing system applied to GaussDB database stored procedures, designed on the basis of the above method. Both the system and the method can be used in service scenarios involving large-batch data loading, updating and complex queries.

In order to achieve the above purpose of the present invention, the following technical solutions are adopted:

The method comprises the following steps:

adding a processing component for the parallel feature to the GaussDB database engine, and compiling a GaussDB database kernel that supports the parallel feature on the basis of that component;

adding control logic for the number of parallel paths to the GaussDB stored procedure, so that the stored procedure can dynamically enable and disable parallelism;

designating a distribution field for the GaussDB database, and enabling the parallel feature;

and calling the stored procedure with the parallel feature enabled through JDBC, and performing the computation in the GaussDB database kernel to obtain a calculation result set.

Preferably, after the parallel feature is enabled, the method further comprises the following steps:

installing and deploying the compiled GaussDB database kernel;

importing the data model of the stored procedure into the GaussDB database kernel;

and importing service data into the GaussDB database kernel, and executing the stored procedure with the parallel feature enabled on that service data.

The database kernel must be installed and deployed on a computer and operating system that support SMP multi-core parallel processing; it is installed and deployed together with the compiled GaussDB binaries.

Once the GaussDB binaries are set up, the SMP feature of the stored procedure is enabled, and the data model is imported, including service tables, stored procedures, views and functions.

After the data model is imported, the service data to be processed is imported as well, and the stored procedure with the parallel feature enabled is executed on that data according to the service scheduling requirements.

Preferably, the computation in the GaussDB database kernel comprises the following steps:

passing the SQL statement to be executed into the GaussDB database kernel, and parsing the SQL statement to obtain a parsing result;

splitting the parsing result by data slicing, and performing the data computation on each slice separately to obtain a plurality of calculation results;

and merging, counting and aggregating the calculation results in the manner specified in the SQL statement, then sorting and filtering them according to the conditions to generate a calculation result set.

The SQL statement to be executed is parsed by the query optimizer. The query optimizer parses the SQL statement passed into the kernel, splits the data-retrieval logic of the statement according to the data slicing method, and sends the resulting parts to different execution units for separate processing; the execution units support multithreaded parallelism.

The execution units compute their slices and return the calculation results as they are produced, so that merging, aggregation and counting proceed until every execution unit has returned its results. Finally, all calculation results are sorted and filtered to generate the calculation result set.
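The split, compute and merge flow described above can be illustrated outside the database. The following is a minimal Java sketch, not the kernel implementation: it assumes an in-memory list of integer IDs, slices them by ID modulo the number of parallel paths (a stand-in for slicing on a distribution field), sums each slice on its own thread, and merges the partial sums. The class name SliceAndMerge and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration of slice -> parallel compute -> merge; not GaussDB kernel code. */
final class SliceAndMerge {

    /** Splits ids into `paths` slices by id modulo paths (assumed stand-in for the distribution field). */
    static List<List<Integer>> slice(List<Integer> ids, int paths) {
        List<List<Integer>> slices = new ArrayList<>();
        for (int i = 0; i < paths; i++) {
            slices.add(new ArrayList<>());
        }
        for (int id : ids) {
            slices.get(Math.floorMod(id, paths)).add(id);
        }
        return slices;
    }

    /** Sums each slice on its own worker thread, then merges the partial sums. */
    static long parallelSum(List<Integer> ids, int paths) {
        ExecutorService pool = Executors.newFixedThreadPool(paths);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Integer> s : slice(ids, paths)) {
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int v : s) {
                        sum += v;
                    }
                    return sum;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // merge step: wait for each execution unit
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            ids.add(i);
        }
        System.out.println(parallelSum(ids, 4)); // prints 55
    }
}
```

In this sketch the merge waits for every partial result before returning, which mirrors the description that aggregation continues until all execution units have returned their data.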

Preferably, the number of parallel paths is set before the SQL statement is parsed and is restored to one path after the SQL statement has been parsed, which avoids excessive use of system resources.

Preferably, the data slicing method comprises the following step:

vertically slicing the parsing result according to the designated distribution field, so that each slice carries the same amount of computation, which improves the efficiency of parallel computation.

The second purpose of the invention is to provide a parallel computing system applied to GaussDB database stored procedures, comprising:

a compiling module, which adds a processing component for the parallel feature to the GaussDB database engine and compiles a GaussDB database kernel that supports the parallel feature on the basis of that component;

a modification module, which adds control logic for the number of parallel paths to the GaussDB stored procedure, so that the stored procedure can dynamically enable or disable parallelism;

a starting module, which designates a distribution field for the GaussDB database and enables the parallel feature;

an execution module, which calls the stored procedure with the parallel feature enabled through JDBC and performs the computation in the GaussDB database kernel to obtain a calculation result set.

Also disclosed is a computer-readable storage medium having a computer program stored thereon which, when executed, performs the steps of the above-described method.

Also disclosed is a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.

Compared with the prior art, the invention has the beneficial effects that:

(1) The method can be used in service scenarios involving large-batch data loading, updating and complex queries, and it has high promotion value and strong practicability.

(2) The multi-core parallel capability of the underlying resources can be used to the fullest extent, and the performance degradation caused by data skew is avoided.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flowchart of the parallel computing method applied to GaussDB database stored procedures according to this embodiment;

FIG. 2 is a diagram of the parallel computing system applied to GaussDB database stored procedures according to this embodiment;

FIG. 3 is a schematic flowchart provided by this embodiment;

FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings and the detailed description, but those skilled in the art will understand that the following described embodiments are some, not all, of the embodiments of the present invention, and are only used for illustrating the present invention, and should not be construed as limiting the scope of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to more clearly explain the technical solution of the present invention, the following description is made in terms of specific embodiments.

As shown in FIGS. 1 to 3, this embodiment provides a parallel computing method applied to GaussDB database stored procedures, and the method includes the following steps:

S101: adding a processing component for the parallel feature to the GaussDB database engine, and compiling a GaussDB database kernel that supports the parallel feature on the basis of that component;

S102: adding control logic for the number of parallel paths to the GaussDB stored procedure, so that the stored procedure can dynamically enable and disable parallelism;

S103: designating a distribution field for the GaussDB database, and enabling the parallel feature;

S104: calling the stored procedure with the parallel feature enabled through JDBC, and performing the computation in the GaussDB database kernel to obtain a calculation result set.

It should be noted that, before step S101, a stored procedure must be created in the GaussDB database, for example with "CREATE PROCEDURE prod_XXX (out result int)", where "prod_XXX" is the stored procedure being created.

In step S102, the added control logic sets the number of parallel paths: "SET query_dop = 4" enables 4-way parallelism, and "SET query_dop = 1" sets the number of parallel paths to 1, that is, restores the default setting.

In step S103, a reasonable distribution field must be designated. In this embodiment, the distribution field is designated by ID hash. Once the distribution field is designated, each compute node (Worker) automatically processes the data slices that belong to it according to the distribution field, and moves on to other computing logic as soon as its computation completes. If no reasonable distribution field is designated in step S103, the compute nodes receive unequal amounts of sliced data, their execution times differ widely, and parallel efficiency is low.

In this embodiment, vertical slicing is chosen as the data slicing mode, which ensures that the slices carry the same amount of computation and thus maximizes the efficiency of parallel computation.

Multi-table queries usually join tables on a column. In that case, if the amount of computation on the join column of the main table is far greater than on the other columns, the database kernel, when running in parallel, automatically redistributes the table data by hash, and one parallel thread can end up with far more work than the others. This produces a long-tail effect and poor parallel performance. Likewise, when a table is aggregated, if the values of the aggregation column are unevenly distributed across the data nodes, one parallel thread does far more work than the others in parallel mode, and the parallel performance is again poor.

The data slicing mode provided by this embodiment therefore distributes data relatively evenly, so that each parallel thread is assigned the same or approximately the same amount of computation, which maximizes parallel performance. In addition, analyzing, counting and computing over the service data makes it possible to choose reasonable join columns, aggregation columns and the like, which yields greater parallel gains.
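The effect of the distribution field on load balance can be made concrete with a small simulation. The sketch below is illustrative only and does not reproduce the GaussDB redistribution logic: it assumes rows are routed to workers by the hash code of a key modulo the worker count, and it reports the ratio of the busiest worker's row count to the average. The class DistributionSkew and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical skew estimator; the routing rule (hashCode modulo workers) is an assumption. */
final class DistributionSkew {

    /** Counts how many rows each worker receives when rows are routed by the hash of their key. */
    static int[] rowsPerWorker(List<?> keys, int workers) {
        int[] counts = new int[workers];
        for (Object key : keys) {
            counts[Math.floorMod(key.hashCode(), workers)]++;
        }
        return counts;
    }

    /** Ratio of the busiest worker's load to the average load; 1.0 means perfectly balanced. */
    static double skewRatio(int[] counts) {
        int max = Arrays.stream(counts).max().orElse(0);
        double avg = Arrays.stream(counts).average().orElse(0.0);
        return avg == 0.0 ? 0.0 : max / avg;
    }

    public static void main(String[] args) {
        List<Integer> uniqueIds = new ArrayList<>();
        List<Integer> constantKey = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            uniqueIds.add(i);   // a unique ID column spreads rows evenly
            constantKey.add(1); // a single-valued column sends every row to one worker
        }
        System.out.println(skewRatio(rowsPerWorker(uniqueIds, 4)));   // prints 1.0
        System.out.println(skewRatio(rowsPerWorker(constantKey, 4))); // prints 4.0
    }
}
```

A ratio well above 1.0 indicates the long-tail situation described above, in which one parallel thread finishes much later than the rest.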

Preferably, the number of parallel paths is set before the SQL statement is parsed, for example to 4 with the statement "SET query_dop = 4", and is restored to one path with the statement "SET query_dop = 1" after the SQL statement has been parsed. This prevents the parallel setting from triggering a large number of parallel computations that would consume excessive system resources.
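The set-then-restore pattern can be sketched as a small helper that brackets one SQL statement with the two SET statements. This is an illustration under assumptions rather than the patented control logic: the class DopControl, its methods and the sample statement are hypothetical, and whether SET may be issued at a given point inside a stored procedure depends on the modified kernel described in this document.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that brackets a statement with query_dop settings. */
final class DopControl {

    /** One path, i.e. the value the description restores after the statement. */
    static final int DEFAULT_DOP = 1;

    /** Builds the SET statement for the given number of parallel paths. */
    static String setDop(int dop) {
        if (dop < 1) {
            throw new IllegalArgumentException("dop must be >= 1");
        }
        return "SET query_dop = " + dop;
    }

    /** Returns the statements in execution order: raise the DOP, run the SQL, restore one path. */
    static List<String> wrapWithDop(String sql, int dop) {
        return Arrays.asList(setDop(dop), sql, setDop(DEFAULT_DOP));
    }

    public static void main(String[] args) {
        // The INSERT statement and table names are placeholders for illustration.
        for (String statement : wrapWithDop("INSERT INTO t_target SELECT * FROM t_source", 4)) {
            System.out.println(statement + ";");
        }
    }
}
```

Restoring the value immediately after the heavy statement keeps the remaining statements in the session on a single path, which is the resource-saving behavior the paragraph above describes.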

In step S104, the stored procedure with the parallel feature enabled is called through JDBC (Java Database Connectivity). A connection to the database is first created with a statement of the form "Connection conn = dbu. [...]". An interface for executing the stored procedure is then created on that connection with the statement "CallableStatement cs = conn.prepareCall("call prod_XXX()")", where "cs" is the created interface and "prod_XXX()" is the stored procedure. Once the interface to the stored procedure is in place, it is executed and the data it returns is output through the methods of "cs" [...].
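A JDBC call of this kind might look like the sketch below. It uses only standard java.sql calls (prepareCall, registerOutParameter, execute, getInt); the class StoredProcedureCall, the helper callSql and the use of the "{call ...}" escape syntax with a single integer OUT parameter are assumptions made for illustration, and the exact call string and OUT-parameter handling accepted by a particular GaussDB JDBC driver may differ. How the Connection is obtained is left to the caller, since the source does not show it in full.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;

/** Hypothetical JDBC wrapper for calling a stored procedure with one integer OUT parameter. */
final class StoredProcedureCall {

    /** Builds a JDBC call-escape string with one "?" placeholder per parameter. */
    static String callSql(String procedureName, int parameterCount) {
        StringBuilder sb = new StringBuilder("{call ").append(procedureName).append('(');
        for (int i = 0; i < parameterCount; i++) {
            sb.append(i == 0 ? "?" : ", ?");
        }
        return sb.append(")}").toString();
    }

    /** Executes the procedure on an existing connection and returns its integer OUT value. */
    static int callWithIntResult(Connection conn, String procedureName) throws SQLException {
        try (CallableStatement cs = conn.prepareCall(callSql(procedureName, 1))) {
            cs.registerOutParameter(1, Types.INTEGER); // assumed to map to "out result int"
            cs.execute();
            return cs.getInt(1);
        }
    }

    public static void main(String[] args) {
        // Only the call string is printed here; callWithIntResult needs a live connection.
        System.out.println(callSql("prod_XXX", 1));
    }
}
```

The try-with-resources block closes the CallableStatement even if execution fails, so the connection can be reused for the next scheduled procedure call.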

This embodiment further provides a parallel computing system applied to GaussDB database stored procedures, and the system includes:

the compiling module 201, which adds a processing component for the parallel feature to the GaussDB database engine and compiles a GaussDB database kernel that supports the parallel feature on the basis of that component;

the modification module 202, which adds control logic for the number of parallel paths to the GaussDB stored procedure, so that the stored procedure can dynamically enable or disable parallelism;

the starting module 203, which designates a distribution field for the GaussDB database and enables the parallel feature;

the execution module 204, which calls the stored procedure with the parallel feature enabled through JDBC and performs the computation in the GaussDB database kernel to obtain a calculation result set.

Fig. 4 is a schematic structural diagram of a computer device disclosed by the invention. Referring to fig. 4, the computer apparatus includes: an input device 63, an output device 64, a memory 62 and a processor 61; the memory 62 for storing one or more programs; when the one or more programs are executed by the one or more processors 61, the one or more processors 61 are caused to implement the parallel computing method provided in the above embodiment; wherein the input device 63, the output device 64, the memory 62 and the processor 61 may be connected by a bus or other means, as exemplified by the bus connection in fig. 4.

The memory 62 is a computer readable and writable storage medium, and can be used for storing a software program, a computer executable program, and program instructions corresponding to the method according to the embodiment of the present application; the memory 62 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like; further, the memory 62 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device; in some examples, the memory 62 may further include memory located remotely from the processor 61, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 63 is operable to receive input numeric or character information and to generate key signal inputs relating to user settings and function control of the apparatus; the output device 64 may include a display device such as a display screen.

The processor 61 executes various functional applications of the device and data processing by executing software programs, instructions, and modules stored in the memory 62.

The computer device provided above can be used to execute the parallel computing method provided in the above embodiments, and has corresponding functions and advantages.

Embodiments of the present application also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the parallel computing method provided in the above embodiments. The storage medium may be any of various types of memory devices or storage devices, including: installation media such as CD-ROM, floppy disk or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory or magnetic media (e.g., hard disk or optical storage); registers or other similar types of memory elements; and other types of memory or combinations thereof. The storage medium may also comprise two or more storage media residing in different locations, for example in different computer systems connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) that are executable by one or more processors.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A parallel computing method applied to GaussDB database stored procedures, characterized by comprising the following steps:

2. The method of claim 1, further comprising, after turning on the parallel feature, the steps of:

installing and deploying the compiled GaussDB database kernel;

importing the data model of the stored procedure into the GaussDB database kernel;

3. The method of claim 1, wherein the method of performing computations in the GaussDB database kernel comprises the steps of:

4. The method according to claim 3, wherein a concurrent path number is set before the SQL statement is parsed, and the concurrent path number is restored to one path after the SQL statement is parsed, thereby avoiding excessive occupation of system resources.

5. The method of claim 3, wherein the data slicing method comprises the steps of:

6. System based on the method according to any of the preceding claims 1-5, characterized in that it comprises:

7. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed, carries out the steps of the method of any one of claims 1 to 5.

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-5 are implemented when the program is executed by the processor.