CN118035270A - Data query method, device, software program, equipment and storage medium - Google Patents

Data query method, device, software program, equipment and storage medium

Info

Publication number
CN118035270A
CN118035270A
Authority
CN
China
Prior art keywords
materialized view
data
query language
structured query
language instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211389714.7A
Other languages
Chinese (zh)
Inventor
叶强盛
蒋杰
刘煜宏
陈鹏
程广旭
杜佶峻
宾莉金
张韶全
薛文伟
唐文慧
陈奕安
黄俊奕
韩正汀
王刚
陈九天
王子贤
刘攀
邹若晨
尚晓慧
龙跃
赵裕隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211389714.7A
Publication of CN118035270A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data query method, a data query device, a software program, electronic equipment and a storage medium. The method comprises the following steps: rewriting an original structured query language instruction by using metadata of a first materialized view to obtain a target structured query language instruction; when the physical data of the first materialized view is of a partition table type, performing predicate compensation on a second materialized view matched with the target structured query language instruction to obtain a third materialized view; and performing a data query in the third materialized view to obtain a query result of the original structured query language instruction. By performing predicate compensation on the materialized view, the performance of the computing engine and the performance of the materialized view are preserved while an accurate data query result is obtained.

Description

Data query method, device, software program, equipment and storage medium
Technical Field
The present invention relates to data query technology, and in particular, to a data query method, apparatus, system, software program, electronic device, and storage medium.
Background
In the field of database technology, materialized views are very widely used. It is understood that a materialized view is a special physical table, and that a materialized view itself stores data as opposed to a normal view.
Users need to query data based on materialized views, but the related art supports only full materialized view updates, which reduces the performance of the data processing computing engine; if a materialized view update is forcibly designated, the performance of the materialized view is reduced and the accuracy of the data query result is affected.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a data query method, apparatus, software program, electronic device, and storage medium, which can update a materialized view by performing predicate compensation on the materialized view, ensure the performance of a computing engine, and also ensure the performance of the materialized view, so as to obtain an accurate data query result.
The technical scheme of the embodiment of the invention is realized as follows:
The embodiment of the invention provides a data query method, which comprises the following steps:
An original structured query language instruction is received,
Acquiring a first materialized view matched with the original structured query language instruction;
Utilizing the metadata of the first materialized view to rewrite the original structured query language instruction to obtain a target structured query language instruction;
When the physical data of the first materialized view is of a partition table type, predicate compensation is carried out on a second materialized view matched with the target structured query language instruction, and a third materialized view is obtained;
and carrying out data query in the third materialized view to obtain a query result of the original structured query language instruction.
The embodiment of the invention also provides a data query device, which comprises:
the information transmission device is used for receiving the original structured query language instruction;
the information processing device is used for acquiring a first materialized view matched with the original structured query language instruction;
the information processing device is used for utilizing the metadata of the first materialized view to rewrite the original structured query language instruction to obtain a target structured query language instruction;
The information processing device is used for performing predicate compensation on the second materialized view matched with the target structured query language instruction when the physical data of the first materialized view is of a partition table type, so as to obtain a third materialized view;
The information processing device is used for carrying out data query in the third materialized view to obtain a query result of the original structured query language instruction.
In the above scheme, the information processing device is configured to parse a creating grammar of the materialized view through a grammar parser to obtain an increment supported by the materialized view and a logic plan grammar tree, where the creating grammar of the materialized view supports designating a materialized view table as a partition table;
The information processing device is used for analyzing the logic plan grammar tree to obtain the boundary of the increment of the materialized view;
the information processing device is used for analyzing the logic plan grammar tree to obtain a table building element of a return result;
The information processing device is used for creating a physical table by utilizing the table creating element, executing the creating grammar of the materialized view according to the boundary of the materialized view increment and obtaining the first materialized view.
In the above scheme, the information processing apparatus is configured to determine a partition field and a predicate type of physical data of the first materialized view;
The information processing device is used for compensating the range of the first boundary predicate of the partition field by utilizing the predicate type to obtain the range of the second boundary predicate;
and the information processing device is used for adjusting the second materialized view according to the range of the second boundary predicate to obtain the third materialized view.
In the above scheme, the information processing device is configured to obtain a modification instruction of the structured query language;
The information processing device is used for analyzing the modification instruction of the structured query language to obtain an increment boundary of the second materialized view;
The information processing device is used for replacing the increment boundary of the first materialized view through the increment boundary of the second materialized view to obtain the second materialized view.
In the above scheme, the information processing device is configured to receive job data to be processed, and submit the job data to be processed to a cluster resource manager, where the job data to be processed includes at least two original structured query language instructions;
the information processing device is used for triggering corresponding components according to the job data to be processed through the cluster resource manager and converting an original structured query language instruction in the job data to be processed into a task matched with a target calculation engine;
The information processing device is configured to perform data query by using a third materialized view set corresponding to the job data to be processed through the target computing engine, so as to obtain a query result of the job data to be processed, where the third materialized view set includes at least two third materialized views.
In the above scheme, the information processing device is configured to trigger, by using the cluster resource manager, one node manager in different service clusters according to the job data to be processed;
The information processing device is used for starting a job manager of the resource scheduling system through the triggered node manager and converting an object-oriented query language instruction in the job data to be processed into a task matched with a corresponding computing engine through the job manager of the resource scheduling system;
the information processing device is used for triggering the job manager of the computing engine through the job manager of the resource scheduling system.
In the above scheme, the information processing device is configured to detect physical data of the first materialized view;
The information processing device is used for triggering a predicate compensation process when the physical data of the first materialized view is of a partition table type, wherein the predicate compensation process is used for performing predicate compensation on a second materialized view;
The information processing device is used for sending prompt information when the physical data of the first materialized view is of a non-partition table type, wherein the prompt information is used for adjusting the creating grammar of the materialized view.
The embodiment of the invention also provides electronic equipment, which comprises:
a memory for storing executable instructions;
And the processor is used for realizing the data query method of the preamble when the executable instructions stored in the memory are operated.
The embodiment of the invention also provides a computer readable storage medium which stores executable instructions, and is characterized in that the executable instructions realize the preamble data query method when being executed by a processor.
The embodiment of the invention has the following beneficial effects:
The embodiment of the invention obtains a first materialized view matched with the original structured query language instruction; rewrites the original structured query language instruction by using the metadata of the first materialized view to obtain a target structured query language instruction; when the physical data of the first materialized view is of a partition table type, performs predicate compensation on a second materialized view matched with the target structured query language instruction to obtain a third materialized view; and performs a data query in the third materialized view to obtain a query result of the original structured query language instruction. By performing predicate compensation on the materialized view, the performance of the computing engine and the performance of the materialized view are preserved while an accurate data query result is obtained.
Drawings
FIG. 1 is a schematic view of a usage environment of a data query method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a composition structure of a data query device according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of an alternative method for querying data according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of syntax parsing for creating a materialized view in an embodiment of the invention;
FIG. 5 is a diagram of a table-building statement according to an embodiment of the present invention;
FIG. 6 is a diagram of another table-building statement according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a table-building statement executed by a spark component according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating metadata information for a first materialized view according to an embodiment of the invention;
FIG. 9 is a diagram illustrating a modification instruction according to an embodiment of the present invention;
FIG. 10 is a diagram of an instruction for a materialized view increment processing statement in an embodiment of the invention;
FIG. 11 is a diagram illustrating metadata information of a second materialized view according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of SQL statements corresponding to a third materialized view according to an embodiment of the invention;
FIG. 13 is a schematic diagram of the Spark runtime architecture in cluster mode according to an embodiment of the present invention;
FIG. 14 is a schematic diagram of a Hive operating architecture according to an embodiment of the present invention;
FIG. 15 is a schematic process diagram of a data query method according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Before describing the embodiments of the present invention in further detail, the terms involved in the embodiments of the present invention are explained; the following definitions apply to these terms as used herein.
1) "In response to": used to indicate the condition or state on which a performed operation depends. When the condition or state on which it depends is satisfied, the operation or operations may be performed in real time or with a set delay; unless otherwise specified, there is no limitation on the execution order of multiple operations performed.
2) Terminal: including but not limited to a common terminal and a dedicated terminal, where the common terminal maintains a long connection and/or a short connection with a sending channel, and the dedicated terminal maintains a long connection with the sending channel.
3) Environment management: serves enterprise developer users and maintains the development environment from the developer's point of view; it provides container environment management and use, connects Kubernetes clusters, provides online and tool-side usage capability, provides management for developer users, and connects container clusters and kubeconfig through an environment pool.
4) Application management: an enterprise developer user packages the developed micro-services through the efficiency platform and manages the packages through helm; the efficiency platform manages, deploys and accesses services through the service management package to provide basic capabilities.
5) Predicate: one of the functions commonly used in SQL; a predicate is a function whose return value must satisfy a specific condition, namely that the return value is a truth value. For an ordinary function, the return value may be a number, a string, a date, etc., but the return value of a predicate is always a truth value (TRUE/FALSE/UNKNOWN). This is also the biggest difference between predicates and ordinary functions.
6) Materialized view: a database object that contains the result of a query; it may be a local copy of remote data or a summary table aggregated from data tables. A materialized view stores data based on remote tables and may also be referred to as a snapshot (similar to the static snapshot in MS SQL Server). For replication, materialized views allow read-only copies of remote data to be maintained locally. For data warehouses, the materialized views created are typically aggregate views, single-table aggregate views and join views.
Before introducing the data query method provided by the present application, the update defects of materialized views in the related art are first described. Referring to Table 1, database products in the related art generally provide materialized view functions but usually support only full updates; where incremental updates are supported, the mechanism is mainly based on Segment increments, i.e., a new data block generates new metadata.
TABLE 1
As shown in Table 1, many current products do not support incremental updates of materialized views, because supporting incremental updates requires ensuring the correctness of query results before and after the update. ClickHouse supports incremental and real-time updates but does not support transparent acceleration, because the materialized view may hold only part of the original data and the upper and lower bounds of that data are unknown, which can lead to inconsistent query results. Doris is similar to ClickHouse: its incremental logic also inserts a piece of data into the original table and updates the materialized view at the same time, and it supports transparent acceleration. MaxCompute, among others, supports only full materialized views. Kylin is based on the Cube theory and relies on dividing data into Segments according to a certain column to achieve increments, which is a very good approach: it guarantees consistency before and after a query and also supports merging of Segment data. The materialized view logic of Druid likewise rolls data up into individual Segments and supports increments, similar to the implementation of Kylin.
In the related art shown in Table 1: 1) products supporting incremental updates, such as ClickHouse and Doris, insert a piece of data into the original table and at the same time insert a piece of data into the materialized view table; this reduces the data processing efficiency of the database, and if the inserted data is wrong it cannot be deleted or rolled back, so a full update is required.
2) Products such as Kylin and Druid roll data up according to certain dimensions; a user can select a partition column to divide Segments, each Segment corresponds to one piece of metadata, and after the materialized view data of a Segment is generated, new Segment metadata is submitted. This ensures the consistency of data before and after a query, but because the update is forcibly designated, the performance of the materialized view is reduced.
To overcome the above defects, the present application provides a data query method which, by performing predicate compensation on materialized views, ensures the consistency of data queries on materialized view data before and after an incremental update, preserves the performance of the computing engine and of the materialized views, and obtains accurate data query results.
Fig. 1 is a schematic diagram of a usage scenario of the data query method provided by an embodiment of the present invention. Referring to Fig. 1, terminals (including a terminal 10-1 and a terminal 10-2) are provided with corresponding clients capable of executing different functions; through these clients, the terminals obtain data query results of a database from the corresponding server 200 through a network 300. The terminals are connected to the server 200 through the network 300, which may be a wide area network, a local area network, or a combination of the two, with data transmission implemented over wireless links. In the process of information interaction between a terminal and the network, the server 200 may perform data queries using different materialized views according to the different data query instructions of different terminals, so as to obtain corresponding data query results. Meanwhile, a user can edit the data stored in the database, thereby adding, deleting and adjusting the data.
In some embodiments of the present invention, the databases used in the server 200 may be written in software code environments of different programming languages, and the code objects may be different types of code entities. For example, in software code in the C language, a code object may be a function; in software code in the JAVA language, a code object may be a class; in Objective-C on iOS, a code object may be a piece of object code; and in software code in the C++ language, a code object may be a class or a function, so as to execute the request instructions from different terminals.
The following describes the structure of the data query device according to the embodiment of the present invention in detail, and the data query device may be implemented in various forms, such as a dedicated terminal with a processing function of the data query device, or may be a server provided with a processing function of the data query device, for example, the server 200 in fig. 1. Fig. 2 is a schematic diagram of a composition structure of a data query device according to an embodiment of the present invention, and it can be understood that fig. 2 only shows an exemplary structure of the data query device, but not all the structures, and some or all of the structures shown in fig. 2 may be implemented as required.
The data query device provided by the embodiment of the invention comprises: at least one processor 201, a memory 202, a user interface 203, and at least one network interface 204. The various components in the data querying device are coupled together by a bus system 205. It is understood that the bus system 205 is used to enable connected communications between these components. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 205 in fig. 2.
The user interface 203 may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc.
It will be appreciated that the memory 202 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The memory 202 in embodiments of the present invention is capable of storing data to support operation of the terminal (e.g., 10-1). Examples of such data include: any computer program, such as an operating system and application programs, for operation on the terminal (e.g., 10-1). The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application may comprise various applications.
In some embodiments, the data query device provided in the embodiments of the present invention may be implemented by combining software and hardware. As an example, the data query device provided in the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the data query method provided in the embodiments of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.
As an example of implementation of the data query device provided by the embodiment of the present invention by combining software and hardware, the data query device provided by the embodiment of the present invention may be directly embodied as a combination of software modules executed by the processor 201, the software modules may be located in a storage medium, the storage medium is located in the memory 202, and the processor 201 reads executable instructions included in the software modules in the memory 202, and performs the data query method provided by the embodiment of the present invention in combination with necessary hardware (including, for example, the processor 201 and other components connected to the bus 205).
By way of example, the processor 201 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor (for example a microprocessor or any conventional processor), a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like.
As an example of a hardware implementation of the data query device provided by the embodiment of the present invention, the device provided by the embodiment of the present invention may be directly implemented by the processor 201 in the form of a hardware decoding processor, for example one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components, to implement the data query method provided by the embodiment of the present invention.
The memory 202 in embodiments of the present invention is used to store various types of data to support the operation of the data query device. Examples of such data include any executable instructions for operation on the data query device; a program implementing the data query method of an embodiment of the present invention may be contained in the executable instructions.
In other embodiments, the data query device provided in the embodiments of the present invention may be implemented in software. Fig. 2 shows the data query device stored in the memory 202, which may be software in the form of a program, a plug-in, or the like, and includes the following software modules: an information transmission module 2081 and an information processing module 2082. When the software modules in the data query device are read into RAM by the processor 201 and executed, the data query method provided by the embodiment of the present invention is implemented, where the functions of each software module in the data query device include:
information transmission device 2081 is configured to receive an original structured query language instruction.
Information processing device 2082 is configured to obtain a first materialized view that matches the original structured query language instruction.
The information processing device 2082 is configured to rewrite the original structured query language instruction by using metadata of the first materialized view to obtain a target structured query language instruction.
The information processing device 2082 is configured to perform predicate compensation on the second materialized view that is matched with the target structured query language instruction when the physical data of the first materialized view is of a partition table type, so as to obtain a third materialized view.
The information processing device 2082 is configured to perform data query in the third materialized view, so as to obtain a query result of the original structured query language instruction.
According to the electronic device shown in fig. 2, in one aspect of the application, the application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in various alternative implementations of the data querying method described above.
The data query method provided by the embodiment of the present invention is described with reference to the data query device shown in Fig. 2. In the process of a data query, an incremental update operation may occur on a source table; the essence of the incremental update is to obtain the changes (additions, deletions, modifications) to the data in the source table and then synchronize those changes to a target table. To achieve consistency of the data query results before and after the incremental update and to ensure the accuracy of the data query results, refer to Fig. 3; Fig. 3 is an optional flowchart of the data query method provided by an embodiment of the present invention. It is to be understood that the steps shown in Fig. 3 may be performed by various electronic devices running the data query device, for example a server with a data query function or a cloud server group. The server with the data query function may be the server 200 shown in Fig. 1, executing the corresponding software modules in the data query device shown in Fig. 2. The steps shown in Fig. 3 are described below.
Step 301: the data query device receives an original structured query language instruction.
Structured Query Language (SQL) is a database query and programming language used to access data and to query, update and manage relational database systems. An SQL statement is a language construct for operating on the database, and the original structured query language instruction of step 301 may perform any of the following operations: 1) query a single column: SELECT column_name FROM table_name; 2) query multiple columns: SELECT column_name1, column_name2 FROM table_name; 3) query all columns: SELECT * FROM table_name; 4) query distinct values: SELECT DISTINCT column_name FROM table_name; 5) query a limited result set: SELECT TOP n column_name FROM table_name.
In practice, in some embodiments of the invention, each SQL statement used in the database typically has a corresponding table to be queried and a corresponding column to be queried. Therefore, to improve system operation efficiency, after the SQL statement is acquired, the column queried by the SQL statement can be optimized. In one possible implementation, the predicates in the SQL statement can be used to determine which column or columns of the table the statement queries, and then whether an index exists for that column. When the column has no index, the user has not built an index for the column; at this time an index can be built for the column directly, or prompt information can be pushed to the user to build the index, where the prompt information may include the SQL statement used for building the index. For ease of distinction, the index-building prompt information may be referred to as the second prompt information. After pushing the index-building prompt information to the user, the index may be built for the column according to the user's instructions. For example, the index-building prompt information may include the SQL statement for building the index, and may also include options for confirming whether or not to build the index. When the user selects the option confirming the building of the index, the index can be built for the column directly; alternatively, the user can manually build the index according to the prompt information, or build the index for the column according to the SQL statement for building the index included in the prompt information. By indexing columns that are not indexed, data query efficiency can be improved.
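As an illustrative sketch of the index check described above (the table and column names below are hypothetical and not taken from the embodiment), the second prompt information could carry a statement of the following form:

    -- Hypothetical query whose predicate filters on the employee.hire_date column.
    SELECT emp_id, emp_name
    FROM employee
    WHERE hire_date >= '20220501';

    -- If no index exists on hire_date, an index-building statement such as the
    -- following can be pushed to the user as the second prompt information, or
    -- executed directly once the user confirms the option to build the index.
    CREATE INDEX idx_employee_hire_date ON employee (hire_date);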
Step 302: the data query device obtains a first materialized view that matches the original structured query language instruction.
The materialized view used in step 302 stores data based on remote tables; a materialized view created by a data warehouse is an aggregate view and may be updated periodically. The embodiment of the application trades space for time: materialized views are generated for data tables with large data volumes and time-consuming processing, so that part of the query requests can directly hit the materialized view to obtain the query result, thereby further improving query efficiency.
At the same time, materialized views can improve query performance by orders of magnitude by avoiding recomputing expensive query operations (e.g., join, sort, etc.). A materialized view is a database object that contains the result of a query. For example, it may be a local copy of remote data, or it may be a subset of the rows and/or columns of a table or of a join result. A materialized view may also be a digest using an aggregation function, and an index may be built on any of its columns. Through the materialized view, the query result is cached as a specific materialized table, which can be updated from time to time from the original base tables, so that access is more efficient and the hardware resource consumption caused by querying is reduced.
In some embodiments of the present invention, in the database configuration stage, a syntax parser is required to parse the creation grammar of the materialized view and construct the first materialized view. The specific process includes: parsing the creation grammar of the materialized view through the syntax parser to obtain the increment supported by the materialized view and a logic plan syntax tree, where the creation grammar of the materialized view supports designating the materialized view table as a partition table; parsing the logic plan syntax tree to obtain the boundary of the materialized view increment; parsing the logic plan syntax tree to obtain the table-building element of the returned result; and creating a physical table using the table-building element and executing the creation grammar of the materialized view according to the boundary of the materialized view increment to obtain the first materialized view. Fig. 4 is a schematic diagram of syntax parsing for creating a materialized view in an embodiment of the present invention, and Fig. 5 and Fig. 6 are schematic diagrams of table-building statements in an embodiment of the present invention. After passing through the syntax parser, the value following PARTITIONED ON is determined to be the partition column, namely hire_date, in the format yyyyMMdd specified by FORMAT, and WITH AUTO INCREMENT is parsed to indicate that the materialized view supports increments. The SELECT statement following AS is then parsed into a logic plan syntax tree, and the parsing result of the logic plan syntax tree is used to obtain the upper and lower partition boundaries of the incremental materialized view, i.e., the partition range of hire_date is found to be (20220501, 20220505). At the same time, the table-building element Schema of the returned result (cd BIGINT, al_salary BIGINT, cnt BIGINT, locationid INT, hire_date STRING) is parsed from the syntax tree, forming the table-building statements shown in Fig. 5 and Fig. 6.
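Fig. 4 itself is not reproduced here; the following is a minimal sketch of a creation statement containing the parsed elements described above. The view name is taken from the physical table mentioned later, while the source table name and the exact keyword order are assumptions for illustration only:

    -- Sketch of the creation grammar with the elements parsed above: the partition
    -- column hire_date, the partition format yyyyMMdd, incremental support, and a
    -- SELECT whose result schema yields the table-building elements.
    CREATE MATERIALIZED VIEW oms.jhonyjmv_partition_10_spjg_105_increment
    PARTITIONED ON (hire_date) FORMAT 'yyyyMMdd'
    WITH AUTO INCREMENT
    AS
    SELECT cd, al_salary, cnt, locationid, hire_date
    FROM oms.some_source_table   -- hypothetical source table
    WHERE hire_date >= '20220501' AND hire_date < '20220505';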
After executing the table-building statements shown in Fig. 5 and Fig. 6, the generated materialized view is shown in Table 2:
TABLE 2
Query time range: 20220501, 20220505
In an embodiment of the present application, in addition to the syntax for creating a materialized view shown in FIG. 4, various options may be specified when creating a materialized view, including:
(1) Creation mode (BuildMethods): including two types of BUILD IMMEDIATE and BUILD DEFERRED.
BUILD IMMEDIATE means that the data is generated when the materialized view is created.
BUILD DEFERRED means that no data is generated at creation time; data is generated later as needed. For example, when querying the login information of an instant messaging client, BUILD IMMEDIATE can be used to generate the data when the materialized view is created, so as to record the login data completely; when a financial applet executes a transaction function, BUILD DEFERRED can be used so that data is generated according to the user's different usage requirements after the user triggers the funds transaction function.
(2) Query rewrite (QueryRewrite): including ENABLE QUERY REWRITE and DISABLE QUERY REWRITE.
ENABLE QUERY REWRITE and DISABLE QUERY REWRITE respectively indicate whether the created materialized view supports query rewrite. Query rewrite means that when a base table of a materialized view is queried, the Oracle database automatically judges whether the result can be obtained by querying the materialized view; if so, the aggregation or join operation is avoided and the data is read directly from the materialized view, which has already been computed.
(3) Refresh (Refresh): refers to when and in what way the materialized view is synchronized with the base table when the DML operation occurs on the base table. There are two modes of refresh: ON DEMAND and ON COMMIT.
The difference between ON DEMAND and ON COMMIT materialized views lies in the refresh method. ON DEMAND means the materialized view is refreshed when the user requires it, either manually through a method such as dbms_mview.refresh or on a scheduled JOB; refreshing updates the materialized view and ensures consistency with the base-table data. For example, when a financial applet executes a transaction function, the ON DEMAND instruction is used to update in response to a refresh instruction from the user. With ON COMMIT, as soon as a transaction on the base table is committed (COMMIT), the materialized view is refreshed immediately so that the data stays consistent with the base table; for example, when a financial applet executes a transaction function, once a user submits a transaction request the data is updated through the ON COMMIT instruction, ensuring timely processing of the data.
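The three kinds of options above can appear together in one creation statement; the following Oracle-style sketch is illustrative only, and the view, base table and column names are hypothetical:

    -- Illustrative combination of build mode, refresh mode and query-rewrite switch.
    CREATE MATERIALIZED VIEW mv_login_summary
    BUILD IMMEDIATE           -- generate the data at creation time
    REFRESH ON COMMIT         -- refresh as soon as a base-table transaction commits
    ENABLE QUERY REWRITE      -- allow transparent rewrite of base-table queries
    AS
    SELECT user_id, COUNT(*) AS login_cnt
    FROM login_log
    GROUP BY user_id;

With ENABLE QUERY REWRITE in place, a later query against login_log that groups by user_id can be answered from mv_login_summary without re-aggregating the base table.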
After the first materialized view is obtained by executing steps 301-302, the method can be carried out using a spark component: the metadata of the first materialized view is stored in a database, and the stored metadata can support data queries and the predicate compensation processing provided by the data query method.
Referring to Fig. 7, Fig. 7 is a schematic diagram of a table-building statement executed by a spark component in an embodiment of the present invention, and Fig. 8 is a schematic diagram of metadata information of the first materialized view in an embodiment of the present invention. When the physical table is created using the table-building element and the creation grammar of the materialized view is executed according to the boundary of the materialized view increment to obtain the first materialized view, the corresponding physical table is created by connecting to the Hive framework through JDBC, and finally the statement shown in Fig. 7 is executed by the spark component. There are five ways to establish the JDBC connection:
1) directly instantiating a Driver; 2) instantiating the Driver class through reflection; 3) using the DriverManager instead of the Driver interface; 4) using the MySQL driver implementation class to register the driver automatically and directly calling a static method of the driver to connect; 5) declaring the four basic pieces of connection information in a configuration file and reading the configuration file to connect. The first materialized view data is written into the oms.jhonyjmv_partition_10_spjg_105_increment table shown in Fig. 7, where the partition range recorded in the view_expanded_text field is (20220501, 20220505). At this time, as shown in Fig. 8, the metadata information of the first materialized view is stored in the database.
Step 303: the data query device rewrites the original structured query language instruction by utilizing the metadata of the first materialized view to obtain the target structured query language instruction.
In some embodiments of the present invention, since the original SQL instruction, after being parsed and verified, needs to have its execution plan optimized by the query optimizer component to form the target SQL statement, a user's modification instruction in the SQL statement can be utilized in this process to obtain the second materialized view, which specifically includes:
acquiring a modification instruction of the structured query language; parsing the modification instruction of the structured query language to obtain the increment boundary of the second materialized view; and replacing the increment boundary of the first materialized view with the increment boundary of the second materialized view to obtain the second materialized view. Fig. 9 is a schematic diagram of a modification instruction in an embodiment of the invention; in the modification instruction shown in Fig. 9, the processed join instruction carries the corresponding increment boundary (20220505, 20220510), which indicates that the time range of the user's query is 20220505-20220510.
Fig. 10 is a schematic diagram of a materialized view increment processing statement in an embodiment of the present application, and Fig. 11 is a schematic diagram of metadata information of the second materialized view in an embodiment of the present application. Specifically, the syntax parsing engine parses out the increment boundary (20220505, 20220510) carried in Fig. 9, replaces the increment boundary of the SQL in the view_expanded_text field of the first materialized view, and executes the statement shown in Fig. 10 through the spark component to perform the materialized view increment. For a data table in the present application, the field according to which a join operation is performed on each base table is the join field; for example, the join fields in the materialized view increment shown in Fig. 10 are oms-parts-10 and omsdepts-parts-10.
In the process shown in Fig. 11, the partition range of the SQL in the view_expanded_text field becomes (20220501, 20220510), the materialized view increment is completed, a new materialized view is formed, and the data in the oms.jhonyjmv_part_10_spjg_105_increment table also changes to data from May 1 to May 9.
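The statement executed in Fig. 10 is not reproduced here; assuming the increment is materialized as an append of only the new boundary to the physical table, a minimal sketch of that step would be:

    -- Assumed shape of the incremental load: only the new boundary (20220505,
    -- 20220510) is computed and appended, after which the partition range recorded
    -- in view_expanded_text becomes (20220501, 20220510).
    INSERT INTO oms.jhonyjmv_partition_10_spjg_105_increment
    SELECT cd, al_salary, cnt, locationid, hire_date
    FROM oms.some_source_table   -- hypothetical source table
    WHERE hire_date >= '20220505' AND hire_date < '20220510';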
In some embodiments of the present invention, the physical data of the first materialized view needs to be detected before step 304 is performed: when the physical data of the first materialized view is of a partition table type, a predicate compensation process is triggered, where the predicate compensation process is used for performing predicate compensation on the second materialized view; when the physical data of the first materialized view is of a non-partition-table type, prompt information is sent, where the prompt information is used for adjusting the creation grammar of the materialized view. In combination with the embodiment shown in the foregoing Fig. 7 and Fig. 8, when data is queried, if data from May 5 to May 9 has been written while the SQL statement matches only the materialized view data from May 1 to May 4, the query engine directly executes the following SQL statement:
SELECT * FROM oms.jhonyjmv_part_10_spjg_105_increment. All data in the table is then returned, but the oms.jhonyjmv_part_10_spjg_105_increment table contains data from May 5 to May 9, so the extra data causes the query to fail. At this point, if the physical data of the first materialized view were of a non-partition-table type, hire_date would need to be recalculated, and if a large number of other predicates (i.e., WHERE conditions) existed, the first materialized view would require a large number of repeated calculations, seriously affecting the query efficiency of the materialized view. Therefore, it is necessary to determine whether the physical data of the first materialized view is of the partition table type.
Furthermore, for complex query operations, if the join condition on the data tables is an equality predicate, a join column of the null-generating operand may be added to the output of the materialized view, or a non-null column may be added to the null-generating operand. An equality predicate means that columns are compared for having the same value. For example, if the first table T (with an integer column t1) and the second table S (with an integer column t2) each have a single column of integer data type, the equality predicate is "T.t1 = S.t2", meaning that the t1 column of a row of T is compared with the t2 column of a row of S, and the corresponding query result may be presented to the user according to the comparison result.
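Expressed in SQL, the comparison described above is a plain equi-join (the table and column names follow the example in the text):

    -- Equality predicate joining the integer column of T with that of S; rows are
    -- compared on t1 = t2 and only the matching pairs are kept in the result.
    SELECT T.t1, S.t2
    FROM T
    JOIN S ON T.t1 = S.t2;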
Step 304: and when the physical data of the first materialized view is of the partition table type, the data query device performs predicate compensation on the second materialized view matched with the target structured query language instruction to obtain a third materialized view.
Before explaining the processing procedure of step 304, the concepts of the partition table and the predicate are explained first. A partition table is a large logical table formed from several tables with identical physical table structures through a certain algorithm. This algorithm is called the "partition function"; the partition function types supported by the current MySQL database are RANGE, LIST, HASH, KEY and COLUMNS. Whichever partition function is selected, the relevant columns are designated as the input of the partitioning algorithm, and these columns are called "partition columns".
In some embodiments of the present invention, a scenario requires deleting the historical data of a certain year. When an ordinary table is used, the historical data is deleted with DELETE … FROM … WHERE …; this SQL executes quite slowly and generates a large number of binary logs, which also causes master-slave replication delay problems on the production system.
In a MySQL partition table, the primary key of the partition table must include the partition function column, and a unique index must also include the partition function column. The corresponding cleanup SQL executes very fast because the actual execution deletes and reconstructs the partition file; in addition, only one DDL log entry is generated and no master-slave replication delay is caused, so the partition table can be used to process data efficiently. A sketch of this contrast is given below.
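A minimal MySQL-style sketch of the contrast drawn above; the table, column and partition names are hypothetical:

    -- Ordinary table: deleting one year of history scans rows and writes a large
    -- amount of binary log, which is slow and can delay replication.
    DELETE FROM order_history WHERE order_year = 2020;

    -- Range-partitioned table: the primary key includes the partition column, and
    -- the same cleanup becomes a metadata operation that rebuilds one partition
    -- file and produces a single DDL log entry.
    CREATE TABLE order_history_p (
      id BIGINT NOT NULL,
      order_year INT NOT NULL,
      PRIMARY KEY (id, order_year)   -- must include the partition function column
    )
    PARTITION BY RANGE (order_year) (
      PARTITION p2020 VALUES LESS THAN (2021),
      PARTITION p2021 VALUES LESS THAN (2022),
      PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    ALTER TABLE order_history_p TRUNCATE PARTITION p2020;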
A predicate, one of the functions commonly used in SQL, is a function whose return value must satisfy a specific condition, namely that the return value is a truth value. For an ordinary function, the return value may be a number, a string, a date, etc., but the return value of a predicate is always a truth value (TRUE/FALSE/UNKNOWN). This is also the biggest difference between predicates and ordinary functions.
In the present application, predicates include but are not limited to: the LIKE predicate, for partial-match queries on strings; the BETWEEN predicate, for range queries; the IS NULL and IS NOT NULL predicates, for determining whether a value is NULL; and the IN predicate, a simple shorthand for OR. Short illustrations of each are given below.
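Illustrations of each predicate family listed above; the table and column names are hypothetical:

    -- LIKE: partial (pattern) match on a string column.
    SELECT * FROM employee WHERE emp_name LIKE 'A%';
    -- BETWEEN: range query, inclusive of both bounds.
    SELECT * FROM employee WHERE hire_date BETWEEN '20220501' AND '20220504';
    -- IS NULL / IS NOT NULL: test for missing values.
    SELECT * FROM employee WHERE locationid IS NOT NULL;
    -- IN: shorthand for a chain of OR conditions.
    SELECT * FROM employee WHERE locationid IN (101, 102, 105);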
To ensure the accuracy of predicate compensation, in some embodiments of the present invention the materialized view must contain all of the rows needed by a query, i.e., the range of the view's predicate must be greater than or equal to the range of the query. Using Wq to represent the predicate of the query and Wv to represent the predicate of the view, it is necessary to guarantee Wq => Wv, where => means that Wq satisfies Wv. By ensuring that the range of the view's predicate is greater than or equal to the range of the query, the accuracy of predicate compensation is checked and the success rate of the query operation can be further improved.
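The containment condition Wq => Wv can be read off directly from the two predicates; the following example over the hire_date partition column is illustrative only:

    -- View predicate Wv: the materialized view covers hire_date in [20220501, 20220510).
    --   Wv: hire_date >= '20220501' AND hire_date < '20220510'
    -- The query predicate Wq below accepts only rows that Wv also accepts, so Wq => Wv
    -- holds and the query can be answered entirely from the materialized view.
    SELECT cd, cnt
    FROM oms.jhonyjmv_partition_10_spjg_105_increment
    WHERE hire_date >= '20220502' AND hire_date < '20220504';   -- Wq, contained in Wv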
When performing predicate compensation in step 304, the partition field of the physical data of the first materialized view and the predicate type may first be determined; the range of the first boundary predicate of the partition field is then compensated using the predicate type to obtain the range of the second boundary predicate; finally, the second materialized view is adjusted according to the range of the second boundary predicate to obtain the third materialized view. Specifically, because the partition table oms.jhonyjmv_partition_10_spjg_105_increment has already been created through the processing of the foregoing steps 301 to 304, the physical data of the first materialized view is determined to be of the partition table type. Upon receiving an SQL statement, the dynamic data management framework Calcite iteratively optimizes and rewrites the logic plan through the Volcano Planner (Calcite's Volcano Planner is also rule- and cost-model-based, so the addRule method is typically invoked to add a series of rules to the planner).
According to the data query method provided by the application, the physical table associated with the materialized view is read during the perform function; the table is judged to be a partition table, and the partition field read from the partition table is hire_date. At this point, when the matched materialized view rule rewrites the user's logic plan, the upper and lower bound predicates of hire_date must be compensated, so WHERE hire_date >= '20220501' AND hire_date < '20220505' is added to the SQL statement to obtain the third materialized view. Fig. 12 is a schematic diagram of the SQL statement corresponding to the third materialized view in an embodiment of the application; as shown in Fig. 12, the SQL statement includes: WHERE hire_date >= '20220501' AND hire_date < '20220505'.
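Putting the compensation together, the rewritten statement of Fig. 12 has roughly the following shape; the projected columns are assumptions, while the appended WHERE clause is the compensation described above:

    -- Target statement after materialized-view rewrite plus predicate compensation:
    -- the upper and lower bounds of the hire_date partition are appended so that rows
    -- outside the range covered by the first materialized view cannot leak into the result.
    SELECT cd, al_salary, cnt, locationid, hire_date
    FROM oms.jhonyjmv_partition_10_spjg_105_increment
    WHERE hire_date >= '20220501' AND hire_date < '20220505';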
Step 305: and the data query device performs data query in the third materialized view to obtain a query result of the original structured query language instruction.
Because the data processing system may process a large number of SQL statements at the same time, when doing so it may first receive the job data to be processed and submit the job data to be processed to the cluster resource manager, where the job data to be processed includes at least two original structured query language instructions; the cluster resource manager triggers the corresponding components according to the job data to be processed and converts the original structured query language instructions in the job data to be processed into tasks matched with the target computing engine; and the target computing engine performs data queries using the third materialized view set corresponding to the job data to be processed to obtain the query result of the job data to be processed, where the third materialized view set includes at least two third materialized views.
The computing engine Spark and the data warehouse tool Hive used in the present application are described below. Referring to Fig. 13, Fig. 13 is a schematic diagram of the runtime architecture of Spark in cluster mode in the present application; in the related art, only the Hive on Spark framework is supported to run on an open-source resource scheduling platform. The Cluster Manager may be an open-source resource scheduling platform such as YARN, Mesos or Kubernetes; Spark itself already supports these open-source platforms, i.e., the protocols between Spark and the Cluster Manager components are compatible. Driver is the job driver, Worker Node is the work node, Executor is the task execution component, and Task is the smallest execution unit.
Further, the structured data package (Spark SQL) is the package Spark uses to manipulate structured data; through it, the data can be queried using the SQL language, and it supports a variety of data sources such as data warehouse tool (Hive) tables. The streaming component is a component provided by Spark for stream processing of real-time data, providing an application programming interface (API) for manipulating data streams. The machine learning library provides commonly used machine learning functions, including classification, regression, clustering, collaborative filtering and the like, and also provides additional support functions such as model evaluation and data import. The graph computation toolset is a collection of algorithms and tools for controlling graphs and for parallel graph operations and computation.
With continued reference to Fig. 14, Fig. 14 is a schematic diagram of the Hive runtime architecture according to an embodiment of the present invention; the Hive runtime architecture is divided into an interface layer, a service layer, a computing layer, a scheduling layer and a storage layer. The interface layer is used for users to submit a Hive job, i.e., an HQL statement, through the Hive Web UI web interface, the JDBC/ODBC interface or the Hive CLI command line. The service layer is used for parsing the HQL statement into a MapReduce or Spark task, where HiveServer is used for receiving the requests sent over JDBC/ODBC; Hive Driver is the driver that converts HQL into MapReduce or Spark tasks through the three steps of compiling, optimizing and executing; and MetaStore is the metadata store that stores, among other things, the mapping relationship between Hive base tables and HDFS data. The computing layer runs the parsed Hive tasks on the distributed nodes through the corresponding computing engines, which include MapReduce, Spark, Tez and the like; the Spark computing engine is mainly involved here. The scheduling layer is used for allocating computing nodes and resources to computing tasks, and YARN is an open-source resource scheduling platform. The storage layer is used to store Hive data, and a common distributed storage system is HDFS.
The embodiment of the invention can be realized by combining Cloud technology, wherein Cloud technology (Cloud technology) refers to a hosting technology for integrating hardware, software, network and other series resources in a wide area network or a local area network to realize calculation, storage, processing and sharing of data, and can also be understood as the general term of network technology, information technology, integration technology, management platform technology, application technology and the like applied based on a Cloud computing business model. Background services of technical network systems require a large amount of computing and storage resources, such as video websites, picture websites and more portal websites, so cloud technologies need to be supported by cloud computing.
It should be noted that cloud computing is a computing mode that distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can acquire computing power, storage space and information services as required. The network that provides the resources is referred to as the "cloud". From the user's point of view, resources in the cloud are infinitely expandable and can be acquired at any time, used as needed, expanded at any time and paid for according to use. As a basic capability provider of cloud computing, a cloud computing resource pool platform, referred to as a cloud platform for short and generally called Infrastructure as a Service (IaaS), is established, and multiple types of virtual resources are deployed in the resource pool for external clients to select and use. The cloud computing resource pool mainly includes: computing devices (which may be virtualized machines, including operating systems), storage devices and network devices. When a user uses the cloud server to store data or deploys different application processes, the data in different databases can be queried and adjusted in time.
Cloud storage is a new concept that extends and develops from the concept of cloud computing. A distributed cloud storage system (hereinafter referred to as a storage system for short) refers to a storage system that, through functions such as cluster application, grid technology and a distributed storage file system, integrates a large number of storage devices of various types in a network (storage devices are also referred to as storage nodes) to work cooperatively through application software or application interfaces, so as to provide data storage and service access functions externally. At present, the storage method of the storage system is as follows: when a logical volume is created, each logical volume is allocated a physical storage space, which may be composed of the disks of one storage device or of several storage devices. The client stores data on a certain logical volume, that is, the data is stored on a file system; the file system divides the data into a plurality of parts, each part being an object, and an object contains not only the data but also additional information such as a data identification (ID). The file system writes each object into the physical storage space of the logical volume and records the storage position information of each object, so that when the client requests access to the data, the file system can let the client access the data according to the storage position information of each object. The process by which the storage system allocates physical storage space for the logical volume is specifically as follows: the physical storage space is divided into stripes in advance according to the estimated capacity of the objects to be stored on the logical volume (this estimate tends to have a large margin with respect to the capacity of the objects actually stored) and the redundant array of independent disks (RAID) configuration, and a logical volume can be understood as a stripe, whereby physical storage space is allocated for the logical volume.
When the method is applied to cloud products, the front end of the cloud product may be a Web UI component used for receiving the Spark-related parameters filled in by users and generating the job data to be processed according to these parameters, wherein the job data to be processed comprises at least two original structured query language instructions. The Cluster Manager may be an open-source cluster resource scheduling platform such as YARN, Mesos or Kubernetes; Spark itself already supports these open-source platforms, i.e. the protocols between Spark and the Cluster Manager component are compatible. Driver is the job driver, Worker Node is a work node, Executor is the task execution component, and Task is the smallest execution unit.
In some embodiments of the present invention, triggering the corresponding components according to the job data to be processed and converting the original structured query language instruction in the job data into a task matching the target computing engine may be achieved in the following manner:

the cluster resource manager triggers node managers in different service clusters according to the job data to be processed; the triggered node manager starts the job manager of the resource scheduling system, and the job manager of the resource scheduling system converts the original structured query language instruction in the job data to be processed into a task matched with the corresponding computing engine; the job manager of the computing engine is then triggered by the job manager of the resource scheduling system. In this way, the cluster resource manager can be used to speed up the processing of batches of SQL statements and save the user's data query time.
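For illustration only, the job data to be processed could be a small batch of original structured query language instructions such as the following; the statement contents, table names and column names are assumptions made for this sketch and are not defined by the embodiment.

-- Hypothetical job data containing two original structured query language instructions.
-- The tables demo.t_account and demo.t_employee and their columns are illustrative assumptions.
SELECT user_id, SUM(balance) AS total_balance
FROM   demo.t_account
WHERE  month >= '202201' AND month < '202203'
GROUP  BY user_id;

SELECT dept_id, COUNT(*) AS emp_cnt
FROM   demo.t_employee
WHERE  hire_date >= '20220501' AND hire_date < '20220505'
GROUP  BY dept_id;

Each statement in such a batch would be matched against its own materialized view and rewritten independently before being converted into tasks for the target computing engine.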
In order to better illustrate the working process of the data query method provided by the present application, the data query method is described below by taking a multi-user account balance query as an example, referring to Table 2.
TABLE 2
Suppose the data range of the current materialized view is 202201 to 202202 and the data of 202203 is being written incrementally. If user 30000 queries the data of 202201 to 202202 while the incremental job has already written the 202203 data into the materialized view table but the metadata has not been updated in time, the query engine will also scan the 202203 data and the query result will be abnormal. Similarly, for user 30002 querying 20220505 through 20220510, the query may also fail if the data of 20220501 through 20220502 has been written into the materialized view table.
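To make this scenario more concrete, the following sketch shows what a materialized view whose physical data is a partition table might look like for such a dataset. The Hive-style CREATE MATERIALIZED VIEW ... PARTITIONED ON syntax and all names used here are assumptions for illustration; the exact creation grammar supported by the embodiment is not reproduced in this section.

-- Hypothetical partition-table materialized view over an assumed base table demo.t_employee.
-- PARTITIONED ON (hire_date) indicates that the physical data of the view is a partition table.
CREATE MATERIALIZED VIEW mv_employee_summary
PARTITIONED ON (hire_date)
AS
SELECT dept_id,
       COUNT(*) AS emp_cnt,
       hire_date
FROM   demo.t_employee
GROUP  BY dept_id, hire_date;

When an incremental job appends new partitions to such a view while its metadata has not yet been refreshed, the boundary predicate compensation described below keeps queries from hitting the partially loaded partitions.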
Referring to fig. 15, fig. 15 is a schematic process diagram of a data query method according to an embodiment of the present invention, which specifically includes the following steps:
Step 1501: receiving an original structured query language instruction;
Step 1502: acquiring a first materialized view matched with the original structured query language instruction;
Step 1503: rewriting the original structured query language instruction by utilizing the metadata of the first materialized view to obtain a target structured query language instruction;
Step 1504: when the physical data of the first materialized view is of a partition table type, compensating the range of the first boundary predicate of the partition field by utilizing the predicate type to obtain the range of the second boundary predicate;
Step 1505: adjusting the second materialized view according to the range of the second boundary predicate to obtain a third materialized view;
Step 1506: carrying out data query in the third materialized view to obtain and return the query result of the original structured query language instruction.
At this point, for the query instruction of user 30002, the condition WHERE hire_date >= '20220501' AND hire_date < '20220505' is added to the user's SQL. In this way the materialized view is still updated, and predicate compensation is performed on the rewritten SQL, so that only the data of the partition column hire_date in the range 20220501 to 20220504 in the materialized view is hit, which ensures the accuracy of the query.
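For illustration, the rewritten and predicate-compensated query for user 30002 might take the following form; the materialized view name and columns continue the hypothetical mv_employee_summary sketch above, and only the added boundary condition is taken from the description.

-- Hypothetical target SQL after rewriting against the materialized view and
-- compensating the boundary predicate on the partition column hire_date.
SELECT dept_id,
       SUM(emp_cnt) AS emp_cnt
FROM   mv_employee_summary
WHERE  hire_date >= '20220501' AND hire_date < '20220505'
GROUP  BY dept_id;

Even if the incremental job has already written data beyond 20220504 into the materialized view table, the compensated boundary predicate prevents that data from being scanned, so the query result stays consistent with the user's original request.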
The beneficial technical effects are as follows:
The embodiment of the invention acquires a first materialized view matched with an original structured query language instruction; rewrites the original structured query language instruction by utilizing the metadata of the first materialized view to obtain a target structured query language instruction; when the physical data of the first materialized view is of a partition table type, performs predicate compensation on the second materialized view matched with the target structured query language instruction to obtain a third materialized view; and carries out data query in the third materialized view to obtain the query result of the original structured query language instruction. By performing predicate compensation on the materialized view, the performance of the computing engine and of the materialized view is ensured and an accurate data query result is obtained.
The above embodiments are merely examples of the present invention, and are not intended to limit the scope of the present invention, so any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (11)

1. A method of querying data, the method comprising:
Receiving an original structured query language instruction;
acquiring a first materialized view matched with the original structured query language instruction;
Utilizing the metadata of the first materialized view to rewrite the original structured query language instruction to obtain a target structured query language instruction;
When the physical data of the first materialized view is of a partition table type, predicate compensation is carried out on a second materialized view matched with the target structured query language instruction, and a third materialized view is obtained;
and carrying out data query in the third materialized view to obtain a query result of the original structured query language instruction.
2. The method according to claim 1, wherein the method further comprises:
Analyzing the creating grammar of the materialized view through a grammar analyzer to obtain an increment supported by the materialized view and a logic plan grammar tree, wherein the creating grammar of the materialized view supports designating the materialized view table as a partition table;
analyzing the logic plan grammar tree to obtain a boundary of a materialized view increment;
analyzing the logic plan grammar tree to obtain a table building element of a return result;
and creating a physical table by using the table creating element, and executing the creating grammar of the materialized view according to the boundary of the materialized view increment to obtain the first materialized view.
3. The method of claim 1, wherein when the physical data of the first materialized view is of a partition table type, performing predicate compensation on a second materialized view matched with the target structured query language instruction to obtain a third materialized view, comprising:
determining a partition field and a predicate type of physical data of the first materialized view;
compensating the range of the first boundary predicate of the partition field by utilizing the predicate type to obtain the range of the second boundary predicate;
And adjusting the second materialized view according to the range of the second boundary predicate to obtain the third materialized view.
4. The method according to claim 1, wherein the method further comprises:
acquiring a modification instruction of a structured query language;
analyzing the modification instruction of the structured query language to obtain an increment boundary of the second materialized view;
And replacing the increment boundary of the first materialized view by the increment boundary of the second materialized view to obtain the second materialized view.
5. The method according to claim 1, wherein the method further comprises:
receiving job data to be processed, and submitting the job data to be processed to a cluster resource manager, wherein the job data to be processed comprises at least two original structured query language instructions;
Triggering corresponding components according to the job data to be processed through the cluster resource manager, and converting an original structured query language instruction in the job data to be processed into a task matched with a target computing engine;
And carrying out data query by the target calculation engine by utilizing a third materialized view set corresponding to the to-be-processed job data to obtain a query result of the to-be-processed job data, wherein the third materialized view set comprises at least two third materialized views.
6. The method of claim 5, wherein the step of converting, by the cluster resource manager, the original structured query language instruction in the job data to a task matching the target compute engine by triggering a corresponding component based on the job data to be processed comprises:
triggering a node manager in different service clusters according to the job data to be processed through the cluster resource manager;
Starting a job manager of the resource scheduling system through the triggered node manager, and converting an original structured query language instruction in the job data to be processed into a task matched with a corresponding computing engine through the job manager of the resource scheduling system;
triggering a job manager of the compute engine by the job manager of the resource scheduling system.
7. The method according to claim 1, wherein the method further comprises:
detecting physical data of the first materialized view;
Triggering a predicate compensation process when the physical data of the first materialized view is of a partition table type, wherein the predicate compensation process is used for performing predicate compensation on the second materialized view;
And when the physical data of the first materialized view is of a non-partition table type, sending prompt information, wherein the prompt information is used for adjusting the creating grammar of the materialized view.
8. A data querying device, the device comprising:
the information transmission device is used for receiving the original structured query language instruction;
the information processing device is used for acquiring a first materialized view matched with the original structured query language instruction;
the information processing device is used for utilizing the metadata of the first materialized view to rewrite the original structured query language instruction to obtain a target structured query language instruction;
The information processing device is used for performing predicate compensation on the second materialized view matched with the target structured query language instruction when the physical data of the first materialized view is of a partition table type, so as to obtain a third materialized view;
The information processing device is used for carrying out data query in the third materialized view to obtain a query result of the original structured query language instruction.
9. An electronic device, the electronic device comprising:
a memory for storing executable instructions;
a processor for implementing the data query method of any one of claims 1 to 7 when executing executable instructions stored in said memory.
10. A software program, characterized in that the software program comprises:
a memory for storing executable instructions;
a processor for implementing the data query method of any one of claims 1 to 7 when executing executable instructions stored in said memory.
11. A computer readable storage medium storing executable instructions which when executed by a processor implement the data query method of any one of claims 1 to 7.