CN112817989B - Data processing method, data processing device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN112817989B
Authority
CN
China
Prior art keywords
data
operation information
query engine
metadata
cache
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110088454.9A
Other languages
Chinese (zh)
Other versions
CN112817989A (en)
Inventor
汪源
余利华
蒋鸿翔
郭忆
温正湖
汪胜
王刚
李继业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202110088454.9A
Publication of CN112817989A
Application granted
Publication of CN112817989B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2358 Change logging, detection, and notification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the disclosure relate to a data processing method, a data processing device, a storage medium and electronic equipment in the technical field of data processing. The data processing method comprises: acquiring data operation information for one or more data groups in a data warehouse; and writing the data operation information of each data group that meets a data synchronization condition into a log library, so that a data query engine updates its cache data by acquiring the data operation information from the log library. The cache data of the data query engine is thereby updated automatically, and the data warehouse information in the cache is kept consistent with the actual information in the data warehouse, which ensures the accuracy of the results of data query tasks that the data query engine executes according to the cached data warehouse information.

Description

Data processing method, data processing device, storage medium and electronic equipment
Technical Field
Embodiments of the present disclosure relate to the field of data processing technology, and more particularly, to a data processing method, a data processing apparatus, a computer readable storage medium, and an electronic device.
Background
This section is intended to provide a background or context for the embodiments of the disclosure recited in the claims, which description herein is not admitted to be prior art by inclusion in this section.
With the advent of the big data era, many enterprises and organizations have deployed specialized data query engines, such as Impala in the Apache Hadoop ecosystem, to process and analyze data efficiently. These data query engines are typically deployed independently of the data warehouse, are more flexible than the query functions of the data warehouse itself, and can perform better in some respects; for example, Impala is suited to fast, real-time queries for medium-scale tasks.
Some related data query engines store information about the data warehouse. For example, Impala caches the metadata of the data tables in the data warehouse, so that the metadata required by a query task can be obtained locally, which improves query efficiency.
Disclosure of Invention
However, in the related data query engines, the stored data warehouse information cannot be updated effectively, so the data warehouse information stored in the data query engine falls out of sync with the actual data warehouse information, which in turn causes incorrect data query results.
Therefore, there is a strong need for a data processing method that keeps the information in the data query engine consistent with the data warehouse, thereby ensuring the accuracy of data query results.
In this context, embodiments of the present disclosure are intended to provide a data processing method, a data processing apparatus, a computer-readable storage medium, and an electronic device.
According to a first aspect of embodiments of the present disclosure, there is provided a data processing method, including: acquiring data operation information for one or more data groups in a data warehouse; writing the data operation information of the data group meeting the data synchronization condition into a log library so that a data query engine updates cache data of the data query engine by acquiring the data operation information in the log library.
In one embodiment of the present disclosure, writing the data operation information of the data group satisfying the data synchronization condition into the log library includes: acquiring synchronization attribute parameters of the one or more data groups; and writing the data operation information of each data group whose synchronization attribute parameter is a preset value into the log library.
In one embodiment of the present disclosure, the method further comprises: acquiring attribute operation information for the synchronization attribute parameter; and writing the attribute operation information into the log library so that the data query engine obtains the attribute operation information from the log library.
In one embodiment of the disclosure, writing the attribute operation information into the log library includes: when the attribute operation information indicates that the synchronization attribute parameter is changed from a non-preset value to the preset value, writing the attribute operation information into the log library.
In one embodiment of the present disclosure, writing the data operation information of the data group satisfying the data synchronization condition into the log library includes: writing the data operation information of each data group that is on a whitelist into the log library.
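As an illustration only, the write-side filtering described in the embodiments above could be sketched in Python as follows. The names `DataOperation`, `should_sync` and `write_to_log`, the preset value `"true"`, and the use of a plain list as the log library are all assumptions made for this sketch; combining the attribute check and the whitelist check with a logical OR is likewise an assumption, since the text presents them as separate embodiments.

```python
from dataclasses import dataclass

# Hypothetical preset value of the synchronization attribute parameter.
PRESET_VALUE = "true"


@dataclass(frozen=True)
class DataOperation:
    group_id: str   # identifier of the data group (for example, a table)
    op_type: str    # operation type, for example "ADD_PARTITION"


def should_sync(group_id, sync_attrs, whitelist):
    """Return True if the data group meets the (assumed) synchronization condition:
    its synchronization attribute equals the preset value, or it is whitelisted."""
    return sync_attrs.get(group_id) == PRESET_VALUE or group_id in whitelist


def write_to_log(ops, sync_attrs, whitelist, log_library):
    """Append to the log library only the operations of qualifying data groups."""
    for op in ops:
        if should_sync(op.group_id, sync_attrs, whitelist):
            log_library.append(op)
    return log_library
```

Operations on data groups that meet neither condition never reach the log library, which is what keeps the stored operation information small.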
In one embodiment of the present disclosure, acquiring the data operation information for one or more data groups in a data warehouse includes: acquiring the data operation information for the one or more data groups from a metadata storage component of the data warehouse.
In one embodiment of the present disclosure, writing the data operation information of the data group satisfying the data synchronization condition into the log library includes: writing the data operation information of the data group satisfying the data synchronization condition into the log library through a log writing thread in the data warehouse.
In one embodiment of the present disclosure, the data group includes one or more data partitions; writing the data operation information of the data group satisfying the data synchronization condition into the log library includes: determining the changed data partitions in the data group satisfying the data synchronization condition; and writing the data operation information of the changed data partitions into the log library.
In one embodiment of the present disclosure, the data operation information of a changed data partition includes: a data group identifier, an operation type, a pre-operation data partition identifier and a post-operation data partition identifier.
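A minimal sketch of such a per-partition record, and of deriving records for changed partitions, is given below. The class name `PartitionOperation`, the operation type strings, and the idea of comparing two sets of partition identifiers are assumptions for illustration; the text only specifies the four fields. Using a null post-operation identifier for a dropped partition is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartitionOperation:
    group_id: str                     # data group identifier
    op_type: str                      # operation type
    partition_before: Optional[str]   # pre-operation data partition identifier
    partition_after: Optional[str]    # post-operation data partition identifier


def changed_partition_ops(group_id, before, after):
    """Build one record per changed partition from two sets of partition identifiers."""
    ops = []
    for p in sorted(after - before):
        # Newly added partition: no pre-operation identifier exists.
        ops.append(PartitionOperation(group_id, "ADD_PARTITION", None, p))
    for p in sorted(before - after):
        # Dropped partition: no post-operation identifier (assumed convention).
        ops.append(PartitionOperation(group_id, "DROP_PARTITION", p, None))
    return ops
```

Unchanged partitions produce no record, so only the changed partitions are written to the log library.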
In one embodiment of the present disclosure, the data operation information comprises data operation information based on a data definition language (DDL).
According to a second aspect of embodiments of the present disclosure, there is provided a data processing method, comprising: acquiring data operation information from a log library; the log library is used for storing data operation information of a data group meeting data synchronization conditions in the data warehouse; and updating the cache data of the data query engine according to the data operation information.
In one embodiment of the present disclosure, the cache of the data query engine is used to store metadata of one or more data groups in the data warehouse; updating the cache data of the data query engine according to the data operation information includes: updating the metadata of a data group to be updated in the cache of the data query engine according to the data operation information, wherein the data group to be updated is the data group corresponding to the data operation information.
In one embodiment of the present disclosure, updating the metadata of the data group to be updated in the cache of the data query engine according to the data operation information includes: merging the data operation information corresponding to the same data group to be updated; and updating the metadata of the data group to be updated in the cache of the data query engine according to the merged data operation information.
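The text does not say how operations on the same data group are merged. One plausible reading, sketched below under that assumption, is to group log entries by data group and drop repeated references to the same partition, so that each data group is refreshed once per synchronization pass. The function name `merge_operations` and the `(group_id, partition_id)` tuple format are hypothetical.

```python
def merge_operations(ops):
    """Group (group_id, partition_id) pairs by data group, preserving log order
    and dropping duplicate partition identifiers within a group."""
    merged = {}
    for group_id, partition_id in ops:
        partitions = merged.setdefault(group_id, [])
        if partition_id not in partitions:
            partitions.append(partition_id)
    return merged
```

With this grouping, several log entries that touch one data group result in a single cache update for that group.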
In one embodiment of the present disclosure, the data group includes one or more data partitions; the data operation information includes: a data group identifier, an operation type, a pre-operation data partition identifier and a post-operation data partition identifier; updating the metadata of the data group to be updated in the cache of the data query engine according to the data operation information includes: looking up the data group corresponding to the data group identifier in the data warehouse, and acquiring, from that data group, the metadata of the data partition corresponding to the post-operation data partition identifier; and looking up the data group to be updated corresponding to the data group identifier in the cache of the data query engine, and updating, according to the acquired metadata, the data partition in the data group to be updated that corresponds to the pre-operation data partition identifier.
In one embodiment of the present disclosure, when the operation type is adding a data partition, the pre-operation data partition identifier is a null value; updating the data partition corresponding to the pre-operation data partition identifier in the data group to be updated according to the acquired metadata includes: creating a new data partition in the data group to be updated, and updating the new data partition according to the acquired metadata.
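The lookup-and-update steps above can be sketched as follows. This is a simplified model in which both the data warehouse and the query engine cache are nested dictionaries (`group_id -> partition_id -> metadata`); the function name `apply_partition_operation` and the dictionary keys of `op` are hypothetical.

```python
def apply_partition_operation(cache, warehouse, op):
    """Refresh one partition in the query engine cache from the data warehouse.

    op is a dict with the keys "group_id", "partition_before" and
    "partition_after" (hypothetical field names).
    """
    # Fetch the fresh metadata of the post-operation partition from the warehouse.
    fresh = warehouse[op["group_id"]][op["partition_after"]]
    # Locate the data group to be updated in the cache.
    group = cache.setdefault(op["group_id"], {})
    if op["partition_before"] is not None:
        # Existing partition: discard the stale entry before writing the new one.
        group.pop(op["partition_before"], None)
    # A null pre-operation identifier means a new partition is simply created.
    group[op["partition_after"]] = dict(fresh)
    return cache
```

Only the affected partition is reloaded; the rest of the cached metadata for the data group is left untouched.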
In one embodiment of the present disclosure, the method further comprises: collecting statistics on the metadata in the cache of the data query engine to update metadata statistics.
In one embodiment of the disclosure, collecting statistics on the metadata in the cache of the data query engine includes: determining the data groups for which statistics are to be collected, namely the data groups updated since statistics were last collected; and collecting statistics on the metadata of those data groups in the cache of the data query engine.
In one embodiment of the disclosure, collecting statistics on the metadata in the cache of the data query engine includes: collecting statistics on the metadata in the cache of the data query engine when a statistics trigger condition is satisfied.
In one embodiment of the present disclosure, the statistics trigger condition includes at least one of: the proportion of the data groups for which statistics are to be collected among all metadata in the cache of the data query engine reaches a preset proportion; a preset statistics time is reached; and a statistics trigger operation input by a user is received.
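A hedged sketch of the three trigger conditions follows. The function name, the parameter names, and the default thresholds (20 percent, one hour) are assumptions chosen for illustration; modelling the "preset statistics time" as an elapsed interval is also an assumption.

```python
def should_collect_statistics(pending_groups, total_groups, seconds_since_last,
                              user_requested, ratio_threshold=0.2,
                              interval_seconds=3600):
    """Return True if any of the (assumed) statistics trigger conditions holds."""
    if user_requested:
        # A statistics trigger operation was input by the user.
        return True
    if seconds_since_last >= interval_seconds:
        # The preset statistics time has been reached.
        return True
    # The share of data groups awaiting statistics reaches the preset proportion.
    return total_groups > 0 and pending_groups / total_groups >= ratio_threshold
```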
In one embodiment of the disclosure, collecting statistics on the metadata in the cache of the data query engine to update metadata statistics includes: collecting statistics on the metadata in the cache of the data query engine through a statistics thread in the data query engine, so as to update the metadata statistics.
In one embodiment of the present disclosure, after updating the metadata statistics, the method further comprises: executing a data query task based on the metadata statistics.
In one embodiment of the disclosure, the data query task includes a join query task; executing the data query task based on the metadata statistics includes: determining, in the cache of the data query engine, a plurality of data groups associated with the join query task; determining a driving data group among the plurality of data groups based on the metadata statistics of the plurality of data groups; and broadcasting the data of the driving data group to a plurality of execution nodes of the data query engine, so that the plurality of execution nodes jointly execute the join query task according to the data of the driving data group.
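The text does not state which statistic selects the driving data group. A common choice in broadcast joins, assumed here, is to pick the data group with the fewest rows so that the broadcast copy is as small as possible. The function names and the row-count dictionary below are hypothetical.

```python
def choose_driving_group(row_counts):
    """Pick the data group with the smallest row count (assumed criterion).
    Ties are broken by name so the choice is deterministic."""
    return min(row_counts, key=lambda g: (row_counts[g], g))


def broadcast(driving_rows, nodes):
    """Give every execution node its own copy of the driving data group's rows."""
    return {node: list(driving_rows) for node in nodes}
```

For example, with row counts `{"orders": 1000000, "regions": 50, "users": 20000}`, `choose_driving_group` selects `"regions"`. Stale statistics could lead to a large data group being broadcast, which is why the statistics are refreshed before the join is planned.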
In one embodiment of the present disclosure, acquiring the data operation information from the log library includes: reading the newly added data operation information in the log library when a synchronization trigger condition is satisfied.
In one embodiment of the present disclosure, the synchronization trigger condition includes at least one of: the newly added data operation information in the log library reaches a preset quantity; a preset synchronization time is reached; and a synchronization trigger operation input by a user is received.
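The read side can be sketched as a poll that returns only the entries appended since the last synchronization. The list-based log library, the offset bookkeeping, the function name, and the default thresholds are assumptions for this sketch.

```python
def read_new_operations(log_library, last_offset, seconds_since_sync,
                        user_requested, min_new=100, interval_seconds=60):
    """Return (new_entries, new_offset); new_entries is empty if nothing triggers."""
    new_count = len(log_library) - last_offset
    triggered = (
        user_requested                              # user-initiated synchronization
        or new_count >= min_new                     # preset quantity reached
        or seconds_since_sync >= interval_seconds   # preset synchronization time reached
    )
    if not triggered:
        return [], last_offset
    return log_library[last_offset:], len(log_library)
```

The returned offset is passed back in on the next call, so each log entry is read once.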
In one embodiment of the disclosure, the log library is further used to store attribute operation information for the synchronization attribute parameter of a data group; the method further comprises: acquiring the attribute operation information from the log library; and updating, according to the attribute operation information, the data synchronization state in the data query engine of the data group corresponding to the attribute operation information.
In one embodiment of the present disclosure, updating the data synchronization state in the data query engine of the data group corresponding to the attribute operation information includes: when the attribute operation information indicates that the synchronization attribute parameter is changed from a non-preset value to the preset value, updating the data synchronization state of that data group in the data query engine from unsynchronized to synchronized.
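As a small illustration of this state transition, the sketch below flips a per-group flag when an attribute operation changes the synchronization attribute to the preset value. The dictionary keys `old_value` and `new_value`, the state strings, and the preset value `"true"` are all hypothetical.

```python
def update_sync_state(sync_states, attr_op, preset_value="true"):
    """Mark a data group as synchronized when its synchronization attribute
    changes from a non-preset value to the preset value."""
    if attr_op["old_value"] != preset_value and attr_op["new_value"] == preset_value:
        sync_states[attr_op["group_id"]] = "synchronized"
    return sync_states
```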
In one embodiment of the disclosure, updating the cache data of the data query engine according to the data operation information includes: loading the data operation information through a synchronization thread in the data query engine, and updating the cache data of the data query engine according to the data operation information.
According to a third aspect of embodiments of the present disclosure, there is provided a data processing apparatus comprising: an operation information acquisition module configured to acquire data operation information for one or more data groups in the data warehouse; and the operation information writing module is configured to write the data operation information of the data group meeting the data synchronization condition into a log library so that a data query engine updates cache data of the data query engine by acquiring the data operation information in the log library.
In one embodiment of the present disclosure, the operation information writing module is configured to: acquire synchronization attribute parameters of the one or more data groups; and write the data operation information of each data group whose synchronization attribute parameter is a preset value into the log library.
In one embodiment of the present disclosure, the operation information acquisition module is configured to acquire attribute operation information for the synchronization attribute parameter; the operation information writing module is configured to write the attribute operation information into the log library so that the data query engine obtains the attribute operation information from the log library.
In one embodiment of the present disclosure, the operation information writing module is configured to write the attribute operation information into the log library when the attribute operation information indicates that the synchronization attribute parameter is changed from a non-preset value to the preset value.
In one embodiment of the disclosure, the operation information writing module is configured to write data operation information of the data group located on a white list to the log library.
In one embodiment of the present disclosure, the operation information acquisition module is configured to acquire the data operation information for one or more data groups in the data warehouse from a metadata storage component of the data warehouse.
In one embodiment of the disclosure, the operation information writing module is configured to write, by a log writing thread in the data warehouse, data operation information of the data group satisfying a data synchronization condition to a log library.
In one embodiment of the present disclosure, the data group includes one or more data partitions; the operation information writing module is configured to: determine the changed data partitions in the data group satisfying the data synchronization condition; and write the data operation information of the changed data partitions into the log library.
In one embodiment of the present disclosure, the data operation information of a changed data partition includes: a data group identifier, an operation type, a pre-operation data partition identifier and a post-operation data partition identifier.
In one embodiment of the present disclosure, the data operation information comprises data operation information based on a data definition language (DDL).
According to a fourth aspect of embodiments of the present disclosure, there is provided a data processing apparatus comprising: the operation information acquisition module is configured to acquire data operation information from the log library; the log library is used for storing data operation information of a data group meeting data synchronization conditions in the data warehouse; and the cache data updating module is configured to update cache data of the data query engine according to the data operation information.
In one embodiment of the present disclosure, the cache of the data query engine is used to store metadata for one or more data groups in the data warehouse; and the cache data updating module is configured to update metadata of a data group to be updated in a cache of the data query engine according to the data operation information, wherein the data group to be updated is the data group corresponding to the data operation information.
In one embodiment of the disclosure, the cache data updating module is configured to: merge the data operation information corresponding to the same data group to be updated; and update the metadata of the data group to be updated in the cache of the data query engine according to the merged data operation information.
In one embodiment of the present disclosure, the data group includes one or more data partitions; the data operation information includes: a data group identifier, an operation type, a pre-operation data partition identifier and a post-operation data partition identifier; the cache data updating module is configured to: look up the data group corresponding to the data group identifier in the data warehouse, and acquire, from that data group, the metadata of the data partition corresponding to the post-operation data partition identifier; and look up the data group to be updated corresponding to the data group identifier in the cache of the data query engine, and update, according to the acquired metadata, the data partition in the data group to be updated that corresponds to the pre-operation data partition identifier.
In one embodiment of the present disclosure, when the operation type is adding a data partition, the pre-operation data partition identifier is a null value; the cache data updating module is configured to create a new data partition in the data group to be updated, and update the new data partition according to the acquired metadata.
In one embodiment of the present disclosure, the apparatus further comprises: a metadata statistics module configured to collect statistics on the metadata in the cache of the data query engine so as to update metadata statistics.
In one embodiment of the disclosure, the metadata statistics module is configured to: determine the data groups for which statistics are to be collected, namely the data groups updated since statistics were last collected; and collect statistics on the metadata of those data groups in the cache of the data query engine.
In one embodiment of the disclosure, the metadata statistics module is configured to collect statistics on the metadata in the cache of the data query engine when a statistics trigger condition is satisfied.
In one embodiment of the present disclosure, the statistics trigger condition includes at least one of: the proportion of the data groups for which statistics are to be collected among all metadata in the cache of the data query engine reaches a preset proportion; a preset statistics time is reached; and a statistics trigger operation input by a user is received.
In one embodiment of the disclosure, the metadata statistics module is configured to collect statistics on the metadata in the cache of the data query engine through a statistics thread in the data query engine, so as to update the metadata statistics.
In one embodiment of the present disclosure, the apparatus further comprises: and the query task processing module is configured to execute a data query task based on the metadata statistical information.
In one embodiment of the disclosure, the data query task includes a join query task; the query task processing module is configured to: determine, in the cache of the data query engine, a plurality of data groups associated with the join query task; determine a driving data group among the plurality of data groups based on the metadata statistics of the plurality of data groups; and broadcast the data of the driving data group to a plurality of execution nodes of the data query engine, so that the plurality of execution nodes jointly execute the join query task according to the data of the driving data group.
In one embodiment of the disclosure, the operation information obtaining module is configured to read the newly added data operation information in the log library when the synchronization trigger condition is satisfied.
In one embodiment of the present disclosure, the synchronization trigger condition includes at least one of: the newly added data operation information in the log library reaches a preset quantity; a preset synchronization time is reached; and a synchronization trigger operation input by a user is received.
In one embodiment of the disclosure, the log library is further used to store attribute operation information for the synchronization attribute parameter of a data group; the operation information acquisition module is configured to acquire the attribute operation information from the log library; and the cache data updating module is configured to update, according to the attribute operation information, the data synchronization state in the data query engine of the data group corresponding to the attribute operation information.
In one embodiment of the disclosure, the cache data updating module is configured to update the data synchronization state in the data query engine of the data group corresponding to the attribute operation information from unsynchronized to synchronized when the attribute operation information indicates that the synchronization attribute parameter is changed from a non-preset value to the preset value.
In one embodiment of the disclosure, the cache data updating module is configured to load the data operation information through a synchronization thread in the data query engine, and update cache data of the data query engine according to the data operation information.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described data processing methods.
According to a sixth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the data processing methods described above via execution of the executable instructions.
According to the data processing method, the data processing device, the computer readable storage medium and the electronic equipment, on one hand, when the data in the data warehouse changes and the related information of the data warehouse changes, the data warehouse information stored in the cache of the data query engine can be automatically updated, so that the data query engine is consistent with the information in the data warehouse, the accuracy of the result of the data query task executed by the data query engine according to the data warehouse information in the cache is ensured, manual intervention is not needed in the updating process, and the user experience is good. On the other hand, the scheme realizes the simplification of the data operation information stored in the log library, reduces the writing cost of writing the data operation information into the log library by the data warehouse and the storage cost of storing the data operation information by the log library, and simultaneously reduces the reading cost of acquiring the data operation information from the log library by the data query engine and the calculation cost of processing the data operation information, thereby improving the performance and the processing efficiency of the whole system.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
FIG. 1 shows a schematic diagram of an Impala architecture;
FIG. 2 shows a schematic diagram of Impala performing a data query task;
FIG. 3 illustrates a system architecture diagram of an operating environment in an embodiment of the present disclosure;
FIG. 4 illustrates a flow chart of a data processing method performed by a data warehouse in an embodiment of the present disclosure;
FIG. 5 illustrates a flow chart of data manipulation information processing according to synchronization attribute parameters in an embodiment of the present disclosure;
FIG. 6 illustrates a flow chart of an attribute operation information process in an embodiment of the present disclosure;
FIG. 7 shows a flow architecture diagram of a Hive-based data processing method in an embodiment of the present disclosure;
FIG. 8 illustrates a flow chart of a data processing method performed by a data query engine in an embodiment of the present disclosure;
FIG. 9 illustrates a flow chart for updating cached data in an embodiment of the present disclosure;
FIG. 10 illustrates a flow chart of performing a join query task in an embodiment of the present disclosure;
FIG. 11 shows a schematic diagram of broadcasting a driving table in an embodiment of the present disclosure;
FIG. 12 illustrates a flow chart of updating a data synchronization state in an embodiment of the present disclosure;
FIG. 13 shows a flow architecture diagram of an Impala-based data processing method in an embodiment of the present disclosure;
FIG. 14 shows a flow architecture diagram of a Hive and Impala based data processing method in an embodiment of the present disclosure;
FIG. 15 shows a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 16 shows a schematic diagram of another data processing apparatus according to an embodiment of the present disclosure; and
FIG. 17 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present disclosure and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the present disclosure may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms, namely: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the present disclosure, there are provided a data processing method, a data processing apparatus, a computer-readable storage medium, and an electronic device.
Any number of elements in the figures are for illustration and not limitation, and any naming is used for distinction only, and not for any limiting sense.
The principles and spirit of the present disclosure are described in detail below with reference to several representative embodiments thereof.
Summary of The Invention
In related-art data query engines, the stored data warehouse information cannot be updated effectively, so the data warehouse information held by the data query engine falls out of sync with the actual data warehouse information, which in turn leads to incorrect query results. This is described below using the data warehouse Hive and the data query engine Impala as an example.
The metadata of every data table in Hive is stored centrally on the HMS (the Hive metadata storage service) and includes basic table information, column information, partition information, the storage locations of tables and partitions, and the like. FIG. 1 shows the basic framework of an Impala cluster, which includes the Impalad, Statestored and Catalogd services. Catalogd is deployed with a Metastore that caches the metadata held on the HMS; Statestored synchronizes the metadata cached on Catalogd to the Metastore of each Impalad; and Impalad processes users' query requests.
Impalad can be further divided into three services: Planner, Coordinator and Executor. The Planner generates a single-node execution plan from the user's query request, optimizes it based on rules and table/column statistics, and finally generates a distributed execution plan. The Coordinator distributes the execution plan to different Impalad nodes. After receiving the execution plan, each Impalad node executes the statement fragments assigned to it, which includes reading data, executing Join operations between two tables, and executing aggregation operations such as summation and averaging. An Impala cluster typically has at least two Coordinator nodes and at least two Executor nodes.
Referring to FIG. 2, assume that a user creates a data table T1 in Hive, inserts a batch of data, and then queries T1 using Impala. Because neither Catalogd nor the Impalad Metastore holds the metadata of T1, the query reports an error indicating that T1 does not exist. In this case, the user generally has to manually execute the invalidate metadata T1 and refresh commands, which invalidate the cached T1 metadata and reload it from Hive, before the query can run normally. Because the data cached by Impala is inconsistent with the Hive metadata, data query efficiency is greatly affected and manual intervention is required, resulting in a poor user experience.
In the related art, an information synchronization function has been added to the data query engine, but this function synchronizes all data update information in the data warehouse to the data query engine, and part of that information is redundant because it is irrelevant to the business of the data query engine. For example, the data tables relevant to the Impala business may account for only 20% of all data tables in Hive; the update information of the other 80% of data tables is redundant for the data query engine, which wastes resources and prevents data query efficiency from being improved effectively.
Impala version 3.2 added a metadata auto-sync function. When a user performs a data operation on Hive, Hive writes an operation log entry to the newly added log table notification_log in the HMS. Impala periodically pulls the operation logs from notification_log on the HMS and updates the metadata in its cache by parsing them, thereby keeping the metadata on the Impala side consistent with that on the Hive side. Impala 3.2 also provides a function for filtering out the operation logs of certain data tables: the user may set the impala.disableHmsSync parameter of data table T1 to true or false. If it is true, Impala filters out the operation logs of T1 when pulling operation logs from notification_log; if it is false, it does not. However, when the impala.disableHmsSync parameter is set on a data table in Hive, Impala cannot perceive the change, and the user still has to manually execute the invalidate metadata T1 and refresh commands before the change actually takes effect on the Impala side, which is inconvenient. In addition, notification_log records the complete information of every operation log entry, so it requires a large amount of storage space and is costly to maintain.
In view of the above, the basic idea of the present disclosure is as follows. On the one hand, when data in the data warehouse changes and thereby changes the related data warehouse information, the data warehouse information stored in the cache of the data query engine can be updated automatically, so that the data query engine stays consistent with the data warehouse. This ensures the accuracy of the results the data query engine produces when it executes data query tasks based on the cached data warehouse information; the update requires no manual intervention, so the user experience is good. On the other hand, the scheme slims down the data operation information stored in the log library. Compared with the notification_log log table in the related art, the amount of data operation information to be stored is greatly reduced, which lowers the storage cost of the log library, the cost for the data warehouse of writing data operation information into the log library, the cost for the data query engine of reading data operation information from the log library, and the computing cost of processing that information, thereby improving the performance and processing efficiency of the whole system.
Having described the basic principles of the present disclosure, various non-limiting embodiments of the present disclosure are specifically described below.
Application scene overview
It should be noted that the following application scenarios are only shown for facilitating understanding of the spirit and principles of the present disclosure, and embodiments of the present disclosure are not limited in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
The present disclosure may be applied to any scenario with a data warehouse and data query engine architecture. For example, in large-scale e-commerce, news, game and similar applications, the server side typically deploys a data warehouse to store all relevant data and a data query engine to execute data query tasks. With the data processing method of the exemplary embodiments of the present disclosure, information can be synchronized between the data query engine and the data warehouse, improving the efficiency of the data query engine.
Exemplary System
The system architecture of the operating environment of an exemplary embodiment of the present disclosure is described below in conjunction with FIG. 3. It should be noted that FIG. 3 shows only the system components necessary to implement the data processing method of the exemplary embodiments of the present disclosure.
Referring to FIG. 3, a system architecture 300 includes a data warehouse 310 and a data query engine 320. The data warehouse 310 includes a data warehouse server 311 and a log library 312. The data warehouse server 311 is configured to execute the data processing method on the data warehouse 310 side according to an exemplary embodiment of the present disclosure; it may be a single computer or a cluster of multiple computers on which data warehouse programs are installed, or another electronic device with processing capability. The log library 312 is used to store data operation information and other related information of the data warehouse 310; it may be a dedicated storage space within the data warehouse 310, or one or more dedicated storage devices, and may use a relational database such as MySQL. The data query engine 320 includes a data query engine server 321 and a cache 322. The data query engine server 321 is configured to execute the data processing method on the data query engine 320 side according to an exemplary embodiment of the present disclosure; it may be a single computer or a cluster of multiple computers on which data query engine programs are installed, or another electronic device with processing capability. The cache 322 is used to store information about the data warehouse 310, such as the metadata of data tables in the data warehouse 310; it may be a storage space dedicated to the data query engine 320, or one or more dedicated storage devices.
It should be noted that the cache 322 is so called because the life cycle of its data is short relative to the data warehouse 310, which stores data persistently; the present disclosure does not, however, specifically limit the life cycle of data in the cache 322. For example, the cache 322 may use a FIFO (First In First Out) or LRU (Least Recently Used) mechanism, among others, to retire data, in which case the life cycle of the data depends on the size of the cache 322 and on how frequently the data is used or updated.
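As a minimal sketch of one such retirement mechanism, the following Python snippet shows an LRU cache holding per-table metadata. The class name `LruMetadataCache`, the capacity and the stored values are illustrative assumptions and are not taken from the disclosure.

```python
from collections import OrderedDict


class LruMetadataCache:
    """Hypothetical LRU cache keyed by data group (table) name."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, table_name):
        if table_name not in self._entries:
            return None
        # A hit makes the entry the most recently used one.
        self._entries.move_to_end(table_name)
        return self._entries[table_name]

    def put(self, table_name, metadata):
        self._entries[table_name] = metadata
        self._entries.move_to_end(table_name)
        if len(self._entries) > self.capacity:
            # Retire the least recently used entry.
            self._entries.popitem(last=False)

    def tables(self):
        return list(self._entries)


cache = LruMetadataCache(capacity=2)
cache.put("T1", {"columns": ["id"]})
cache.put("T2", {"columns": ["id", "name"]})
cache.get("T1")                       # T1 becomes the most recently used entry
cache.put("T3", {"columns": ["ts"]})  # capacity exceeded, so T2 is retired
```

With a capacity of two, reading T1 before inserting T3 causes T2, not T1, to be retired, which illustrates how usage frequency affects the life cycle of cached data.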
A user may operate on data in the data warehouse 310, for example by creating a new data table or deleting an existing one. The user may also perform data queries through the data query engine 320: for example, the user enters a query, and the data query engine 320 queries the data warehouse 310 for the relevant data and presents it.
Exemplary method
Fig. 4 shows a flow chart of a data processing method performed by a data warehouse, comprising the following steps S410 and S420:
Step S410, obtaining data operation information for one or more data groups in the data warehouse.
A data group may be a data storage unit at any level of the data warehouse, such as a sub-database, a data table, a data partition or a data slice. Its specific form may depend on how the data warehouse organizes data, which is not limited in the present disclosure.
The data operation information is log information describing operations performed on the data in a data group. It includes DDL (Data Definition Language) operation information and DML (Data Manipulation Language) operation information. DDL operation information may describe operations that create, modify or delete objects such as tables, indexes and users using commands such as create, alter and drop; DML operation information may describe operations that add, modify or delete records (rows) in a table using commands such as insert, update and delete. It should be noted that data operation information may originate from a data operation performed manually by a user, for example when the user manually enters a create, alter or drop command, or from a data operation performed automatically by the system, for example when the data warehouse automatically obtains business data from a third-party database and adds it to the local business data.
In one embodiment, all data manipulation information in the data warehouse may be obtained, i.e., information for each data manipulation for each data group in the data warehouse.
In another embodiment, the data operation information in the data warehouse may be screened, and only the data operation information meeting a certain condition may be obtained to meet the actual requirement. For example, only data operation information based on a data definition language, that is, DDL operation information may be acquired; alternatively, when data synchronization is periodically performed, only the data operation information in the current period may be acquired.
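The screening described above can be sketched as follows. This is an illustrative example only: the `Operation` record, its field names and the integer timestamps are assumptions, and the function combines the two example conditions (DDL only, current period only) that the text presents as alternatives.

```python
from dataclasses import dataclass

DDL_COMMANDS = {"create", "alter", "drop"}


@dataclass(frozen=True)
class Operation:
    table_name: str
    command: str     # e.g. "create" (DDL) or "insert" (DML)
    event_time: int  # seconds on an arbitrary clock (assumed unit)


def screen_operations(operations, period_start, period_end, ddl_only=True):
    """Keep operations inside [period_start, period_end) and, optionally, only DDL."""
    selected = []
    for op in operations:
        if not (period_start <= op.event_time < period_end):
            continue
        if ddl_only and op.command.lower() not in DDL_COMMANDS:
            continue
        selected.append(op)
    return selected


ops = [
    Operation("T1", "create", 100),
    Operation("T1", "insert", 130),
    Operation("T2", "alter", 250),
    Operation("T3", "drop", 420),
]
# Only DDL operations within the current period [0, 300) are kept.
current = screen_operations(ops, period_start=0, period_end=300)
```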
Step S420, writing the data operation information of the data group meeting the data synchronization condition into the log library, so that the data query engine updates the cache data of the data query engine by acquiring the data operation information in the log library.
The data synchronization condition is used to determine whether the information of a given data group needs to be synchronized to the data query engine. In general, if a data group is relevant to the business of the data query engine, the data query engine may use the information of that data group, so that information needs to be synchronized to the data query engine. The data synchronization condition may be set manually by a user, or set by the system according to the business scope of the data query engine.
In the present exemplary embodiment, the log library is used to store the data operation information of those data groups that satisfy the data synchronization condition, instead of all the data operation information, that is, when the data operation information is written into the log library, the data operation information is filtered once according to the data synchronization condition, so that the number of data operation information stored in the log library is reduced, and resource overhead is reduced.
The present disclosure is not limited to a specific form of the data synchronization condition, and is specifically described below by way of two examples:
(1) The data synchronization condition is achieved by setting a synchronization attribute parameter in the data set. Referring to fig. 5, the writing of the data operation information of the data group satisfying the data synchronization condition into the log library may include:
Step S510, obtaining the synchronization attribute parameters of the one or more data groups;
Step S520, writing the data operation information of each data group whose synchronization attribute parameter is a preset value into the log library.
The synchronization attribute parameter indicates whether the information of a data group needs to be synchronized to the data query engine. It generally takes one of two kinds of values: a preset value or a non-preset value. When the synchronization attribute parameter is the preset value, the information of the data group needs to be synchronized to the data query engine; when it is a non-preset value, the information of the data group does not need to be synchronized. For example, a synchronization attribute parameter sync_metastore is set in each data table; a value of true (or 1) indicates that synchronization is required, and a value of false (or 0) indicates that it is not. That is, true (or 1) is the preset value and false (or 0) is the non-preset value.
After the data operation information is acquired, the synchronization attribute parameters of the data groups corresponding to that data operation information are acquired, and the data operation information of each data group whose synchronization attribute parameter is the preset value is written into the log library.
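Steps S510 and S520 can be sketched as a simple filter. The sketch below is illustrative only: the dictionary layout, the table names, the event type strings, the string-valued property and the key `sync_metastore` (an assumed spelling of the parameter named above) are all assumptions.

```python
def filter_by_sync_attribute(operations, table_properties,
                             attribute="sync_metastore", preset_value="true"):
    """Return only the operations whose table has the attribute set to the preset value."""
    kept = []
    for op in operations:
        properties = table_properties.get(op["table_name"], {})
        # Assumption: a table without the attribute is treated as "do not synchronize".
        if properties.get(attribute) == preset_value:
            kept.append(op)
    return kept


table_properties = {
    "orders": {"sync_metastore": "true"},
    "tmp_scratch": {"sync_metastore": "false"},
    "users": {},
}
operations = [
    {"table_name": "orders", "event_type": "ADD_PARTITION"},
    {"table_name": "tmp_scratch", "event_type": "DROP_TABLE"},
    {"table_name": "users", "event_type": "ALTER_TABLE"},
    {"table_name": "orders", "event_type": "DROP_PARTITION"},
]
# Only the two operations on "orders" would be written to the log library.
log_library = filter_by_sync_attribute(operations, table_properties)
```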
It should be noted that the synchronization attribute parameter is generally set in the data group, for example, recorded in header information of the data table. For a data warehouse, the synchronization attribute parameters in the data set may be read directly to perform the method shown in fig. 5. For the data query engine, as the data warehouse end has filtered the data operation information, the data query engine only needs to read the data operation information from the log library and synchronously buffer the data, i.e. the data query engine does not need to acquire the synchronous attribute parameters.
In one embodiment, the synchronization attribute parameters of each data group in the data warehouse may also be recorded in the data query engine. In that case, when the filtering mechanism in the data warehouse malfunctions, that is, when data operation information of a data group that does not satisfy the data synchronization condition ends up in the log library, the data query engine can filter the data operation information itself according to the synchronization attribute parameters of each data group; alternatively, the data warehouse and the data query engine may each filter once to guarantee the filtering result. For this to work, the synchronization attribute parameters recorded in the data query engine need to be kept consistent with those in the data warehouse. In one embodiment, referring to FIG. 6, the data processing method may further include:
Step S610, acquiring attribute operation information for the synchronization attribute parameters;
Step S620, writing the attribute operation information into the log library, so that the data query engine obtains the attribute operation information from the log library.
The attribute operation information refers to operation information that changes a synchronization attribute parameter. For example, when the user manually changes the sync_metastore value of a data table, that operation is stored as one piece of attribute operation information. Attribute operation information generally comes in two kinds: changing the synchronization attribute parameter from a non-preset value to the preset value, which means that a data group whose information did not need to be synchronized now needs to be synchronized; and changing the synchronization attribute parameter from the preset value to a non-preset value, which means that a data group whose information needed to be synchronized no longer needs to be. In this embodiment, the attribute operation information may be written into the log library, so that the data query engine can obtain it from the log library and update the synchronization attribute parameters it has recorded accordingly.
It should be noted that attribute operation information and data operation information describe operations on different objects in the data warehouse: attribute operation information concerns the synchronization attribute parameter of a data group, whereas data operation information concerns the data in a data group.
In one embodiment, the writing of the attribute operation information into the log library may be achieved by:
When the attribute operation information changes the synchronization attribute parameter from a non-preset value to the preset value, the attribute operation information is written into the log library.
For example, when a user changes the sync_metastore of a data table from false to true, that piece of attribute operation information is written into the log library; conversely, when the user changes the sync_metastore of a data table from true to false, the attribute operation information need not be written into the log library. Thus, besides the data operation information, the log library only needs to store attribute operation information that changes the parameter from a non-preset value to the preset value. This ensures that the data query engine does not miss any data group that needs to be synchronized, while further reducing what is stored in the log library.
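The write rule for attribute operation information can be sketched as below. The function names, the tuple layout of a change and the event type string `ENABLE_SYNC` are hypothetical and chosen only for illustration.

```python
def should_log_attribute_change(old_value, new_value, preset_value=True):
    """Log only the transition from a non-preset value to the preset value."""
    return old_value != preset_value and new_value == preset_value


def record_attribute_changes(changes, log_library):
    """Append qualifying changes, given as (table, old, new) tuples, to log_library."""
    for table_name, old_value, new_value in changes:
        if should_log_attribute_change(old_value, new_value):
            log_library.append({"table_name": table_name,
                                "event_type": "ENABLE_SYNC"})
    return log_library


log_library = record_attribute_changes(
    [("T1", False, True),   # false -> true: written to the log library
     ("T2", True, False),   # true -> false: not written
     ("T3", True, True)],   # unchanged: not written
    [],
)
```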
(2) The data synchronization condition is implemented by setting a whitelist of data groups. The data groups on the whitelist are all data groups relevant to the business of the data query engine, that is, the data groups whose information needs to be synchronized. Based on this, writing the data operation information of the data groups satisfying the data synchronization condition into the log library may include: writing the data operation information of the data groups on the whitelist into the log library. Data operation information for data groups outside the whitelist is thereby filtered out.
In one embodiment, the whitelist may be configured in the data warehouse while configured in the data query engine. In order to ensure that the whitelists at the two ends are consistent, when the whitelist in the data warehouse is changed, corresponding whitelist operation information can be written into the log library, so that the data query engine can acquire the whitelist operation information in the log library, and then the whitelist in the data query engine can be synchronously updated.
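A minimal sketch of the whitelist approach, including replaying whitelist changes on the data query engine side, might look as follows. The event type names `WHITELIST_ADD` and `WHITELIST_REMOVE` and all function names are hypothetical; the disclosure does not specify the format of whitelist operation information.

```python
def write_whitelisted(operations, whitelist, log_library):
    """Append only operations on whitelisted tables to log_library."""
    for op in operations:
        if op["table_name"] in whitelist:
            log_library.append(op)


def apply_whitelist_events(engine_whitelist, log_library):
    """Replay hypothetical whitelist events on the query engine's own copy."""
    for entry in log_library:
        if entry["event_type"] == "WHITELIST_ADD":
            engine_whitelist.add(entry["table_name"])
        elif entry["event_type"] == "WHITELIST_REMOVE":
            engine_whitelist.discard(entry["table_name"])
    return engine_whitelist


warehouse_whitelist = {"T2", "T3"}
engine_whitelist = {"T2", "T3"}
log_library = []

# The warehouse adds T5 to its whitelist and records that change.
warehouse_whitelist.add("T5")
log_library.append({"table_name": "T5", "event_type": "WHITELIST_ADD"})

# T1 is not whitelisted, so only the T5 operation reaches the log library.
write_whitelisted(
    [{"table_name": "T1", "event_type": "ALTER_TABLE"},
     {"table_name": "T5", "event_type": "ADD_PARTITION"}],
    warehouse_whitelist,
    log_library,
)
apply_whitelist_events(engine_whitelist, log_library)
```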
The specific form of the data synchronization condition is described above by way of two examples. The data synchronization condition is used for filtering the data operation information, so that the writing cost of the data operation information written into the log library by the data warehouse, the storage cost of the log library and the reading cost of the data operation information acquired from the log library by the data query engine can be reduced.
To further reduce overhead, each piece of data manipulation information itself may also be reduced.
In one embodiment, each data set may include one or more data partitions, which are data storage units one level lower than the data set. Based on this, the writing of the data operation information of the data group satisfying the data synchronization condition into the log library may include the steps of:
Determining the changed data partition in the data group meeting the data synchronization condition;
and writing the data operation information of the changed data partition into a log library.
The changed data partition is the data partition which performs the data operation, and other data partitions in the data group do not perform the data operation, namely, do not change before and after the operation, so that only the data operation information of the changed data partition can be written into the log library, namely, the data operation information in the log library does not comprise the related information of the unchanged data partition, and the content of the data operation information is simplified.
In one embodiment, the data operation information of the changed data partition may include: a data group identifier (table_name), an operation type (event_type), a pre-operation data partition identifier (before section_name), and a post-operation data partition identifier (after section_name). After obtaining the data operation information, the data query engine can determine which data partition of which data group in the data warehouse is changed according to the contents in the fields, so that the information of the data partitions is synchronized.
It should be noted that, if the operation type is the newly added data partition, the data partition identifier before the operation is a null value, and the data partition identifier after the operation is the newly added data partition identifier; if the operation type is deleting the data partition, the data partition before the operation is identified as the deleted data partition, and the data partition after the operation is identified as a null value.
In addition, other related contents such as an operation event identification (event_id), an operation time (event_time), and the like may be included in the data operation information.
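One possible in-memory shape for such a record is sketched below. The class and helper names, the event type strings, the partition naming and the use of `None` for the null value are assumptions; only the general set of fields follows the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartitionOperation:
    event_id: int
    event_time: int
    table_name: str
    event_type: str                  # e.g. "ADD_PARTITION" (assumed naming)
    before_partition: Optional[str]  # None when a partition is newly added
    after_partition: Optional[str]   # None when a partition is deleted


def add_partition(event_id, event_time, table_name, partition):
    return PartitionOperation(event_id, event_time, table_name,
                              "ADD_PARTITION", None, partition)


def drop_partition(event_id, event_time, table_name, partition):
    return PartitionOperation(event_id, event_time, table_name,
                              "DROP_PARTITION", partition, None)


def changed_partitions(op):
    """Partitions whose cached information the query engine must refresh."""
    return [p for p in (op.before_partition, op.after_partition) if p is not None]


added = add_partition(1, 1611300000, "T2", "dt=2021-01-22")
dropped = drop_partition(2, 1611300060, "T2", "dt=2020-01-22")
```

Because the record names only the partitions that changed, the data query engine can refresh those partitions without reloading the unchanged partitions of the same data group.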
In one embodiment, a metadata storage component, such as HMS in Hive, is provided in the data warehouse for recording metadata for each data set in the data warehouse. Based on this, in step S410, data manipulation information for one or more data groups in the data warehouse may be obtained from the metadata storage component of the data warehouse. In general, the metadata storage component can provide a service of data operation information persistence, the metadata storage component is associated with a specific relational database, and all data operation information in the data warehouse is persisted into the database, so that the data operation information can be acquired from the database associated with the metadata storage component, and related data operation information is not required to be collected by each data group, so that the method is more convenient and efficient.
In one embodiment, the data operation information of the data groups satisfying the data synchronization condition may be written into the log library by a log writing thread in the data warehouse. The log writing thread may be a program component added on top of the data warehouse that executes the logic of step S420: it filters the data operation information and writes the filtered data operation information into the log library. The logic can therefore be implemented by adding the log writing thread and its code to the data warehouse, which amounts to extending the functions of the data warehouse; this is easy to implement and does not require modifying the existing logic of the data warehouse.
FIG. 7 shows a flow architecture of a data processing method implemented based on a Hive data warehouse. When data in Hive is operated on, the Hive server writes data operation information into the HMS, for example 10 pieces of data operation information which, in order of operation time, correspond to data tables T1, T2, T3, T4, T1, T4, T6, T2, T7 and T4. A log writing thread then determines whether the data table corresponding to each piece of data operation information satisfies the data synchronization condition. Assuming that the synchronization attribute parameter sync_metastore is false for T1 and true for T2, T3, T4, T6 and T7, then T2, T3, T4, T6 and T7 satisfy the data synchronization condition. The log writing thread filters out the 2 pieces of data operation information of T1 from the 10 pieces and writes the remaining 8 pieces into the log library. The information in the log library can then be read by the data query engine.
Fig. 8 shows a flow chart of a data processing method performed by a data query engine, comprising the following steps S810 and S820:
Step S810, obtaining data operation information from a log library; the log library is used for storing data operation information of data groups meeting data synchronization conditions in the data warehouse.
Because the data warehouse has already filtered the data operation information once when writing it into the log library, everything retained there is data operation information that the data query engine needs, so the data query engine can simply obtain all the data operation information in the log library. For example, the data query engine may monitor the log library for newly written information and, if there is any, obtain it.
In one embodiment, the data query engine may read the newly added data operation information in the log library when the synchronization trigger condition is satisfied. The synchronization trigger condition may include at least one of:
The newly added data operation information in the log library reaches a preset number. The preset number can be set according to actual requirements. For example, each time 1000 pieces of data operation information have been added to the log library, the data query engine reads those 1000 pieces; the data operation information is thus divided into batches of the preset number and obtained batch by batch. Either the data query engine monitors the number of newly added pieces of data operation information in the log library, or the log library notifies the data query engine when it detects that the preset number has been reached;
A preset synchronization time is reached. The synchronization time can be set according to actual requirements. For example, with a period of 5 minutes, every 5-minute mark is a synchronization time, that is, the data query engine reads data operation information from the log library every 5 minutes;
A synchronization trigger operation input by a user is received. The user can manually trigger the data query engine to read the data operation information from the log library and update the cached data.
In addition, other synchronous triggering conditions can be set according to actual demands, or any plurality of synchronous triggering conditions are combined for use, for example, when the preset synchronous time is reached, whether the newly added data operation information in the log library reaches the preset quantity is checked, and if the newly added data operation information in the log library reaches the preset quantity, the newly added data operation information in the log library is read. The present disclosure is not limited in this regard.
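The trigger conditions above, and the combined variant just described, can be sketched as plain predicates. The function names, parameter names and the representation of time as seconds are assumptions; the defaults of 1000 entries and 300 seconds mirror the examples in the text.

```python
def should_synchronize(new_entries, now, last_sync_time, user_requested=False,
                       preset_count=1000, period_seconds=300):
    """Return True if at least one of the three trigger conditions holds."""
    reached_count = new_entries >= preset_count
    reached_time = now - last_sync_time >= period_seconds
    return reached_count or reached_time or user_requested


def should_synchronize_combined(new_entries, now, last_sync_time,
                                preset_count=1000, period_seconds=300):
    """Combined variant: check the count only once the synchronization time is reached."""
    reached_time = now - last_sync_time >= period_seconds
    return reached_time and new_entries >= preset_count
```

For instance, `should_synchronize(5, 300, 0)` is true because the 5-minute period has elapsed, whereas `should_synchronize_combined(5, 300, 0)` is false because fewer than 1000 new entries are available at that time.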
Step S820, updating the cache data of the data query engine according to the acquired data operation information.
Wherein the cache of the data query engine is used to store information about the data warehouse, such as metadata for one or more data groups in the data warehouse (typically data groups that are relevant to the business of the data query engine), full data for a portion of the hot or critical data groups in the data warehouse, etc. The data query engine can update the cache data according to the data operation information obtained from the log library, so that the cache data is consistent with the related information of the data warehouse, and the accuracy of the data query result is ensured.
In one embodiment, in step S820, the data operation information may be loaded by a synchronization thread in the data query engine, which then updates the cache data of the data query engine according to the data operation information. If the data query engine has no data synchronization function, the synchronization thread can be added, which amounts to extending the functions of the data query engine without modifying its existing logic. If the data query engine already has a data synchronization function, as Impala version 3.2 does, the logic of the synchronization thread in this embodiment differs from that of the related-art data synchronization function, so the synchronization thread of this embodiment may be newly added (for example, named metadata sync thread) to replace the original function. The data synchronization function of Impala 3.2 has to read the whole operation log from the data warehouse and screen it, whereas the synchronization thread of this embodiment only needs to read the data operation information in the log library, which has already been filtered, so the synchronization thread occupies fewer resources.
In one embodiment, the cache of the data query engine is used to store metadata for one or more data groups in the data warehouse, e.g., impala stores metadata for the data warehouse in a cache Metastore of catated. Based on this, step S820 may be implemented by:
And updating metadata of a data group to be updated in a cache of the data query engine according to the acquired data operation information, wherein the data group to be updated is a data group corresponding to the data operation information.
Generally, the data operation information includes a data group identifier, so that after a piece of data operation information is read, the data group it corresponds to can be determined and the metadata of that data group can be updated. For example, the data query engine reads 8 pieces of data operation information from the log library, corresponding to the data groups T2, T3, T4, T4, T6, T2, T7, and T4, respectively. The data query engine can determine that the data groups to be updated are T2, T3, T4, T6, and T7, and updates the metadata of these 5 data groups in the cache without updating the metadata of other data groups. The metadata update is thus better targeted, which helps improve efficiency.
In one embodiment, to further increase efficiency, step S820 may also be implemented by:
combining data operation information corresponding to the same data group to be updated;
and updating the metadata of the data group to be updated in the cache of the data query engine according to the combined data operation information.
In general, the data query engine obtains newly added data operation information from the log library in batches, for example each time the above synchronization trigger condition is met. A batch may contain multiple pieces of information corresponding to the same data group, for example when multiple data operations are performed on a certain data group within a short time. The data query engine may combine the data operation information corresponding to the same data group to be updated. For example, the 8 pieces of data operation information above correspond to the data groups T2, T3, T4, T4, T6, T2, T7, and T4; after combination, 5 pieces of data operation information are obtained, corresponding to the data groups T2, T3, T4, T6, and T7, respectively, and the metadata is then updated. This reduces the number of pieces of data operation information: each piece no longer has to be processed separately, the update is performed per data group, the number of update operations is reduced, and efficiency is further improved.
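The combination step can be sketched as follows. This is an illustrative sketch only: the function name `merge_by_group` and the record fields `group` and `op` are hypothetical, and the batch uses eight sample records in which T2 appears twice and T4 three times.

```python
from typing import Dict, List

Record = Dict[str, str]


def merge_by_group(records: List[Record]) -> Dict[str, List[str]]:
    """Combine operation records so that each data group appears once.

    The result maps each data group identifier to the list of operations
    recorded for it, in the order the groups first appear in the batch.
    """
    merged: Dict[str, List[str]] = {}
    for record in records:
        merged.setdefault(record["group"], []).append(record["op"])
    return merged


batch = [
    {"group": group_id, "op": f"op-{index}"}
    for index, group_id in enumerate(
        ["T2", "T3", "T4", "T4", "T6", "T2", "T7", "T4"]
    )
]
merged = merge_by_group(batch)
groups_to_update = list(merged)  # one entry per data group to be updated
```

Each key of `merged` then drives a single metadata update, rather than one update per record.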
In one embodiment, a data set includes one or more data partitions; the data operation information may be data operation information of a data partition in which a change occurs, including: the method comprises the steps of data group identification, operation type, data partition identification before operation and data partition identification after operation. Based on this, referring to fig. 9, the updating the metadata of the data group to be updated in the cache of the data query engine according to the data operation information may include:
step S910, searching the data group corresponding to the data group identifier in the data warehouse, and acquiring metadata of the data partition corresponding to the operated data partition identifier from the searched data group;
step S920, searching the data group to be updated corresponding to the data group identifier in the cache of the data query engine, and updating the data partition corresponding to the data partition identifier before the operation in the data group to be updated according to the acquired metadata.
For example, at the data warehouse end, an identifier change is performed on data partition S1 in data group T1, changing S1 to S1'. The resulting piece of data operation information includes: data group identifier = T1, operation type = alter partition_name, pre-operation data partition identifier = S1, post-operation data partition identifier = S1'. After acquiring this data operation information, the data query engine looks up T1 in the data warehouse, looks up S1' in T1, and acquires the metadata of S1'; it then returns to its own cache, looks up T1, looks up S1 in T1, and updates the metadata of S1 to the metadata of S1', so that the metadata of the data query engine is consistent with that of the data warehouse. In this update process the data query engine only needs to acquire the metadata of the corresponding data partition to complete the update, which makes the update better targeted and more efficient.
It should be added that, in one embodiment, when the operation type is adding a data partition, the pre-operation data partition identifier is a null value. For example, data partition S1' is newly added to data group T1, and the resulting piece of data operation information includes: data group identifier = T1, operation type = create partition, pre-operation data partition identifier = null, post-operation data partition identifier = S1'. Based on this, in step S920 the data query engine cannot find a data partition corresponding to the pre-operation data partition identifier (null) in the data group to be updated; it may therefore establish a new data partition in the data group to be updated and update the new data partition according to the acquired metadata, so that the metadata of the new data partition is consistent with the metadata of S1'.
By the method of fig. 9, on the basis of simplifying data operation information, it can be ensured that the data query engine effectively completes metadata update, thereby reducing resources required for the data update process.
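Steps S910 and S920, together with the null pre-operation identifier case, can be sketched as below. The sketch is an assumption-laden illustration: the warehouse and the cache are modeled as nested dictionaries, the function name `apply_partition_operation` is hypothetical, and updating S1 to S1' is interpreted here as replacing the cached entry keyed S1 with an entry keyed S1'.

```python
from typing import Dict, Optional

Partitions = Dict[str, Dict[str, int]]  # partition id -> partition metadata
Catalog = Dict[str, Partitions]         # data group id -> its partitions


def apply_partition_operation(
    warehouse: Catalog,
    cache: Catalog,
    group_id: str,
    before: Optional[str],
    after: str,
) -> None:
    # S910: fetch only the metadata of the post-operation partition.
    fresh = dict(warehouse[group_id][after])
    # S920: locate the data group to be updated in the engine cache.
    partitions = cache.setdefault(group_id, {})
    if before is not None:
        # Existing partition: drop the stale entry keyed by the old identifier.
        partitions.pop(before, None)
    # A null pre-operation identifier (None) means a new partition is created.
    partitions[after] = fresh


warehouse: Catalog = {"T1": {"S1'": {"rows": 10}, "S2": {"rows": 5}}}
cache: Catalog = {"T1": {"S1": {"rows": 7}}}

# Identifier change: S1 -> S1'
apply_partition_operation(warehouse, cache, "T1", before="S1", after="S1'")
after_rename = {pid: dict(meta) for pid, meta in cache["T1"].items()}

# Newly added partition: the pre-operation identifier is null
apply_partition_operation(warehouse, cache, "T1", before=None, after="S2")
```

Only one partition's metadata is read from the warehouse per operation, which is the targeted behavior the method of fig. 9 describes.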
In one embodiment, the data processing method may further include the steps of:
metadata in a cache of the data query engine is counted to update metadata statistics.
Among them, metadata statistics include, but are not limited to: statistical information of data partitions, rows and columns in the data group, record number, average size of records and the like. The data query engine can read the metadata in the cache, and perform corresponding statistics to obtain metadata statistical information. Metadata statistics may be stored in a cache of the data query engine or may be stored elsewhere in the data query engine for reference when the data query engine performs data query tasks.
In one embodiment, metadata in a cache of a data query engine may be counted by a statistics thread in the data query engine to update metadata statistics. The statistics thread may be a program component added to the data query engine, for example, the statistics thread of this embodiment may be newly added to Impala (for example, the statistics thread may be named compute stat thread) for performing relevant logic for metadata statistics. Therefore, the relevant logic can be realized by adding the statistical thread into the data query engine and writing the relevant code, which is equivalent to expanding the function of the data query engine, and the data query engine is easy to realize without modifying the logic of the data query engine.
In one embodiment, the data query engine may perform statistics on metadata in a cache of the data query engine when a statistics trigger condition is satisfied. The statistical triggering condition comprises at least one of the following:
the proportion of the data groups to be counted to all metadata in the cache of the data query engine reaches a preset proportion. The metadata statistics of a data group to be counted are stale and inaccurate, so if this proportion is too high, data query tasks are significantly affected. The preset proportion can be set according to actual requirements, for example 30%: as long as the data groups to be counted are kept below 30%, their influence can be considered acceptable;
a preset statistical time is reached. The statistical time can be set according to actual requirements; for example, with a period of 24 hours, a fixed time in every 24 hours can be taken as the statistical time, that is, the data query engine performs metadata statistics once every 24 hours. For offline batch processing scenarios, data is usually produced in the early morning every day, so a time after all data has been produced can be selected for executing metadata statistics; for real-time streaming scenarios, a time of low business load can be used as the statistical time;
and receiving a statistical triggering operation input by a user. The user may manually trigger the data query engine to perform metadata statistics once.
In addition, other statistics triggering conditions can be set according to actual demands, or any plurality of statistics triggering conditions are combined for use, for example, when a preset statistics time is reached, whether the proportion of the to-be-counted data group to all metadata in the cache of the data query engine reaches a preset proportion is checked, and if so, metadata statistics is executed. The present disclosure is not limited in this regard.
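A trigger check of this kind can be sketched as follows. It is a minimal sketch under stated assumptions: the class name `StatsTrigger` and its parameters are hypothetical, the proportion is approximated by a ratio of data group counts, and the three conditions are combined with a simple "any one suffices" rule, which is only one of the combinations the text allows.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StatsTrigger:
    preset_ratio: float = 0.30             # assumed preset proportion (30%)
    period: timedelta = timedelta(hours=24)  # assumed statistical period

    def should_run(
        self,
        pending_groups: int,
        total_groups: int,
        last_run: datetime,
        now: datetime,
        user_requested: bool = False,
    ) -> bool:
        # Condition 3: a user manually triggers the statistics.
        if user_requested:
            return True
        # Condition 1: too large a share of data groups has stale statistics.
        if total_groups > 0 and pending_groups / total_groups >= self.preset_ratio:
            return True
        # Condition 2: the preset statistical time has been reached.
        return now - last_run >= self.period


trigger = StatsTrigger()
t0 = datetime(2021, 1, 22, 3, 0)
one_hour_later = t0 + timedelta(hours=1)

by_ratio = trigger.should_run(4, 10, t0, one_hour_later)
not_yet = trigger.should_run(1, 10, t0, one_hour_later)
by_time = trigger.should_run(1, 10, t0, t0 + timedelta(hours=24))
by_user = trigger.should_run(0, 10, t0, one_hour_later, user_requested=True)
```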
In one embodiment, the statistics on metadata in the cache of the data query engine may be implemented by:
Determining a data set to be counted;
and counting the metadata of the data group to be counted in the cache of the data query engine.
For example, if only the metadata of data groups T2, T3, T4, T6, and T7 has been updated since metadata was last counted, these 5 data groups are determined as the data groups to be counted. Only the metadata of the data groups to be counted is counted and the corresponding metadata statistics are updated; the metadata of data groups that have not been updated need not be counted, which improves statistical efficiency.
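Tracking the data groups to be counted can be sketched with a small "pending set". The class `StatsTracker`, its methods, and the two statistics computed (partition count and record count) are hypothetical choices for illustration.

```python
from typing import Dict, List, Set

# data group id -> partition id -> partition metadata
Cache = Dict[str, Dict[str, Dict[str, int]]]


class StatsTracker:
    """Recomputes statistics only for data groups updated since the last run."""

    def __init__(self) -> None:
        self.pending: Set[str] = set()
        self.stats: Dict[str, Dict[str, int]] = {}

    def mark_updated(self, group_id: str) -> None:
        # Called whenever the metadata of a data group is updated.
        self.pending.add(group_id)

    def compute_pending(self, cache: Cache) -> List[str]:
        counted = sorted(self.pending)
        for group_id in counted:
            partitions = cache[group_id]
            self.stats[group_id] = {
                "partitions": len(partitions),
                "records": sum(p["records"] for p in partitions.values()),
            }
        self.pending.clear()
        return counted


cache: Cache = {
    "T1": {"p0": {"records": 50}},
    "T2": {"p0": {"records": 10}, "p1": {"records": 20}},
    "T3": {"p0": {"records": 5}},
}
tracker = StatsTracker()
tracker.mark_updated("T2")
tracker.mark_updated("T3")

first_run = tracker.compute_pending(cache)   # only T2 and T3 are counted
second_run = tracker.compute_pending(cache)  # nothing changed in between
```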
In one embodiment, after updating the metadata statistics, a data query task may be performed based on the metadata statistics. The metadata statistics help the data query engine optimize the data query task in a CBO (Cost-Based Optimization) manner, improving query efficiency. Taking the case where the data query task is a join query task as an example, referring to fig. 10, performing the data query task based on the metadata statistics may include:
step S1010, determining a plurality of data groups associated with the connection query task in a cache of a data query engine;
Step S1020, determining a driving data group in the plurality of data groups based on metadata statistical information of the plurality of data groups;
in step S1030, the data of the driving data set is broadcast to the plurality of execution nodes of the data query engine, so that the plurality of execution nodes jointly execute the connection query task according to the data of the driving data set.
The driving data group and the driven data group are a pair of relative concepts, analogous to a driving table and a driven table. The driving data group is the data group used as the basis of the loop in the join query task, and the data group matched against it in the loop is the driven data group.
For example, the user performs the query operation "A join B", where A and B are two data tables in the data warehouse, A has 1000 data records and B has 100000 data records (assuming that each data record is of similar size). If the data query engine has not performed metadata statistics, it cannot learn the metadata statistics of A and B and takes B as the driving table (i.e., the driving data group) of the join according to the default order. When the join operation is executed, the data query engine first acquires the data of A and B from the data warehouse and broadcasts the data of B to a plurality of execution nodes, which then jointly execute the join query task; broadcasting the data of B incurs a very high cost. In this embodiment, the data query engine has the metadata statistics of A and B and can determine A as the driving table based on CBO. Referring to fig. 11, the data of A is broadcast to the plurality of execution nodes, and each execution node uses A as the driving table to loop over and match the data of B, thereby completing the join operation. Compared with broadcasting the data of B, the required overhead is greatly reduced.
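The effect of choosing the driving data group from metadata statistics can be illustrated with a simple cost estimate. This is a deliberately simplified sketch, not a real cost-based optimizer: the function names, the average record size of 100 bytes, the count of 4 execution nodes, and the cost model (records × average record size × number of nodes) are all assumptions.

```python
from typing import Dict

Stats = Dict[str, Dict[str, int]]


def estimated_bytes(stats: Stats, group_id: str) -> int:
    entry = stats[group_id]
    return entry["records"] * entry["avg_record_bytes"]


def choose_driving_group(stats: Stats) -> str:
    """Pick the data group that is cheapest to broadcast (the smallest one)."""
    return min(stats, key=lambda group_id: estimated_bytes(stats, group_id))


def broadcast_cost(stats: Stats, group_id: str, nodes: int) -> int:
    """Bytes sent when the whole data group is copied to every execution node."""
    return estimated_bytes(stats, group_id) * nodes


# Metadata statistics for "A join B"; record sizes are assumed to be equal.
stats: Stats = {
    "A": {"records": 1_000, "avg_record_bytes": 100},
    "B": {"records": 100_000, "avg_record_bytes": 100},
}
nodes = 4

driving = choose_driving_group(stats)
cost_a = broadcast_cost(stats, "A", nodes)
cost_b = broadcast_cost(stats, "B", nodes)
```

Under these assumptions, broadcasting A instead of B reduces the broadcast volume by the ratio of the two record counts, a factor of 100.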
In one embodiment, the log repository is further used to store attribute operation information for the synchronization attribute parameters of the data set. Based on this, referring to FIG. 12, the data query engine may also perform the following steps:
step S1210, obtaining attribute operation information from a log library;
step S1220, the data synchronization status of the data set corresponding to the attribute operation information in the data query engine is updated according to the attribute operation information.
The data query engine may record a data synchronization status of each data set, where the data synchronization status is used to indicate whether the data sets need to be information synchronized, for example, the data synchronization status may be recorded in the same manner as the data warehouse, such as a synchronization attribute parameter, a white list, etc., or may record the data synchronization status in a manner different from the data warehouse, for example, the data synchronization status is used as a field in metadata, and is recorded in cached metadata.
The attribute operation information may refer to the content of the above-mentioned fig. 6 part, and will not be described herein. After the data query engine obtains the attribute operation information, the content of the attribute operation information is analyzed, which comprises two cases:
when the attribute operation information is to change the synchronous attribute parameter from a non-preset value to a preset value, the data query engine can update the data synchronous state of the data group corresponding to the attribute operation information in the data query engine from asynchronous to synchronous;
When the attribute operation information is to change the synchronous attribute parameter from the preset value to the non-preset value, the data query engine can update the data synchronous state of the data group corresponding to the attribute operation information in the data query engine from synchronous to asynchronous.
Because the data query engine records the data synchronization state of each data group, it can perform secondary filtering on the data operation information in the log library. This covers the situation in which the filtering mechanism in the data warehouse is abnormal and data operation information of a data group that does not meet the data synchronization condition is stored in the log library, thereby providing a more reliable fault-tolerance mechanism.
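The state update of steps S1210 and S1220 and the secondary filtering can be sketched together. The sketch makes several assumptions: the preset value is taken to be the string "true", the attribute operation record carries hypothetical `old_value` and `new_value` fields, and the synchronization state is a plain dictionary from data group identifier to a boolean.

```python
from typing import Dict, List

PRESET_VALUE = "true"  # assumed preset value of the synchronization attribute


def apply_attribute_operation(
    sync_state: Dict[str, bool], op: Dict[str, str]
) -> None:
    was_preset = op["old_value"] == PRESET_VALUE
    is_preset = op["new_value"] == PRESET_VALUE
    if is_preset and not was_preset:
        # Non-preset -> preset: the data group now needs synchronization.
        sync_state[op["group"]] = True
    elif was_preset and not is_preset:
        # Preset -> non-preset: the data group no longer needs synchronization.
        sync_state[op["group"]] = False


def secondary_filter(
    records: List[Dict[str, str]], sync_state: Dict[str, bool]
) -> List[Dict[str, str]]:
    """Drop records of data groups that the engine does not synchronize."""
    return [r for r in records if sync_state.get(r["group"], False)]


sync_state = {"T1": False, "T2": True}
apply_attribute_operation(
    sync_state, {"group": "T1", "old_value": "false", "new_value": "true"}
)
apply_attribute_operation(
    sync_state, {"group": "T2", "old_value": "true", "new_value": "false"}
)

# T9 is unknown to the engine and T2 was just switched off; both are dropped.
kept = secondary_filter(
    [{"group": "T1"}, {"group": "T2"}, {"group": "T9"}], sync_state
)
```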
Fig. 13 shows a flow architecture of a data processing method implemented based on the Impala data query engine. The Impala service establishes a synchronization thread and a statistics thread:
In response to the synchronization trigger condition being met, the synchronization thread acquires newly added data operation information from the log library; for example, it can read the batch of information of each period according to a first period. As shown in fig. 13, 4 pieces of data operation information corresponding to the data tables T2, T3, T4 and T4 are acquired in the first batch, and 4 pieces corresponding to the data tables T6, T2, T7 and T4 are acquired in the second batch. After acquiring the data operation information of each batch, the synchronization thread can combine the data operation information of the same data table; for example, the two pieces corresponding to T4 in the first batch are combined into one, while the second batch contains no two pieces corresponding to the same data table and therefore needs no combination. The synchronization thread can then update the metadata of the data tables in the cache according to the combined data operation information, so that the metadata cache (Metastore cache) of catalogd is consistent with the metadata in the data warehouse.
In response to the statistics trigger condition being met, the statistics thread performs statistics on the metadata in the cache; for example, according to a second period, it acquires from the synchronization thread the data tables updated since metadata was last counted, determines them as the data tables to be counted, and then reads the metadata of these data tables from the cache and performs statistics on it to update their metadata statistics.
The updated metadata in catalogd can be synchronized to each Impalad node in the Impala cluster through Statestored to ensure global metadata synchronization. The statistics thread can synchronize the metadata statistics to each Impalad node in the Impala cluster so that the Impalad nodes can optimize data query tasks in the CBO manner.
The data operation information described in the present exemplary embodiment may be DDL operation information or DML operation information.
In big data scenarios, DDL operations are the primary data manipulation approach. A DDL operation is generally implemented by file import; for example, when a user adds a batch of data to the data warehouse, the user can first edit a corresponding data file and then import the data file into the data warehouse. Combining the big data scenario and file-import-based DDL operations with the data processing method of the present exemplary embodiment helps reduce the amount of data operation information acquired and the amount written into the log library, thereby reducing the resource overhead required for executing the data processing method at both the data warehouse end and the data query engine end. It can be seen that big data scenarios are well suited to the data processing method of the present exemplary embodiment.
Compared with DDL operations, DML operations generally operate directly on data in the data warehouse, and the data operation information they generate is relatively scattered and large in quantity; in theory, the data processing method of the present exemplary embodiment can still be applied to synchronize the data warehouse information. In one embodiment, to reduce the overhead (in particular, the I/O overhead) caused by a large amount of scattered DML-based data operation information and by the large number of small files that would need to be stored in the system, a cumulative synchronization mechanism may be set: when the data operation information accumulated in the data warehouse meets a certain condition, for example when 1000 pieces have accumulated, the 1000 pieces of data operation information are processed as a batch, for example by performing the data processing method shown in fig. 4 on them.
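The cumulative synchronization mechanism can be sketched as a buffer that hands records on only in full batches. The class name `AccumulatingSync` and the callback `process_batch` are hypothetical; the threshold of 1000 follows the example in the text.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class AccumulatingSync:
    """Buffers DML operation records and processes them in fixed-size batches."""

    def __init__(
        self, threshold: int, process_batch: Callable[[List[Record]], None]
    ) -> None:
        self._threshold = threshold
        self._process_batch = process_batch
        self._pending: List[Record] = []

    def add(self, record: Record) -> None:
        self._pending.append(record)
        if len(self._pending) >= self._threshold:
            # Hand over the accumulated records as one batch and start afresh.
            batch, self._pending = self._pending, []
            self._process_batch(batch)

    def pending_count(self) -> int:
        return len(self._pending)


batch_sizes: List[int] = []
accumulator = AccumulatingSync(1000, lambda batch: batch_sizes.append(len(batch)))
for index in range(2500):
    accumulator.add({"group": "T1", "op": f"insert-{index}"})
```

With 2500 records, two batches of 1000 are processed and 500 records remain buffered until the threshold is reached again; a real system would probably also flush on a timer so that a partial batch does not wait indefinitely.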
Fig. 14 shows an architecture diagram of a data processing flow based on Hive and Impala. When a user performs DDL operations on data in Hive, the Hive server persists the data operation information to HMS; for example, the user performs 10 DDL operations in one session, generating 10 pieces of data operation information corresponding to the data tables T1, T2, T3, T4, T1, T4, T6, T2, T7 and T4, respectively. The log writing thread then screens out the data operation information of the data tables that meet the data synchronization condition, namely T2, T3, T4, T4, T6, T2, T7 and T4, and writes it into the log library. The synchronization thread of Impala reads the data operation information from the log library in batches, combines the data operation information corresponding to the same data table in each batch, and updates the metadata in the cache according to the combined data operation information. The statistics thread of Impala acquires from the synchronization thread the data tables updated since metadata was last counted, namely the data tables to be counted, reads their metadata from the cache and performs statistics on it to update their metadata statistics. The updated metadata and metadata statistics may be synchronized to each Impalad node in the Impala cluster. When a user initiates a query request to Impala, Impala can execute the data query task based on the cached metadata and metadata statistics so as to obtain the data query result quickly.
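The screening performed by the log writing thread at the data warehouse end can be sketched as a simple filter. The sketch assumes that the data synchronization condition is expressed as a white list of data tables; the function name `select_for_log_library` and the record fields are hypothetical and do not correspond to any Hive API.

```python
from typing import Dict, List, Set

Record = Dict[str, str]


def select_for_log_library(
    operations: List[Record], whitelist: Set[str]
) -> List[Record]:
    """Keep only operation records of tables that meet the sync condition."""
    return [op for op in operations if op["group"] in whitelist]


# Ten DDL operation records persisted in one session.
session = ["T1", "T2", "T3", "T4", "T1", "T4", "T6", "T2", "T7", "T4"]
operations = [{"group": table, "type": "DDL"} for table in session]

# Assumed white list: T1 does not need to be synchronized to the query engine.
whitelist = {"T2", "T3", "T4", "T6", "T7"}
written = select_for_log_library(operations, whitelist)
written_tables = [op["group"] for op in written]
```

Because filtering happens before the log library is written, the query engine never has to read or discard the records of T1.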
Exemplary apparatus
A data processing apparatus according to an exemplary embodiment of the present disclosure is described below with reference to fig. 15 and 16.
Fig. 15 shows a data processing apparatus 1500 provided in a data warehouse, which may include:
an operation information acquisition module 1510 configured to acquire data operation information for one or more data groups in the data warehouse;
an operation information writing module 1520 configured to write data operation information of the data group satisfying the data synchronization condition into the log library, so that the data query engine updates the cache data of the data query engine by acquiring the data operation information in the log library.
In one embodiment, the operation information writing module 1520 is configured to:
acquiring the synchronous attribute parameters of the one or more data sets;
and writing the data operation information of the data group with the synchronous attribute parameter being a preset value into a log library.
In one embodiment, the operation information acquisition module 1510 is configured to acquire attribute operation information for the synchronization attribute parameter;
an operation information writing module 1520 configured to write the attribute operation information into the log library so that the data query engine obtains the attribute operation information in the log library.
In one embodiment, the operation information writing module 1520 is configured to write the attribute operation information to the log library when the attribute operation information is to change the synchronization attribute parameter from a non-preset value to a preset value.
In one embodiment, the operation information writing module 1520 is configured to write the data operation information of the data group located on the white list to the log library.
In one embodiment, the operation information acquisition module 1510 is configured to acquire data operation information for one or more data groups in the data warehouse from a metadata storage component of the data warehouse.
In one embodiment, the operation information writing module 1520 is configured to write the data operation information of the data group satisfying the data synchronization condition to the log library through the log writing thread in the data warehouse.
In one embodiment, a data set includes one or more data partitions;
an operation information writing module 1520 configured to:
determining the changed data partition in the data group meeting the data synchronization condition;
and writing the data operation information of the changed data partition into a log library.
In one embodiment, the data operation information of the changed data partition includes: the method comprises the steps of data group identification, operation type, data partition identification before operation and data partition identification after operation.
In one embodiment, the data manipulation information includes DDL-based data manipulation information.
FIG. 16 illustrates a data processing apparatus 1600 provided with a data query engine, which may include:
an operation information acquisition module 1610 configured to acquire data operation information from a log library; the log library is used for storing data operation information aiming at a data group meeting data synchronization conditions in the data warehouse;
the cache data updating module 1620 is configured to update the cache data of the data query engine according to the acquired data operation information.
In one embodiment, a cache of a data query engine is used to store metadata for one or more data groups in a data warehouse;
the cache data updating module 1620 is configured to update metadata of a data group to be updated in a cache of the data query engine according to the data operation information, where the data group to be updated is a data group corresponding to the data operation information.
In one embodiment, the cache data update module 1620 is configured to:
combining data operation information corresponding to the same data group to be updated;
and updating the metadata of the data group to be updated in the cache of the data query engine according to the combined data operation information.
In one embodiment, a data set includes one or more data partitions;
the data operation information includes: the method comprises the steps of data group identification, operation type, data partition identification before operation and data partition identification after operation;
a cache data update module 1620 configured to:
searching a data group corresponding to the data group identifier in the data warehouse, and acquiring metadata of a data partition corresponding to the operated data partition identifier from the searched data group;
searching a data group to be updated corresponding to the data group identifier in a cache of the data query engine, and updating a data partition corresponding to the data partition identifier before operation in the data group to be updated according to the acquired metadata.
In one embodiment, when the operation type is a newly added data partition, the data partition before the operation is identified as a null value;
the cache data updating module 1620 is configured to establish a new data partition in the data group to be updated, and update the new data partition according to the acquired metadata.
In one embodiment, data processing apparatus 1600 further comprises:
and the metadata statistics module is configured to count metadata in a cache of the data query engine so as to update metadata statistics information.
In one embodiment, the metadata statistics module is configured to:
determining a data set to be counted, wherein the data set to be counted is a data set updated after the last time of counting metadata;
and counting the metadata of the data group to be counted in the cache of the data query engine.
In one embodiment, the metadata statistics module is configured to perform statistics on metadata in a cache of the data query engine when a statistics trigger condition is satisfied.
In one embodiment, the statistical trigger condition includes at least one of: the proportion of the data groups to be counted to all metadata in the cache of the data query engine reaches a preset proportion; a preset statistical time is reached; a statistical triggering operation input by a user is received.
In one embodiment, the metadata statistics module is configured to perform statistics on metadata in a cache of the data query engine through a statistics thread in the data query engine to update metadata statistics.
In one embodiment, data processing apparatus 1600 further comprises:
and the query task processing module is configured to execute a data query task based on the metadata statistical information.
In one embodiment, the data query task includes a connection query task;
A query task processing module configured to:
determining a plurality of data groups associated with the connection query task in a cache of the data query engine;
determining a driving data group from the plurality of data groups based on metadata statistics of the plurality of data groups;
broadcasting the data of the driving data group to a plurality of execution nodes of the data query engine, so that the plurality of execution nodes jointly execute the connection query task according to the data of the driving data group.
In one embodiment, the operation information acquisition module 1610 is configured to read the newly added data operation information in the log library when the synchronization trigger condition is satisfied.
In one embodiment, the synchronization trigger condition includes at least one of: the newly added data operation information in the log library reaches a preset quantity; a preset synchronization time is reached; a synchronization trigger operation input by a user is received.
In one embodiment, the log library is further used to store attribute operation information for the synchronization attribute parameters of the data set;
an operation information acquisition module 1610 configured to acquire attribute operation information from a log library;
the cache data updating module 1620 is configured to update the data synchronization state of the data group corresponding to the attribute operation information in the data query engine according to the attribute operation information.
In one embodiment, the cache data update module 1620 is configured to update the data synchronization status of the data group corresponding to the attribute operation information from asynchronous to synchronous in the data query engine when the attribute operation information is to change the synchronization attribute parameter from a non-preset value to a preset value.
In one embodiment, the cache data update module 1620 is configured to load data operation information via a synchronization thread in the data query engine and update cache data of the data query engine according to the data operation information.
In addition, other specific details of the embodiments of the present disclosure are described in the foregoing embodiments of the method, and are not described herein.
Exemplary storage Medium
A storage medium according to an exemplary embodiment of the present disclosure is described below.
In the present exemplary embodiment, the above-described data processing method may be implemented by a program product, which may, for example, take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Exemplary electronic device
An electronic device of an exemplary embodiment of the present disclosure is described with reference to fig. 17.
The electronic device 1700 shown in fig. 17 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 17, the electronic device 1700 is in the form of a general purpose computing device. The components of electronic device 1700 may include, but are not limited to: at least one processing unit 1710, at least one storage unit 1720, a bus 1730 connecting the different system components (including the storage unit 1720 and the processing unit 1710), an input/output (I/O) interface 1740, and a network adapter 1750.
The storage unit 1720 stores program code that can be executed by the processing unit 1710, such that the processing unit 1710 performs the steps according to various exemplary embodiments of the present disclosure described in the above "exemplary method" section of the present specification. For example, the processing unit 1710 may perform the method steps shown in fig. 4, etc.
Storage unit 1720 may include volatile storage units such as random access storage unit (RAM) 1721 and/or cache storage unit 1722, and may further include read only storage unit (ROM) 1723.
Storage unit 1720 may also include a program/utility 1724 having a set (at least one) of program modules 1725, such program modules 1725 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 1730 may include a data bus, an address bus, and a control bus.
The electronic device 1700 may also communicate with one or more external devices 1800 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.) via an input/output interface 1740. Electronic device 1700 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, for example, the Internet, through network adapter 1750. As shown, network adapter 1750 communicates with other modules of electronic device 1700 via bus 1730. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 1700, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
It should be noted that while several modules or sub-modules of the apparatus are mentioned in the detailed description above, such partitioning is merely exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.
Furthermore, although the operations of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order or that all of the illustrated operations must be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that this disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects imply that features in these aspects cannot be combined to advantage; the division is made for convenience of description only. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (56)

1. A data processing method performed by a data warehouse, the method comprising:
acquiring data operation information for one or more data groups in the data warehouse;
writing data operation information of the data group meeting the data synchronization condition into a log library so that a data query engine updates cache data of the data query engine by acquiring the data operation information in the log library;
wherein the writing of the data operation information of the data group meeting the data synchronization condition into the log library comprises:
acquiring synchronization attribute parameters of the one or more data groups;
writing the data operation information of the data group whose synchronization attribute parameter is a preset value into the log library; the synchronization attribute parameter being the preset value indicates that the information of the data group needs to be synchronized to the data query engine; the synchronization attribute parameter being a non-preset value indicates that the information of the data group does not need to be synchronized to the data query engine;
the method further comprises:
acquiring attribute operation information for the synchronization attribute parameter;
writing the attribute operation information into the log library so that the data query engine updates the data synchronization state in the data query engine by acquiring the attribute operation information in the log library; when the data operation information filtering mechanism in the data warehouse is abnormal, the data operation information of the data groups which do not meet the data synchronization condition is stored in the log library, and the data operation information is filtered by the data query engine according to the synchronization attribute parameters of the data groups, or is filtered twice by the data warehouse and the data query engine.
2. The method of claim 1, wherein the synchronization attribute parameter is provided in the data group.
3. The method of claim 1, wherein the attribute operation information comprises: changing the synchronization attribute parameter from a non-preset value to the preset value, or changing the synchronization attribute parameter from the preset value to a non-preset value.
4. The method of claim 1, wherein the synchronization attribute parameter is provided in header information of the data group.
5. The method of claim 1, wherein writing data operation information of the data group satisfying a data synchronization condition to a log library comprises:
writing the data operation information of a data group on a whitelist into the log library.
6. The method of claim 1, wherein the obtaining data manipulation information for one or more data groups in the data warehouse comprises:
the data manipulation information for one or more data groups in the data warehouse is obtained from a metadata storage component of the data warehouse.
7. The method of claim 1, wherein writing data operation information of the data group satisfying a data synchronization condition to a log library comprises:
writing the data operation information of the data group meeting the data synchronization condition into the log library through a log writing thread in the data warehouse.
8. The method of claim 1, wherein the data group comprises one or more data partitions;
the writing the data operation information of the data group meeting the data synchronization condition into a log library comprises the following steps:
determining the changed data partition in the data group meeting the data synchronization condition;
and writing the data operation information of the changed data partition into the log library.
9. The method of claim 8, wherein the data operation information of the changed data partition comprises: a data group identifier, an operation type, a pre-operation data partition identifier, and a post-operation data partition identifier.
10. The method of any one of claims 1 to 9, wherein the data manipulation information comprises data manipulation information based on a data definition language.
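By way of illustration only, the warehouse-side filtering recited in claims 1 to 10 could be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the disclosure: the names `OperationRecord`, `write_to_log`, `SYNC_ENABLED`, and the operation type strings are hypothetical, the log library is modeled as an in-memory list, and the idea that an attribute operation immediately updates the warehouse's own view of the synchronization attribute is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical "preset value" of the synchronization attribute parameter.
SYNC_ENABLED = "true"


@dataclass(frozen=True)
class OperationRecord:
    group_id: str                           # data group identifier
    op_type: str                            # e.g. "ADD_PARTITION" or "SET_SYNC" (assumed names)
    partition_before: Optional[str] = None  # pre-operation data partition identifier
    partition_after: Optional[str] = None   # post-operation data partition identifier
    sync_value: Optional[str] = None        # only set for attribute operations


def write_to_log(log, sync_attrs, operations):
    """Append to `log` the operations that should reach the query engine.

    `sync_attrs` maps a data group identifier to its synchronization
    attribute value. Returns the number of records written.
    """
    written = 0
    for op in operations:
        if op.op_type == "SET_SYNC":
            # Attribute operations are logged so that the query engine can
            # update its data synchronization state (assumed to be unfiltered).
            sync_attrs[op.group_id] = op.sync_value
            log.append(op)
            written += 1
        elif sync_attrs.get(op.group_id) == SYNC_ENABLED:
            # Data operations are logged only for groups marked for sync.
            log.append(op)
            written += 1
    return written
```

In this sketch, a data operation on a group whose attribute is not the preset value is simply skipped, while a later attribute operation on the same group turns logging on for subsequent operations.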
11. A data processing method performed by a data query engine, the method comprising:
acquiring data operation information from a log library; wherein the data operation information of a data group meeting a data synchronization condition is written into the log library by a data warehouse; the data group meeting the data synchronization condition comprises: a data group whose synchronization attribute parameter is a preset value; the synchronization attribute parameter being the preset value indicates that the information of the data group needs to be synchronized to the data query engine; the synchronization attribute parameter being a non-preset value indicates that the information of the data group does not need to be synchronized to the data query engine;
updating cache data of the data query engine according to the data operation information;
wherein the log library is also used for storing attribute operation information for the synchronization attribute parameter of the data group; the method further comprises:
acquiring the attribute operation information from the log library;
updating the data synchronization state of the data group corresponding to the attribute operation information in the data query engine according to the attribute operation information; when the data operation information filtering mechanism in the data warehouse is abnormal, the data operation information of the data groups which do not meet the data synchronization condition is stored in the log library, and the data operation information is filtered by the data query engine according to the synchronization attribute parameters of the data groups, or is filtered twice by the data warehouse and the data query engine.
12. The method of claim 11, wherein the cache of the data query engine is used to store metadata for one or more data groups in the data warehouse;
the updating the cache data of the data query engine according to the data operation information comprises the following steps:
and updating metadata of a data group to be updated in a cache of the data query engine according to the data operation information, wherein the data group to be updated is the data group corresponding to the data operation information.
13. The method of claim 12, wherein updating metadata of a data group to be updated in a cache of the data query engine according to the data operation information comprises:
combining the data operation information corresponding to the same data group to be updated;
and updating the metadata of the data group to be updated in the cache of the data query engine according to the combined data operation information.
14. The method of claim 12, wherein the data group comprises one or more data partitions; the data operation information comprises: a data group identifier, an operation type, a pre-operation data partition identifier, and a post-operation data partition identifier;
the updating the metadata of the data group to be updated in the cache of the data query engine according to the data operation information comprises:
searching the data warehouse for the data group corresponding to the data group identifier, and acquiring, from the found data group, metadata of the data partition corresponding to the post-operation data partition identifier;
searching the cache of the data query engine for the data group to be updated corresponding to the data group identifier, and updating, according to the acquired metadata, the data partition in the data group to be updated that corresponds to the pre-operation data partition identifier.
15. The method of claim 14, wherein when the operation type is adding a data partition, the pre-operation data partition identifier is a null value;
the updating the data partition corresponding to the pre-operation data partition identifier in the data group to be updated according to the acquired metadata comprises:
establishing a new data partition in the data group to be updated, and updating the new data partition according to the acquired metadata.
16. The method of claim 11, wherein the method further comprises:
collecting statistics on the metadata in the cache of the data query engine to update metadata statistics information.
17. The method of claim 16, wherein the collecting statistics on the metadata in the cache of the data query engine comprises:
determining a data group to be counted, wherein the data group to be counted is a data group updated after the last collection of metadata statistics;
collecting statistics on the metadata of the data group to be counted in the cache of the data query engine.
18. The method of claim 16, wherein the collecting statistics on the metadata in the cache of the data query engine comprises:
when a statistics trigger condition is met, collecting statistics on the metadata in the cache of the data query engine.
19. The method of claim 18, wherein the statistics trigger condition comprises at least one of: the proportion of the data groups to be counted among all metadata in the cache of the data query engine reaches a preset proportion, a preset statistics time is reached, and a statistics trigger operation input by a user is received; the data group to be counted is a data group updated after the last collection of metadata statistics.
20. The method of claim 16, wherein the collecting statistics on the metadata in the cache of the data query engine to update metadata statistics information comprises:
collecting statistics on the metadata in the cache of the data query engine through a statistics thread in the data query engine, so as to update the metadata statistics information.
21. The method of claim 16, wherein after updating the metadata statistics information, the method further comprises:
executing a data query task based on the metadata statistics information.
22. The method of claim 21, wherein the data query task comprises a join query task; the executing the data query task based on the metadata statistics information comprises:
determining, in the cache of the data query engine, a plurality of data groups associated with the join query task;
determining a driving data group among the plurality of data groups based on the metadata statistics information of the plurality of data groups;
broadcasting the data of the driving data group to a plurality of execution nodes of the data query engine, so that the plurality of execution nodes jointly execute the join query task according to the data of the driving data group.
23. The method of claim 11, wherein the acquiring data operation information from the log library comprises:
when a synchronization trigger condition is met, reading the newly added data operation information in the log library.
24. The method of claim 23, wherein the synchronization trigger condition comprises at least one of: the newly added data operation information in the log library reaches a preset quantity, a preset synchronization time is reached, and a synchronization trigger operation input by a user is received.
25. The method of claim 11, wherein the data synchronization status is recorded in cached metadata of the data query engine.
26. The method according to claim 11, wherein updating the data synchronization state of the data group corresponding to the attribute operation information in the data query engine according to the attribute operation information comprises:
when the attribute operation information is changing the synchronization attribute parameter from a non-preset value to the preset value, updating the data synchronization state of the data group corresponding to the attribute operation information in the data query engine from unsynchronized to synchronized.
27. The method of claim 11, wherein updating the cached data of the data query engine based on the data operation information comprises:
loading the data operation information through a synchronization thread in the data query engine, and updating cache data of the data query engine according to the data operation information.
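By way of illustration only, the query-engine side of claims 11 to 15 could be sketched as follows. This is a minimal sketch, not the implementation of the disclosure: the names `Op`, `apply_log`, `fetch_partition_metadata`, and the operation type strings are hypothetical, the "combining" of operations for the same data group is reduced to simple grouping, and treating a missing post-operation identifier as a partition removal is an assumption. The second filter reflects the case in which the warehouse-side filtering mechanism is abnormal and records of non-synchronized groups reach the log library.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Op(NamedTuple):
    group_id: str                           # data group identifier
    op_type: str                            # e.g. "ADD_PARTITION", "SET_SYNC" (assumed names)
    partition_before: Optional[str] = None  # pre-operation partition identifier
    partition_after: Optional[str] = None   # post-operation partition identifier
    sync_enabled: Optional[bool] = None     # only set for attribute operations


def apply_log(cache, sync_state, new_ops, fetch_partition_metadata):
    """Apply newly read log records to the engine's metadata cache.

    cache:      {group_id: {partition_id: metadata}}
    sync_state: {group_id: bool}, the engine's data synchronization state
    fetch_partition_metadata(group_id, partition_id) returns the partition
    metadata looked up in the data warehouse.
    Returns the sorted identifiers of the data groups that were updated.
    """
    grouped = defaultdict(list)
    for op in new_ops:
        if op.op_type == "SET_SYNC":
            # Attribute operation: update the data synchronization state.
            sync_state[op.group_id] = bool(op.sync_enabled)
        elif sync_state.get(op.group_id, False):
            # Second filter on the engine side; records of groups that are
            # not marked for synchronization are dropped here.
            grouped[op.group_id].append(op)

    for group_id, ops in grouped.items():
        partitions = cache.setdefault(group_id, {})
        for op in ops:
            if op.partition_before is not None:
                partitions.pop(op.partition_before, None)
            if op.partition_after is not None:
                # A null pre-operation identifier means a new partition.
                partitions[op.partition_after] = fetch_partition_metadata(
                    group_id, op.partition_after)
    return sorted(grouped)
```

Grouping the records per data group before touching the cache means each data group to be updated is looked up in the cache once per synchronization round rather than once per record.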
28. A data processing apparatus, disposed in a data warehouse, the apparatus comprising:
an operation information acquisition module configured to acquire data operation information for one or more data groups in the data warehouse;
an operation information writing module configured to write data operation information of the data group meeting a data synchronization condition into a log library, so that a data query engine updates cache data of the data query engine by acquiring the data operation information in the log library;
wherein the operation information writing module is configured to:
acquire synchronization attribute parameters of the one or more data groups;
write the data operation information of the data group whose synchronization attribute parameter is a preset value into the log library; the synchronization attribute parameter being the preset value indicates that the information of the data group needs to be synchronized to the data query engine; the synchronization attribute parameter being a non-preset value indicates that the information of the data group does not need to be synchronized to the data query engine;
the operation information acquisition module is further configured to acquire attribute operation information for the synchronization attribute parameter;
the operation information writing module is configured to write the attribute operation information into the log library so that the data query engine updates the data synchronization state in the data query engine by acquiring the attribute operation information in the log library; when the data operation information filtering mechanism in the data warehouse is abnormal, the data operation information of the data groups which do not meet the data synchronization condition is stored in the log library, and the data operation information is filtered by the data query engine according to the synchronization attribute parameters of the data groups, or is filtered twice by the data warehouse and the data query engine.
29. The apparatus of claim 28, wherein the synchronization attribute parameter is provided in the data group.
30. The apparatus of claim 28, wherein the attribute operation information comprises: changing the synchronization attribute parameter from a non-preset value to the preset value, or changing the synchronization attribute parameter from the preset value to a non-preset value.
31. The apparatus of claim 28, wherein the synchronization attribute parameter is provided in header information of the data group.
32. The apparatus of claim 28, wherein the operation information writing module is configured to write the data operation information of a data group on a whitelist into the log library.
33. The apparatus of claim 28, wherein the operation information acquisition module is configured to acquire the data operation information for one or more data groups in the data warehouse from a metadata storage component of the data warehouse.
34. The apparatus of claim 28, wherein the operation information writing module is configured to write data operation information of the data group satisfying a data synchronization condition to a log library by a log writing thread in the data warehouse.
35. The apparatus of claim 28, wherein the data group comprises one or more data partitions;
the operation information writing module is configured to:
determine the changed data partition in the data group meeting the data synchronization condition;
write the data operation information of the changed data partition into the log library.
36. The apparatus of claim 35, wherein the data operation information of the changed data partition comprises: a data group identifier, an operation type, a pre-operation data partition identifier, and a post-operation data partition identifier.
37. The apparatus of any one of claims 28 to 36, wherein the data manipulation information comprises data manipulation information based on a data definition language.
38. A data processing apparatus, disposed in a data query engine, the apparatus comprising:
an operation information acquisition module configured to acquire data operation information from a log library; wherein the data operation information of a data group meeting a data synchronization condition is written into the log library by a data warehouse; the data group meeting the data synchronization condition comprises: a data group whose synchronization attribute parameter is a preset value; the synchronization attribute parameter being the preset value indicates that the information of the data group needs to be synchronized to the data query engine; the synchronization attribute parameter being a non-preset value indicates that the information of the data group does not need to be synchronized to the data query engine;
a cache data updating module configured to update cache data of the data query engine according to the data operation information;
wherein the log library is also used for storing attribute operation information for the synchronization attribute parameter of the data group;
the operation information acquisition module is configured to acquire the attribute operation information from the log library;
the cache data updating module is configured to update the data synchronization state of the data group corresponding to the attribute operation information in the data query engine according to the attribute operation information; when the data operation information filtering mechanism in the data warehouse is abnormal, the data operation information of the data groups which do not meet the data synchronization condition is stored in the log library, and the data operation information is filtered by the data query engine according to the synchronization attribute parameters of the data groups, or is filtered twice by the data warehouse and the data query engine.
39. The apparatus of claim 38, wherein the cache of the data query engine is to store metadata for one or more data groups in the data warehouse;
and the cache data updating module is configured to update metadata of a data group to be updated in a cache of the data query engine according to the data operation information, wherein the data group to be updated is the data group corresponding to the data operation information.
40. The apparatus of claim 39, wherein the cache data update module is configured to:
combine the data operation information corresponding to the same data group to be updated;
update the metadata of the data group to be updated in the cache of the data query engine according to the combined data operation information.
41. The apparatus of claim 39, wherein the data group comprises one or more data partitions; the data operation information comprises: a data group identifier, an operation type, a pre-operation data partition identifier, and a post-operation data partition identifier;
the cache data updating module is configured to:
search the data warehouse for the data group corresponding to the data group identifier, and acquire, from the found data group, metadata of the data partition corresponding to the post-operation data partition identifier;
search the cache of the data query engine for the data group to be updated corresponding to the data group identifier, and update, according to the acquired metadata, the data partition in the data group to be updated that corresponds to the pre-operation data partition identifier.
42. The apparatus of claim 41, wherein when the operation type is adding a data partition, the pre-operation data partition identifier is a null value;
the cache data updating module is configured to establish a new data partition in the data group to be updated, and update the new data partition according to the acquired metadata.
43. The apparatus of claim 38, wherein the apparatus further comprises:
a metadata statistics module configured to collect statistics on the metadata in the cache of the data query engine so as to update metadata statistics information.
44. The apparatus of claim 43, wherein the metadata statistics module is configured to:
determine a data group to be counted, wherein the data group to be counted is a data group updated after the last collection of metadata statistics;
collect statistics on the metadata of the data group to be counted in the cache of the data query engine.
45. The apparatus of claim 43, wherein the metadata statistics module is configured to collect statistics on the metadata in the cache of the data query engine when a statistics trigger condition is satisfied.
46. The apparatus of claim 45, wherein the statistics trigger condition comprises at least one of: the proportion of the data groups to be counted among all metadata in the cache of the data query engine reaches a preset proportion, a preset statistics time is reached, and a statistics trigger operation input by a user is received; the data group to be counted is a data group updated after the last collection of metadata statistics.
47. The apparatus of claim 43, wherein the metadata statistics module is configured to collect statistics on the metadata in the cache of the data query engine through a statistics thread in the data query engine, so as to update the metadata statistics information.
48. The apparatus of claim 43, further comprising:
a query task processing module configured to execute a data query task based on the metadata statistics information.
49. The apparatus of claim 48, wherein the data query task comprises a join query task; the query task processing module is configured to:
determine, in the cache of the data query engine, a plurality of data groups associated with the join query task;
determine a driving data group among the plurality of data groups based on the metadata statistics information of the plurality of data groups;
broadcast the data of the driving data group to a plurality of execution nodes of the data query engine, so that the plurality of execution nodes jointly execute the join query task according to the data of the driving data group.
50. The apparatus of claim 38, wherein the operation information acquisition module is configured to read the newly added data operation information in the log library when a synchronization trigger condition is satisfied.
51. The apparatus of claim 50, wherein the synchronization trigger condition comprises at least one of: the newly added data operation information in the log library reaches a preset quantity, a preset synchronization time is reached, and a synchronization trigger operation input by a user is received.
52. The apparatus of claim 38, wherein the data synchronization status is recorded in cached metadata of the data query engine.
53. The apparatus of claim 38, wherein the cache data updating module is configured to, when the attribute operation information is changing the synchronization attribute parameter from a non-preset value to the preset value, update the data synchronization state of the data group corresponding to the attribute operation information in the data query engine from unsynchronized to synchronized.
54. The apparatus of claim 38, wherein the cache data update module is configured to load the data manipulation information via a synchronization thread in the data query engine and update cache data of the data query engine based on the data manipulation information.
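By way of illustration only, the statistics trigger condition, the incremental statistics collection, and the selection of the data group to broadcast for a join (connection) query, as recited for the metadata statistics module and the query task processing module, could be sketched as follows. This is a minimal sketch under stated assumptions: all function names, thresholds, and the `rows` metadata field are hypothetical, and choosing the data group with the fewest rows as the one to broadcast is an assumed heuristic; the claims only state that the choice is based on metadata statistics.

```python
def should_collect(dirty_groups, all_groups, ratio_threshold=0.2,
                   seconds_since_last=0.0, interval_seconds=3600.0,
                   user_requested=False):
    """Return True if any assumed statistics trigger condition holds."""
    if user_requested or seconds_since_last >= interval_seconds:
        return True
    if not all_groups:
        return False
    # Proportion of data groups updated since the last statistics run.
    return len(dirty_groups) / len(all_groups) >= ratio_threshold


def collect_stats(cache, stats, dirty_groups):
    """Refresh statistics only for groups updated since the last run.

    cache: {group_id: {partition_id: {"rows": int, ...}}}
    stats: {group_id: {"partitions": int, "rows": int}}, updated in place.
    """
    for group_id in dirty_groups:
        partitions = cache.get(group_id, {})
        stats[group_id] = {
            "partitions": len(partitions),
            "rows": sum(p.get("rows", 0) for p in partitions.values()),
        }
    return stats


def choose_group_to_broadcast(stats, join_groups):
    """Pick the data group to broadcast; here, the one with the fewest rows."""
    return min(join_groups, key=lambda g: stats[g]["rows"])
```

Broadcasting the smaller side of a join keeps the amount of data copied to every execution node low, which is why up-to-date metadata statistics matter for this decision.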
55. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the method of any of claims 1 to 27.
56. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any one of claims 1-27 via execution of the executable instructions.
CN202110088454.9A 2021-01-22 2021-01-22 Data processing method, data processing device, storage medium and electronic equipment Active CN112817989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110088454.9A CN112817989B (en) 2021-01-22 2021-01-22 Data processing method, data processing device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112817989A CN112817989A (en) 2021-05-18
CN112817989B true CN112817989B (en) 2023-07-25

Family

ID=75858919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110088454.9A Active CN112817989B (en) 2021-01-22 2021-01-22 Data processing method, data processing device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112817989B (en)

US11727063B2 (en) Parallel partition-wise insert sub-select
JP5810982B2 (en) SEARCH DEVICE, SEARCH METHOD, AND SEARCH PROGRAM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant