CN111209321A

CN111209321A - Grouping data mart method for complex query

Info

Publication number: CN111209321A
Application number: CN201911354878.4A
Authority: CN
Inventors: Not disclosed
Original assignee: Beijing Yonghong Tech Co ltd
Current assignee: Beijing Yonghong Tech Co ltd
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2020-05-29

Abstract

The invention discloses a method for grouped import into a data mart for complex queries. When data is extracted from a data source into a data mart, complex extraction requirements may produce complex queries against the data source; the invention provides a method that accelerates grouped import in this situation. First, a feature analysis of the grouping column of the source data determines whether grouping is feasible; then the data is split according to the distinct values of the grouping column; finally, the split data is imported into the data mart. To speed up the import and reduce memory pressure, data is processed in units of 'blocks', and data splitting and data import proceed concurrently.

Description

Grouping data mart method for complex query

Technical Field

The invention relates to the technical field of data mart storage, and in particular to a method for grouped import into a data mart for complex queries.

Background

Modern information technology has entered the big data era. Quickly building data storage that meets the needs of specific users and departments has become an urgent problem for data centers. A data mart is a data cube, oriented toward decision analysis, that is extracted from a data warehouse, a database, or other enterprise data sources and stored in columnar form. It includes dimensions, dimension hierarchies, the metrics to be computed, and so on. When massive source data of various kinds is imported into a data mart, complex extraction requirements may reduce query performance by orders of magnitude. The performance problem of importing the results of complex queries, which arise from complex data extraction, into a data mart therefore urgently needs to be solved.

Importing source data into a data mart can generally be divided into three steps. First, a connection is made to the data source, the source data is queried, extracted, and processed, and the query result is loaded into memory. Second, the data in memory is compressed column by column and reorganized into columnar storage. Finally, data block files are generated and distributed to the data mart nodes for storage.
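As a non-limiting illustration, the second of these steps (reorganizing in-memory rows into columns and compressing each column) can be sketched in Python as follows. The function names, the JSON serialization, and the use of `zlib` are assumptions made for this sketch; the disclosure does not name a serialization format or a compression algorithm.

```python
import json
import zlib


def rows_to_columns(rows, column_names):
    """Reorganize row tuples into one list per column (columnar layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


def compress_columns(columns):
    """Compress each column independently (zlib is an assumed stand-in codec)."""
    return {
        name: zlib.compress(json.dumps(values).encode("utf-8"))
        for name, values in columns.items()
    }


def decompress_column(blob):
    """Inverse of the per-column compression above."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


rows = [("north", 10), ("north", 12), ("south", 7)]
columns = rows_to_columns(rows, ["region", "amount"])
block = compress_columns(columns)
```

Storing each column separately places similar values next to each other, which is the usual motivation for compressing per column rather than per row.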

Because requirements are diverse and complex, the extraction process may translate into a complex query against the source data. The traditional approach, however, handles grouped import of a complex query no differently from grouped import of an ordinary query: it first queries the distinct values of the grouping column, and then adds each distinct value as a filter condition to a separate query of the source data. This multiplies the number of complex queries, which can severely slow the import into the data mart and may even affect the stability of the whole system.
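As a non-limiting illustration of why the traditional approach is costly, the following Python sketch builds the queries it would issue: one query to discover the distinct values, plus one filtered copy of the complex query per value. The function name, the SQL wrapping style, and the sample table `orders` are assumptions made for this sketch only.

```python
def traditional_import_queries(complex_sql, group_col, group_values):
    """Return every query the traditional approach would send to the source.

    group_values stands for the result of the first (discovery) query and is
    assumed here to contain strings.
    """
    discover = f"SELECT DISTINCT t.{group_col} FROM ({complex_sql}) t"
    filtered = []
    for value in group_values:
        escaped = value.replace("'", "''")  # minimal quoting for the sketch
        filtered.append(
            f"SELECT * FROM ({complex_sql}) t WHERE t.{group_col} = '{escaped}'"
        )
    return [discover] + filtered


complex_sql = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
queries = traditional_import_queries(complex_sql, "region", ["north", "south", "east"])
```

With N distinct values the complex query is embedded in N + 1 statements, so the source executes it N + 1 times.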

Another key issue in importing source data into a data mart is system memory usage. When massive data is imported into a data mart, the system must convert it to columnar storage and compress it along the way. Reducing peak memory usage and shortening object lifetimes is therefore another problem that urgently needs to be solved.

Therefore, how to speed up the import of data into the data mart, reduce the impact of complex queries, and lower peak memory usage during data mart storage is a problem that those skilled in the art urgently need to solve.

Disclosure of Invention

In view of this, the present invention provides a method for grouped import into a data mart for complex queries. It accelerates grouped import in the situation where extracting data with complex requirements from a data source into a data mart may generate complex queries. First, a feature analysis of the grouping column of the source data determines whether grouping is feasible; then the data is split according to the distinct values of the grouping column; finally, the split data is imported into the data mart. To accelerate the import and reduce memory pressure, data is processed in units of 'blocks', grouping information is added to the meta-information of each generated data block, and data splitting and data import proceed concurrently, which increases the import speed. The method also evaluates the feasibility of grouped import and splits the data into blocks in memory according to the distinct values of the grouping column, so that the number of complex queries is reduced to 1.

To achieve the above object, the invention adopts the following technical solution:

A method for grouped import into a data mart for complex queries comprises the following specific steps:

step 1: loading source data, performing a complex query on the source data with sorting information for the grouping column added to the complex query, performing feature analysis on the source data, and storing source data that meets the splitting criteria in memory; otherwise, generating data blocks from the source data in the order given by the sorting information and importing the data blocks into the data mart nodes;

step 2: splitting the source data in memory according to the distinct values of the grouping column to obtain data blocks;

step 3: adding metadata information to the data blocks to obtain enhanced data blocks;

step 4: compressing the enhanced data blocks to obtain compressed data blocks, and distributing the compressed data blocks to the data mart nodes.

Preferably, the complex query process that loads the source data, the source data splitting process, and the process of adding information to, compressing, and importing the data blocks into the data mart are each assigned to one of three thread models, and are processed in parallel using streaming data processing.

Preferably, when the complex query is performed in step 1, the data query layer sorts the source data according to the sorting information of the grouping column.

Preferably, the data in each data block obtained by grouping shares the same data characteristics, and these data characteristics are recorded as the metadata information.

Preferably, the source data is loaded in a streaming manner.

Preferably, if the query API of the complex query supports setting a sort order, the sorting information is added to the API call, and the sorting workload is pushed down to the data mart; otherwise, the source data is sorted using the TimSort sorting algorithm.

Preferably, performing the feature analysis on the source data in step 1 means running an additional query on the source data over the grouping column to determine whether the source data meets the grouping condition. The specific process is as follows:

step 11: obtaining the type of the grouping column, the number of distinct values, and the average data volume of each group;

step 12: if the grouping column type is not numeric, the number of distinct values is smaller than a set maximum, and the average data volume is larger than the data volume of one data block, the source data meets the grouping condition.

As can be seen from the above technical solution, compared with the prior art, the invention discloses a method for grouped import into a data mart for complex queries. The complex query is run on the source data, the source data is sorted according to the sorting information, and the source data is split into blocks according to the grouping condition determined by a grouping query, so that the number of complex queries is reduced to one. The complex query, grouping, and block processing of the source data run in parallel, and the data blocks are stored using streaming processing. Grouping information is added to the data blocks, which speeds up the import of data into the data mart, and the data blocks are compressed, which reduces storage space usage.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of the method for grouped import into a data mart for complex queries provided by the present invention;

FIG. 2 is a flow chart illustrating grouping determination of source data according to the present invention;

FIG. 3 is a schematic diagram of the grouping column selection interface according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a method for grouped import into a data mart for complex queries, which comprises the following specific steps:

S1: loading source data, performing a complex query on the source data with sorting information for the grouping column added to the complex query, performing feature analysis on the source data, and storing source data that meets the splitting criteria in memory; otherwise, generating data blocks from the source data in the order given by the sorting information and importing them into the data mart nodes, wherein the data blocks are generated by traversing the source data in order: whenever the traversed data reaches the size of one data block, a data block is generated and generation of the next data block begins, until the traversal is finished;

the feature analysis of the source data is an additional query on the source data over the grouping column, which determines whether the source data meets the grouping condition:

S11: obtaining the type of the grouping column, the number of distinct values, and the average data volume of each group;

S12: if the grouping column type is not numeric, the number of distinct values is less than the set maximum, and the average data volume is greater than the data volume of one data block, the source data meets the grouping condition;

S2: splitting the source data in memory according to the distinct values of the grouping column to obtain data blocks;

S3: adding metadata information to the data blocks to obtain enhanced data blocks;

S4: compressing the enhanced data blocks to obtain compressed data blocks, and distributing the compressed data blocks to the data mart nodes.
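As a non-limiting illustration of the fallback path in S1, where source data that does not meet the grouping condition is cut into data blocks simply by traversing it in sorted order, a minimal Python sketch follows. The function name and the `block_size` parameter are assumptions made for this sketch.

```python
from itertools import islice


def fixed_size_blocks(sorted_rows, block_size):
    """Traverse the rows in order and emit a block each time block_size rows
    have been collected; the last block may be shorter."""
    iterator = iter(sorted_rows)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


blocks = list(fixed_size_blocks(range(7), 3))
```

Because the function is a generator, only one block is held in memory at a time, which matches the streaming intent of the method.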

To further optimize the technical solution, the complex query process that loads the source data, the source data splitting process, and the process of adding information to, compressing, and importing the data blocks into the data mart are each assigned to one of three thread models and are processed in parallel using streaming data processing.

To further optimize the technical solution, when the complex query is performed in S1, the source data is sorted in the data query layer according to the sorting information of the grouping column, so that identical values are gathered together. This reduces memory usage and improves efficiency when the data is subsequently split into data blocks in memory in S2.

To further optimize the technical solution, the data in each data block obtained by grouping shares the same data characteristics, and these data characteristics are recorded as metadata information. The metadata information can serve as a filter condition when the data mart is queried, which makes filtered queries very efficient; in addition, specific data blocks can be deleted according to their metadata information.
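As a non-limiting illustration of how block metadata can act as a filter condition and as a deletion key, a minimal Python sketch follows. The `DataBlock` class and both function names are assumptions made for this sketch and are not part of the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class DataBlock:
    rows: list
    meta: dict = field(default_factory=dict)  # e.g. {"region": "north"}


def prune_blocks(blocks, group_col, wanted_value):
    """Keep only blocks whose metadata records the wanted grouping value,
    so the rows of all other blocks never need to be read."""
    return [b for b in blocks if b.meta.get(group_col) == wanted_value]


def delete_blocks(blocks, group_col, value):
    """Drop every block whose metadata records the given grouping value."""
    return [b for b in blocks if b.meta.get(group_col) != value]


blocks = [
    DataBlock(rows=[("north", 10), ("north", 12)], meta={"region": "north"}),
    DataBlock(rows=[("south", 7)], meta={"region": "south"}),
]
north_only = prune_blocks(blocks, "region", "north")
without_south = delete_blocks(blocks, "region", "south")
```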

To further optimize the above technical solution, the source data is loaded in a streaming manner.

To further optimize the technical solution, if the query API of the complex query supports setting a sort order, as is the case for the data warehouse Hive and for relational databases that support SQL syntax such as MYSQL and ORACLE, the sorting information is added to the API call and the sorting workload is pushed down to the data mart; otherwise, the source data is sorted using the TimSort sorting algorithm.

Examples

First, sorting information for the grouping column is added to the complex query. How the result set of the complex query is sorted depends on the characteristics of the data source: if the query API of the data source supports setting a sort order (for example Hive, and relational databases that support SQL syntax such as MYSQL and ORACLE), the sorting information is added to the API call of the complex query, so that the sorting workload is pushed down to the database; if the query API of the data source does not support setting a sort order, the TimSort sorting algorithm is used to sort the source data.
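As a non-limiting illustration of this first step, the following Python sketch chooses between pushing the sort down as an `ORDER BY` clause and sorting locally. The function names, the `source_supports_order_by` flag, and the SQL wrapping style are assumptions made for this sketch. The local fallback uses Python's built-in `sorted`, which in CPython is a stable, Timsort-derived sort.

```python
from operator import itemgetter


def plan_sort(complex_sql, group_col, source_supports_order_by):
    """Return (sql_to_run, needs_local_sort)."""
    if source_supports_order_by:
        # The source sorts the result set, so no local sort is needed.
        return f"SELECT * FROM ({complex_sql}) t ORDER BY t.{group_col}", False
    # The source cannot sort: run the query unchanged and sort the result locally.
    return complex_sql, True


def local_sort(rows, group_index):
    """Sort rows by the grouping column; rows with equal keys keep their order."""
    return sorted(rows, key=itemgetter(group_index))


sql, needs_local_sort = plan_sort("SELECT a, b FROM x", "a", True)
ordered = local_sort([("b", 1), ("a", 2), ("b", 3), ("a", 4)], 0)
```

Either way the complex query itself is executed only once; only the place where the sort happens differs.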

Second, to ensure that the data blocks in the data mart are of consistent size, the number of distinct values of the grouping column and the row count of each distinct value must be queried to judge whether grouping is suitable. A simple aggregate query is run on the source data with the grouping column as the dimension and the row count as the measure. If the total number of rows in the query result exceeds 3000, the limit on the number of distinct values of the grouping column, the query is not suitable for grouping; if the average row count over all distinct values of the grouping column is smaller than 262144, the query is likewise judged unsuitable for grouped import into the data mart.

After the user selects the data source to import, the candidate grouping columns are presented in an interface for the user to choose from, as shown in FIG. 3. The grouping column selected by the user should satisfy the following conditions: the number of distinct values is less than 3000; the average data volume of each distinct-value group is greater than 262144 (2 to the 18th power), the data volume of one data block; and the type of the grouping column is generally a string, Boolean, or date type, while numeric types cannot be used.

The feasibility check on the grouping column is an additional query of the source data. The query groups by the grouping column and aggregates the row count of each distinct value. Feasibility is judged from the query result (the number of rows in the result set and the data volume of each distinct value); the specific judgment steps are shown in FIG. 2.
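As a non-limiting illustration, the feasibility judgment described in this step can be sketched in Python as follows. The thresholds 3000 and 262144 come from the text; the function name, the type labels in `NUMERIC_TYPES`, and the representation of the aggregate query result as a dictionary of row counts are assumptions made for this sketch. How the boundary cases (exactly 3000 distinct values, or an average of exactly 262144) are treated is also an assumption.

```python
MAX_DISTINCT_VALUES = 3000   # limit on distinct values of the grouping column
BLOCK_ROWS = 2 ** 18         # 262144, the data volume of one data block

NUMERIC_TYPES = {"int", "float", "decimal"}  # hypothetical type labels


def can_group(column_type, group_counts):
    """Decide whether grouped import is feasible.

    group_counts maps each distinct grouping value to its row count, i.e. the
    result of the additional aggregate query on the grouping column.
    """
    if column_type in NUMERIC_TYPES:
        return False                      # numeric grouping columns are excluded
    distinct = len(group_counts)
    if distinct == 0 or distinct >= MAX_DISTINCT_VALUES:
        return False                      # too many (or no) distinct values
    average = sum(group_counts.values()) / distinct
    return average > BLOCK_ROWS           # each group should fill a block on average


feasible = can_group("string", {"north": 300_000, "south": 400_000})
```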

Third, the source data is split into data blocks in memory according to the grouping column, and metadata information is added to the data blocks.

Splitting the source data according to the grouping column means traversing the streaming-loaded source data and putting rows that share the same grouping column value into the same data block. Because the source data is sorted, encountering the next distinct value indicates that the data block for the previous distinct value is complete; the distinct value of the grouping column is then added to the meta-information of that data block, which is used for distribution and storage.
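As a non-limiting illustration of this third step, the following Python sketch splits an already sorted row stream at each change of the grouping value and records that value in the block's meta-information. The function name and the dictionary layout of a block are assumptions made for this sketch, which also ignores any upper bound on the size of a single block.

```python
from itertools import groupby
from operator import itemgetter


def split_by_group(sorted_rows, group_index, group_col):
    """Yield one block per run of equal grouping values.

    Relies on the input being sorted by the grouping column: a change of
    value marks the end of the previous block.
    """
    for value, rows in groupby(sorted_rows, key=itemgetter(group_index)):
        yield {"meta": {group_col: value}, "rows": list(rows)}


stream = iter([("north", 10), ("north", 12), ("south", 7)])
blocks = list(split_by_group(stream, 0, "region"))
```

Since `groupby` consumes the stream lazily, a block can be handed on for compression as soon as the next distinct value appears.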

Fourth, the generated data blocks are compressed and distributed to the data mart nodes. Using the multithreading support of a high-level programming language, writing the source data query results into the result set and splitting the result set (that is, splitting the source data into data blocks) are executed in different threads; fetching the query results is implemented as a non-blocking operation, and the query results are shared among the threads. This ensures that the data is processed in a streaming manner. The second, third, and fourth steps are likewise executed in different threads, so that they run in parallel.

Parallel processing here means that asynchronous streaming loading of the source data, splitting the source data into groups in memory, and distributing the resulting data blocks are performed in parallel.
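As a non-limiting illustration of the three cooperating threads, the following Python sketch connects a loader, a splitter, and a compressor with bounded queues. The function name, the queue sizes, the `_DONE` sentinel, the use of `zlib`, and the in-memory list standing in for the data mart nodes are all assumptions made for this sketch.

```python
import queue
import threading
import zlib
from itertools import groupby
from operator import itemgetter

_DONE = object()  # sentinel marking the end of a stream


def run_pipeline(sorted_source_rows, group_index=0):
    row_q = queue.Queue(maxsize=1024)   # loader -> splitter
    block_q = queue.Queue(maxsize=8)    # splitter -> compressor
    stored = []                         # stands in for the data mart nodes

    def load():
        # Thread 1: stream query results into the shared queue.
        for row in sorted_source_rows:
            row_q.put(row)
        row_q.put(_DONE)

    def split():
        # Thread 2: cut the sorted stream into one block per grouping value.
        rows = iter(row_q.get, _DONE)
        for value, group in groupby(rows, key=itemgetter(group_index)):
            block_q.put((value, list(group)))
        block_q.put(_DONE)

    def compress_and_store():
        # Thread 3: compress each block and hand it to storage.
        for value, rows in iter(block_q.get, _DONE):
            stored.append((value, zlib.compress(repr(rows).encode("utf-8"))))

    threads = [threading.Thread(target=f) for f in (load, split, compress_and_store)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return stored


result = run_pipeline([("north", 10), ("north", 12), ("south", 7)])
```

The bounded queues cap how many rows and blocks are in flight at once, which is one way to keep peak memory low while the stages overlap.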

The beneficial effects of the invention are:

1) the number of complex queries can be reduced to 1;

2) data is processed in units of blocks, which greatly reduces memory pressure;

3) the steps of importing the source data into the data mart can be executed in parallel, which increases the import speed and greatly reduces the time spent on the import.

A specific implementation of the fast data mart import method in a system that generates data marts is as follows:

1. after analyzing the analysis and extraction requirements for the source data, the user confirms that the case is a complex-query data mart import scenario. The user then creates a new data set A in the system and writes the complex query statement into data set A;

2. the user creates a new incremental data mart import task in the system and selects data set A created in step 1 as the task's data set; the system then prompts the user to select a grouping column, and the user selects one as required;

3. the user saves the newly created import task and then runs it;

4. the user observes the task status on the task control panel page of the system; a successful task run indicates that the import into the data mart succeeded.

As these steps show, the scheme is simple to carry out: most of the work is handled by the system, so the barrier to operation is low.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for grouped import into a data mart for complex queries, characterized by comprising the following specific steps:

2. The method for grouped import into a data mart for complex queries according to claim 1, wherein the complex query process that loads the source data, the source data splitting process, and the process of adding information to, compressing, and importing the data blocks into the data mart are each assigned to one of three thread models, and parallel processing is performed using streaming data processing.

3. The method of claim 1, wherein, when the complex query is performed in step 1, the data query layer sorts the source data according to the sorting information of the grouping column.

4. The method of claim 1, wherein the data in each data block obtained by grouping shares the same data characteristics, and the data characteristics are recorded as the metadata information.

5. The method of claim 1, wherein the source data is loaded in a streaming manner.

6. The method of claim 1, wherein if the query API of the complex query supports setting a sort order, the sorting information is added to the API call, and the sorting workload is pushed down to the data mart; otherwise, the source data is sorted using the TimSort sorting algorithm.

7. The method for grouped import into a data mart for complex queries according to claim 1, wherein performing the feature analysis on the source data in step 1 means running an additional query on the source data over the grouping column to determine whether the source data meets the grouping condition, and the specific process is as follows: