CN113297270A

CN113297270A - Data query method and device, electronic equipment and storage medium

Info

Publication number: CN113297270A
Application number: CN202110380490.2A
Authority: CN
Inventors: 李福宜; 赵彦林; 李周; 王平; 陈宏伟; 何建锋
Original assignee: Xi'an Jiaotong University Jump Network Technology Co ltd
Current assignee: Xi'an Jiaotong University Jump Network Technology Co ltd
Priority date: 2021-04-09
Filing date: 2021-04-09
Publication date: 2021-08-24

Abstract

The invention discloses a real-time data query method, device, equipment and storage medium. ElasticSearch (ES) and ClickHouse consume data from the same source and store it separately, and a different engine responds depending on the target data volume pointed to by each query request. This overcomes the obvious weakness of ES in data deduplication and counting while making full use of its flexibility in nested aggregation, so that large-scale data can be aggregated and analyzed quickly and results returned in near real time, improving the value and significance of the data query results.

Description

Data query method and device, electronic equipment and storage medium

Technical Field

The invention belongs to the technical field of data analysis, and particularly relates to a method and a device for querying data in real time under large-scale data volume, electronic equipment and a storage medium.

Background

With the advent of the big data age, traditional data analysis methods face enormous challenges, on the one hand from the explosive growth of data volume and on the other from the growing variety of data types. Efficient request response is crucial to the effective implementation of big data services. To process certain queries and data mining applications quickly, the database needs to perform statistical analysis on data fields along one dimension or a combination of dimensions, such as grouped sums, counts, maximum values, minimum values, or other custom statistical functions, and aggregate the results into specific data overviews. For example, in practical applications, when a user searches for a keyword (e.g., "mobile phone"), statistical analysis of the related data yields an aggregation result (e.g., product themes such as "5G", "wireless charging" and "curved screen"), and filtering by these themes returns the related data set, so that the search goal is reached quickly.

ElasticSearch (ES for short) is a distributed full-text search engine built on the underlying Lucene technology. It achieves fast queries to a certain extent through mechanisms that improve data ingestion and filtering performance, but it has an obvious weakness in large-scale data statistics and data deduplication. With large data volumes, searching, filtering and aggregating data for different services consumes considerable resources. To keep services running normally, the aggregation analysis as a whole therefore needs to be optimized to provide a better query service.

Disclosure of Invention

In view of the above technical background, the present invention aims to provide a method, an apparatus, a device and a storage medium for querying data in real time, so as to increase the processing speed of data query and speed up the response of query request.

In a first aspect, a method for querying data in real time is provided. The method includes obtaining a query request for real-time data: if the target data volume pointed to by the query request is smaller than a first threshold, query content is obtained from ElasticSearch or ClickHouse; if the target data volume pointed to by the query request is larger than the first threshold, ClickHouse performs deduplication statistics on the full data volume, the data is fetched and input into ElasticSearch item by item for filtering and sub-aggregation, and the aggregation results are summarized and returned.

Preferably, if the target data volume pointed to by the query request is greater than a second threshold: ClickHouse performs statistics on the target data in batches, with the first threshold as the batch size; each batch of data is input into ElasticSearch item by item for filtering and sub-aggregation, and the results are summarized to obtain the aggregation result of that batch; the aggregation results of all batches are then summarized and returned.

Before the step of judging the magnitude of the target data pointed to by the query request, the method further includes: splitting the target data acquired in real time by data type and storing it in Kafka Topics, with ClickHouse and ElasticSearch consuming data from the same Topic and storing it separately, wherein ClickHouse stores only the field data that participates in aggregation analysis.

Preferably, if the target data volume pointed to by the query request is greater than the first threshold, ClickHouse performs count deduplication and multidimensional deduplication to obtain the total number of data records and the data participating in aggregation on each page; the data is then fetched page by page and input into ElasticSearch item by item for processing.
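
The count deduplication, multidimensional deduplication and paging described above can be illustrated with a minimal in-memory sketch. It is not the ClickHouse implementation: the function names, field names and sample rows are hypothetical, and plain Python sets stand in for the database's distinct-count statistics.

```python
from typing import Hashable, Iterable, Mapping, Sequence


def count_distinct(rows: Iterable[Mapping[str, Hashable]], dims: Sequence[str]) -> int:
    """Count distinct combinations of the given dimensions (one or several fields)."""
    return len({tuple(row[d] for d in dims) for row in rows})


def page_count(total_records: int, page_size: int) -> int:
    """Number of pages needed to hand `total_records` to the next stage."""
    return (total_records + page_size - 1) // page_size


# Hypothetical log rows; the field names are illustrative only.
rows = [
    {"src_ip": "10.0.0.1", "dst_port": 22},
    {"src_ip": "10.0.0.1", "dst_port": 22},
    {"src_ip": "10.0.0.1", "dst_port": 80},
    {"src_ip": "10.0.0.2", "dst_port": 22},
]
distinct_sources = count_distinct(rows, ["src_ip"])            # single-field dedup
distinct_pairs = count_distinct(rows, ["src_ip", "dst_port"])  # multidimensional dedup
pages = page_count(len(rows), page_size=3)
```

Here a single-field count and a two-field count give different answers for the same rows, which is why the dimensions of the deduplication have to be fixed before the data is paged.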

Further, in each of the above steps, ElasticSearch filters the input data using a filter query.

In a second aspect, there is provided a data query apparatus, including:

the query receiving module is used for acquiring a real-time query request initiated to the data and analyzing the real-time query request to obtain an aggregation analysis dimension;

the query judging module is used for judging whether the target data volume pointed by the query request is larger than a preset first threshold value or not;

and the query processing module is used for initiating corresponding data aggregation analysis according to the target data volume pointed by the query request and returning an aggregation result.

Preferably, the query processing module is configured to:

if the target data volume pointed to by the query request is smaller than a first threshold, acquire query content from ElasticSearch or ClickHouse;

if the target data volume pointed to by the query request is larger than the first threshold, have ClickHouse perform deduplication statistics on the full data volume, fetch the data, input it into ElasticSearch item by item for filtering and sub-aggregation, and summarize and return the aggregation results;

if the target data volume pointed to by the query request is larger than a second threshold: have ClickHouse perform statistics on the target data in batches, with the first threshold as the batch size; input each batch of data into ElasticSearch item by item for filtering and sub-aggregation, and summarize to obtain the aggregation result of that batch; then summarize the aggregation results of all batches and return them.

Further, the apparatus further comprises:

and the data storage module is used for splitting the target data acquired in real time by data type and storing it in Kafka Topics, with ClickHouse and ElasticSearch consuming data from the same Topic and storing it separately, wherein ClickHouse stores only the field data that participates in aggregation analysis.

In a third aspect, a data query device is provided, the device comprising a memory, a processor and a data query program stored on the memory and operable on the processor, the data query program, when executed by the processor, implementing the steps of the data query method as described above.

In a fourth aspect, a computer-readable storage medium is provided, on which a data query program is stored, which when executed by a processor implements the steps of the data query method as described above.

With the above technical content, the real-time data query method, device, electronic equipment and computer-readable storage medium provided by the embodiments of the invention have the following beneficial effects: ES and ClickHouse consume data from the same source and store it separately, and a different engine responds depending on the target data volume pointed to by each query request. This overcomes the obvious weakness of ES in data deduplication and counting while making full use of its flexibility in nested aggregation, so that large-scale data can be aggregated and analyzed quickly and results returned in near real time, improving the value and significance of the data query results.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention are briefly described, the drawings described below are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. It is to be noted that any elements in the figures are meant to be exemplary rather than limiting and that any nomenclature is used for distinction only and not in any limiting sense.

FIG. 1 is a schematic diagram illustrating a workflow of a real-time data query method according to a first embodiment of the present application;

FIG. 2 is a schematic diagram of a data flow of a real-time query according to a second embodiment of the present application;

FIG. 3 is a schematic diagram illustrating the module composition of a data real-time query device according to a third embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.

First, the relevant contents of data analysis are briefly introduced to better understand the technical solution of the embodiments of the present application.

Data processing can be broadly divided into two categories: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). OLTP is the primary application of traditional relational databases and mainly serves basic, everyday transactions such as banking transactions. OLAP is the major application of data warehouse systems; it supports complex analytical operations, emphasizes decision support, and provides intuitive and understandable query results.

Unlike the transaction processing (OLTP) scenario, such as adding to a shopping cart, placing an order, or paying in an e-commerce setting, where large numbers of insert, update and delete operations are performed in place, the data analysis (OLAP) scenario generally imports data in batches and then performs flexible exploration along any dimension, BI tool insight, report generation and the like. After the data is written once, it is mined and analyzed from various angles until information such as business value and business trends is found. This is a process of trial and error, constant adjustment and continuous optimization, in which data is read far more often than it is written. The underlying database therefore needs to be designed specifically for this characteristic.

Example one

As shown in FIG. 1, a real-time data query method comprises:

acquiring a query request for real-time data, and judging the scale of the query target data;

if the target data volume pointed to by the query request is smaller than a first threshold, acquiring query content from ElasticSearch (for convenience of description, hereinafter referred to as ES) or ClickHouse;

if the target data volume pointed to by the query request is larger than the first threshold, performing deduplication statistics on the full data volume by ClickHouse, fetching the data, inputting it into ES item by item for filtering and sub-aggregation, and summarizing and returning the aggregation results.
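
The control flow of these steps can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the function `answer_query` and its callable parameters are hypothetical stand-ins for the real engine calls, and treating a volume exactly equal to the threshold as the large case is an assumption, since the text does not specify it.

```python
from typing import Any, Callable, Iterable, List


def answer_query(
    target_rows: int,
    first_threshold: int,
    direct: Callable[[], Any],
    dedup_and_fetch: Callable[[], Iterable[Any]],
    filter_and_aggregate: Callable[[Any], Any],
    summarize: Callable[[List[Any]], Any],
) -> Any:
    """Dispatch a query by target data volume.

    direct               -- stand-in for answering from ES or ClickHouse alone
    dedup_and_fetch      -- stand-in for ClickHouse deduplication and fetching
    filter_and_aggregate -- stand-in for ES filtering and sub-aggregation per item
    summarize            -- combines the partial aggregation results
    """
    if target_rows < first_threshold:
        return direct()
    # Assumption: a volume equal to the threshold also takes the large path.
    partials = [filter_and_aggregate(item) for item in dedup_and_fetch()]
    return summarize(partials)


# Toy stand-ins, used only to exercise both branches.
small = answer_query(5, 10, lambda: "direct", lambda: [], lambda x: x, sum)
large = answer_query(20, 10, lambda: "direct", lambda: [1, 2, 3], lambda x: x * 2, sum)
```

Passing the engine calls in as parameters keeps the sketch free of any real client library, so the branching can be checked without a running cluster.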

Nested aggregation aggregates data over several fields in sequence, for example aggregating on a "gender" field and then performing a nested aggregation (sub-aggregation) on an "age" field, so that one aggregation is nested inside another.
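
The gender-then-age example can be written out in a few lines of plain Python. This is a minimal sketch of the idea rather than how ES computes it; the function name and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Mapping


def nested_counts(
    docs: Iterable[Mapping[str, Hashable]], outer: str, inner: str
) -> Dict[Hashable, Dict[Hashable, int]]:
    """Group documents by `outer`, then count values of `inner` inside each group."""
    buckets: Dict[Hashable, Counter] = defaultdict(Counter)
    for doc in docs:
        buckets[doc[outer]][doc[inner]] += 1
    return {key: dict(counter) for key, counter in buckets.items()}


# Hypothetical documents, for illustration only.
docs = [
    {"gender": "f", "age": 30},
    {"gender": "f", "age": 30},
    {"gender": "f", "age": 41},
    {"gender": "m", "age": 30},
]
result = nested_counts(docs, "gender", "age")
```

The outer dictionary corresponds to the first aggregation and each inner dictionary to the sub-aggregation computed within one outer group.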

An aggregation in ES is composed of buckets (collections of documents that satisfy a certain condition) and metrics (statistical calculations over the documents in a bucket). An aggregation may consist of a single bucket, a single metric, or both, and buckets may be nested inside other buckets. Because buckets can be nested, ES can express a very wide range of complex aggregations. The specific technical details of how ES implements filtering and aggregation are not limited by the present application and are therefore not described in detail.
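
For illustration, a nested bucket-plus-metric aggregation can be expressed as an ES search request body. The sketch below only builds the body as a Python dictionary and sends nothing. The aggregation names (`by_outer`, `by_inner`, `avg_metric`), the helper function and the field names are hypothetical; `terms` and `avg` are the standard ES bucket and metric aggregation types.

```python
import json
from typing import Any, Dict


def nested_terms_body(outer_field: str, inner_field: str, metric_field: str) -> Dict[str, Any]:
    """Build a search body: terms bucket -> nested terms bucket -> avg metric."""
    return {
        "size": 0,  # return aggregation results only, no individual hits
        "aggs": {
            "by_outer": {
                "terms": {"field": outer_field},  # outer bucket
                "aggs": {
                    "by_inner": {
                        "terms": {"field": inner_field},  # bucket nested in a bucket
                        "aggs": {
                            "avg_metric": {"avg": {"field": metric_field}}  # metric
                        },
                    }
                },
            }
        },
    }


# Hypothetical field names, for illustration only.
body = nested_terms_body("gender", "age", "income")
serialized = json.dumps(body)
```

Each additional level of `aggs` adds one more level of nesting, which is the flexibility the method relies on ES for.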

ClickHouse is an open-source columnar database management system oriented toward online analytical processing (OLAP). Columnar storage has many desirable properties in analysis scenarios and can yield a considerable speed-up in specific analysis scenarios, for the following reasons:

In row storage mode, data is stored contiguously row by row, with the data of all columns held in one block, so columns that do not participate in the calculation are also read during IO and read operations are greatly amplified. In columnar storage mode, only the columns participating in the calculation need to be read, which greatly reduces IO cost and accelerates queries.
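
The read amplification can be made concrete with a back-of-the-envelope model. It is a deliberately simplified sketch: it ignores compression, block granularity and indexes, and the column names and byte widths are assumptions chosen only for illustration.

```python
from typing import Iterable, Mapping


def bytes_read_row_store(num_rows: int, column_widths: Mapping[str, int]) -> int:
    """Row layout: every column of every row is read, needed or not."""
    return num_rows * sum(column_widths.values())


def bytes_read_column_store(
    num_rows: int, column_widths: Mapping[str, int], needed: Iterable[str]
) -> int:
    """Columnar layout: only the columns used by the query are read."""
    return num_rows * sum(column_widths[name] for name in needed)


# Hypothetical fixed widths in bytes per value.
widths = {"src_ip": 16, "dst_ip": 16, "payload": 200, "level": 1}
row_bytes = bytes_read_row_store(1_000_000, widths)
col_bytes = bytes_read_column_store(1_000_000, widths, ["level"])
```

Under these assumed widths, a query that touches only the one-byte `level` column reads 1 MB from the columnar layout against 233 MB from the row layout.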

Data in the same column is of the same type, so compression is highly effective. Columnar storage often achieves a compression ratio of ten times or more, which saves a large amount of storage space and reduces storage cost. A higher compression ratio means a smaller data size and a shorter time to read the corresponding data from disk; it also means that memory of the same size can hold more data, so the system cache is more effective. Because different columns have different data types and suit different compression algorithms, the most suitable algorithm can be selected for each column type.
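
Why same-type, repetitive column data compresses well can be seen with run-length encoding. RLE is used here only as an easy-to-follow stand-in and is not claimed to be the codec ClickHouse applies; the column contents are hypothetical.

```python
from itertools import groupby
from typing import Hashable, Iterable, List, Tuple


def run_length_encode(values: Iterable[Hashable]) -> List[Tuple[Hashable, int]]:
    """Collapse each run of equal neighbouring values into (value, run_length)."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# A hypothetical low-cardinality column stored contiguously.
column = ["alert"] * 1000 + ["info"] * 3000
encoded = run_length_encode(column)
ratio = len(column) / len(encoded)
```

The 4000 stored values collapse to two runs. In a row layout the same values would be interleaved with other fields, so such long runs would not form.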

Therefore, compared with row storage, ClickHouse is less affected by data scale when providing a data query service, performs better for queries over large data volumes, and can improve query efficiency. The specific technical details of the ClickHouse database are outside the scope of the present invention and are not described here.

According to the technical scheme, different engines are adopted for data response according to target data volumes pointed by different query requests, so that the obvious disadvantages of the ES in data deduplication and counting are overcome by using the ClickHouse, the flexibility of the ES in aggregation nesting is fully utilized, the large-scale data is quickly aggregated and analyzed, a query result is returned, the effect of approximate real-time is achieved, and the value and the meaning of the data are improved.

Example two

As shown in FIG. 2, before the query request of the first embodiment is acquired, the target data collected in real time is split by data type and stored in Kafka Topics. A Topic is the basic unit of a Kafka data write operation. A producer (e.g., a network security device) can publish data (e.g., a security event log) to a selected Topic, and each record published to a Topic is delivered to a consumer instance in each subscribing consumer group, where the consumer instances may be distributed across multiple processes or multiple machines. In this embodiment, ClickHouse and ES act as data consumers: through the Flink data stream processing engine, they consume data from the same Topic and store it separately, and ClickHouse stores only the field data that participates in aggregation analysis.
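
The dual-write behaviour described above can be mimicked in memory. The sketch below does not use Kafka or Flink: a Python list stands in for the Topic, two lists stand in for the ES and ClickHouse stores, and the set of aggregation fields and the record layout are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List, Set

# Hypothetical set of fields that participate in aggregation analysis.
AGG_FIELDS: Set[str] = {"event_type", "src_ip"}


def consume(
    records: Iterable[Dict[str, Any]],
    es_store: List[Dict[str, Any]],
    ch_store: List[Dict[str, Any]],
    agg_fields: Set[str] = AGG_FIELDS,
) -> None:
    """Deliver every record to both sinks; the ClickHouse sink keeps fewer fields."""
    for record in records:
        es_store.append(dict(record))  # ES keeps the full record
        ch_store.append({k: v for k, v in record.items() if k in agg_fields})


# A list standing in for one Topic.
topic = [{"event_type": "scan", "src_ip": "10.0.0.1", "raw": "original log line"}]
es_store: List[Dict[str, Any]] = []
ch_store: List[Dict[str, Any]] = []
consume(topic, es_store, ch_store)
```

Both sinks see every record, but the ClickHouse-side copy drops the `raw` field, mirroring the rule that ClickHouse stores only the fields used in aggregation analysis.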

Kafka is a distributed messaging system that supports partitioning and multiple replicas. Its most notable characteristic is that it can process large amounts of data in real time, and it offers high throughput, low latency, scalability, durability, reliability, fault tolerance and high concurrency, so it suits a variety of scenarios such as log collection, user activity tracking and stream processing. Flink is a streaming dataflow execution engine that provides data distribution, data communication and fault tolerance mechanisms for distributed computation over data streams. The specific technical details of Kafka and Flink are not limited by the technical solution of the present application and are not described here.

As a preferred embodiment, if the producer is a network security device, the logs generated by the device are written to the corresponding Topic according to type: security detection log, network traffic log, protocol audit log, or third-party device input log. ClickHouse and ES consume the log data from the same Topic in order and store it separately, but ClickHouse stores only the security detection logs and network traffic logs that participate in aggregation analysis. When the target data of a query request is of one of these types, ClickHouse performs the statistical calculation and generates paged list data, which is input into ES page by page.

And if the query request is not directed to a security detection log or a network flow log participating in the aggregation analysis but to a protocol audit log not participating in the aggregation analysis, processing and responding to the query request through the ES.

If the query request points to a security detection log or a network traffic log participating in aggregation analysis, further judging the data scale of the pointed log:

if the target data volume pointed to by the query request is less than 1 billion, ES and ClickHouse do not differ greatly in aggregation analysis and statistics at this order of magnitude, so the query content is obtained from ElasticSearch or ClickHouse;

if the target data volume pointed to by the query request is more than 1 billion, ClickHouse performs deduplication statistics on the full data volume. Because the flexible nested aggregation that ES supports is hard to match with ClickHouse SQL syntax, the data is then fetched and input into ES item by item for filtering and sub-aggregation, and the aggregation results are summarized and returned.
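
The routing rules of this embodiment, by log type first and by data volume second, can be summarized in a small function. The log-type identifiers and the returned labels are hypothetical names introduced for illustration, and treating a volume of exactly one billion as the large case is an assumption.

```python
# Hypothetical identifiers for the log types that participate in aggregation analysis.
AGGREGATED_LOG_TYPES = {"security_detection", "network_traffic"}
FIRST_THRESHOLD = 1_000_000_000  # one billion, as in this embodiment


def route(log_type: str, target_rows: int) -> str:
    """Return a label naming the processing path for a query."""
    if log_type not in AGGREGATED_LOG_TYPES:
        # ClickHouse holds no data of this type, so only ES can answer.
        return "es_only"
    if target_rows < FIRST_THRESHOLD:
        # The engines perform similarly at this scale; either may answer.
        return "es_or_clickhouse"
    # ClickHouse deduplicates and counts; ES filters and sub-aggregates.
    return "clickhouse_then_es"


audit_path = route("protocol_audit", 5_000_000_000)
small_path = route("security_detection", 10)
large_path = route("network_traffic", 2_000_000_000)
```

The log-type check comes first because it decides whether ClickHouse has the data at all; the volume comparison only matters once both engines are candidates.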

Preferably, if the target data volume pointed to by the query request is greater than 10 billion: ClickHouse performs statistics on the target data in batches of 1 billion (for example, if the target volume is 10 billion, the data is divided into at least 10 batches), including counting and multidimensional deduplication; each batch is paged, the first page of data is input into ES item by item for filtering and sub-aggregation, then the second page is fetched, and so on until the batch is fully processed, and the results are summarized to obtain the aggregation result of that batch; the remaining batches are processed in sequence, and the aggregation results of all batches are summarized and returned.
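
The batch-then-page loop can be sketched with small numbers in place of billions. Everything here is illustrative: `aggregate_page` is a hypothetical stand-in for the ES filtering and sub-aggregation of one page, and the sketch assumes the aggregate is a simple count, which can be merged across pages and batches by addition.

```python
from collections import Counter
from typing import Callable, Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def aggregate_in_batches(
    rows: Sequence[str],
    batch_size: int,
    page_size: int,
    aggregate_page: Callable[[Sequence[str]], Counter],
) -> Counter:
    """Process rows batch by batch and page by page, merging the partial counts."""
    total: Counter = Counter()
    for batch in chunks(rows, batch_size):
        batch_result: Counter = Counter()
        for page in chunks(batch, page_size):
            batch_result.update(aggregate_page(page))  # merge one page
        total.update(batch_result)  # merge one batch
    return total


# Hypothetical event types; Counter stands in for the per-page aggregation.
rows: List[str] = ["scan", "scan", "dos", "scan", "dos", "probe", "scan"]
total = aggregate_in_batches(rows, batch_size=4, page_size=2, aggregate_page=Counter)
```

Merging by addition is valid for counts and sums. Distinct counts cannot be added across batches in this way, which is consistent with the method performing deduplication in ClickHouse before the data is handed over.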

In the above steps, ES filters the input data using a filter query. ES provides two query modes, query and filter, and a filter can be applied on top of the data matched by a query. The two differ as follows: a query computes the relevance between the query condition and each candidate document and writes the result into a score field, much like a search engine. A filter only checks whether a document matches and does not compute relevance, much like an ordinary database query; in addition, data matched by a filter can be cached automatically, whereas query results are not, so a filter is faster than a query.
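
As an illustration, a filter-only search can be expressed with the `filter` clause of an ES `bool` query, whose clauses match documents without contributing to the score. The sketch below only builds the request body; the helper function, field names and time values are hypothetical.

```python
from typing import Any, Dict


def filtered_search_body(
    field: str, value: str, start: str, end: str, time_field: str = "timestamp"
) -> Dict[str, Any]:
    """Build a search body whose conditions all sit in filter context."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}},  # exact match, no scoring
                    {"range": {time_field: {"gte": start, "lt": end}}},  # time window
                ]
            }
        }
    }


# Hypothetical field name and time window.
body = filtered_search_body("event_type", "scan", "2021-04-01", "2021-04-09")
```

Placing the same conditions under `must` instead of `filter` would make ES compute relevance scores for them, which this method does not need.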

With this data query method, a different engine responds depending on the target data volume pointed to by each query request (on the order of one billion and ten billion). ClickHouse thereby compensates for the obvious weakness of ES in data deduplication and counting, while the flexibility of ES in nested aggregation is fully used, so that very large data sets can be aggregated and analyzed quickly and query results returned in near real time, improving the value and significance of the data.

Example three

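As shown in FIG. 3, a data query apparatus is provided, including:

a query receiving module, used for acquiring a real-time query request initiated against the data and parsing it to obtain the aggregation analysis dimensions;

a query judging module, used for judging whether the target data volume pointed to by the query request is larger than a preset first threshold;

and a query processing module, used for initiating the corresponding data aggregation analysis according to the target data volume pointed to by the query request and returning the aggregation result.

Preferably, the query processing module is configured to:

if the target data volume pointed to by the query request is smaller than the first threshold, acquire query content from ElasticSearch or ClickHouse;

if the target data volume pointed to by the query request is larger than the first threshold, have ClickHouse perform deduplication statistics on the full data volume, fetch the data, input it into ElasticSearch item by item for filtering and sub-aggregation, and summarize and return the aggregation results;

if the target data volume pointed to by the query request is larger than a second threshold: have ClickHouse perform statistics on the target data in batches, with the first threshold as the batch size; input each batch of data into ElasticSearch item by item for filtering and sub-aggregation, and summarize to obtain the aggregation result of that batch; then summarize the aggregation results of all batches and return them.

Further, the apparatus further comprises:

a data storage module, used for splitting the target data acquired in real time by data type and storing it in Kafka Topics, with ClickHouse and ElasticSearch consuming data from the same Topic and storing it separately, wherein ClickHouse stores only the field data that participates in aggregation analysis.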

With this data query device, a different engine responds depending on the target data volume pointed to by each query request (on the order of one billion and ten billion). ClickHouse thereby compensates for the obvious weakness of ES in data deduplication and counting, while the flexibility of ES in nested aggregation is fully used, so that very large data sets can be aggregated and analyzed quickly and query results returned in near real time, improving the value and significance of the data.

Example four

A data querying device, the device comprising a memory, a processor and a data querying program stored on the memory and operable on the processor, the data querying program, when executed by the processor, implementing the steps of the data querying method as described above.

Based on the above data query method, a computer-readable storage medium is further provided, on which a data query program is stored; when the data query program is executed by a processor, the steps of the data query method are implemented.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A real-time data query method, characterized by comprising obtaining a query request for real-time data: if the target data volume pointed to by the query request is smaller than a first threshold, acquiring query content from ElasticSearch or ClickHouse; if the target data volume pointed to by the query request is larger than the first threshold, performing deduplication statistics on the full data volume by ClickHouse, fetching the data, inputting the data into ElasticSearch item by item for filtering and sub-aggregation, and summarizing and returning the aggregation results.

2. The method according to claim 1, wherein if the target data volume pointed to by the query request is greater than a second threshold: ClickHouse performs statistics on the target data in batches, with the first threshold as the batch size; each batch of data is input into ElasticSearch item by item for filtering and sub-aggregation, and the results are summarized to obtain the aggregation result of that batch; and the aggregation results of all batches are summarized and returned.

3. The method of claim 1, wherein before determining the magnitude of the target data pointed to by the query request, the method further comprises: splitting the target data acquired in real time by data type and storing it in Kafka Topics, with ClickHouse and ElasticSearch consuming data from the same Topic and storing it separately, wherein ClickHouse stores only the field data that participates in aggregation analysis.

4. The query method according to claim 1, wherein if the target data volume pointed to by the query request is greater than the first threshold, ClickHouse performs count deduplication and multidimensional deduplication to obtain the total number of data records and the data participating in aggregation on each page, and the data is then fetched page by page and input into ElasticSearch item by item for processing.

5. The query method of any one of claims 1-4, wherein the ElasticSearch filters the input data with a filter query.

6. A data query apparatus, comprising:

7. The data query device of claim 6, wherein the query processing module is configured to:

8. The data query apparatus according to claim 6 or 7, wherein the apparatus further comprises:

9. A data query device, characterized in that the device comprises a memory, a processor and a data query program stored on the memory and executable on the processor, which data query program, when executed by the processor, implements the steps of the data query method as claimed in any one of claims 1 to 5.

10. A computer-readable storage medium, on which a data query program is stored, which when executed by a processor implements the steps of the data query method according to any one of claims 1 to 5.