CN112632068A

CN112632068A - Solution for rapidly providing mass data query service

Info

Publication number: CN112632068A
Application number: CN202011529950.5A
Authority: CN
Inventors: 余丽玲; 刘跃红; 管正爽; 周小润
Original assignee: Yinsheng Payment Service Co Ltd
Current assignee: Yinsheng Payment Service Co Ltd
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2021-04-09

Abstract

The invention discloses a solution for rapidly providing mass data query service, relating to the technical field of data query; the method comprises the following steps: s1, creating a TTL table; s2, adding a data de-duplication updated identification field in the TTL table; s3, triggering synchronous data increasing operation when the interval set time or the number of data reaches the set number; s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse; s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records; s6, modifying the duration of the TTL table reserved data according to the service requirement, and deleting the outdated data regularly according to the version number; the invention has the beneficial effects that: the method can inquire the updated data in real time, return the required data according to the requirements and solve the problem that a plurality of redundant data appear before the update is triggered.

Description

Solution for rapidly providing mass data query service

Technical Field

The invention relates to the technical field of data query, in particular to a solution for rapidly providing mass data query service.

Background

The rapid development of services enables the data volume to rapidly increase, and the rapidly increasing data gradually becomes a bottleneck for providing service efficiency. A second-level display is substantially unlikely to be possible for a ten million-level tabular relational database.

Hbase, elastic search, drive, Tidb, Mongodb and ClickHouse are common for storing massive data at present, wherein:

(1) HBase: the column type storage brings great limitation to data reading, query instantaneity is general, all queries need to depend on Key, HBase is not suitable for application scenarios with complex data structures and complex queries.

(2) Field type cannot be modified, write performance is low, high performance resource consumption is achieved, and format re-modification may be required if the elastic search is searched. When aggregation is performed, the ES cannot achieve 100% accuracy due to the adoption of an approximate algorithm.

(3) Drive does not support data updating; at present, the deduplication is non-precise and does not support the association operation.

(4) And the Tidb requires an SSD solid-state disk with high requirements on production environment hardware, and has high requirements on the use of a CPU and a memory and a network.

(5) The Mongodb and the cdh cluster data are in an isolated state, and a data isolated island is easily generated along with the development of services. Split migration of blocks can have performance impact on the system when the amount of data is large.

(6) ClickHouse, the lack of high frequency, low latency ability to modify or delete existing data; can only be used for bulk deletion or modification of data.

The advantages of the ClickHouse in the aspects of high throughput writing, disk storage, high compression ratio, multi-core parallel processing, quick response to query requests, complex data type support and the like are outstanding, and the ClickHouse is selected as the data storage to provide data query service.

However, there are still some problems: in the absence of the capability of modifying or deleting existing data with high frequency and low delay, the table engine of the MergeTree series supports data updating, but the updating triggering time only occurs at the time of partition merging, the table engine can only ensure final consistency, and a great deal of redundant data can be inquired before partition merging.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a solution for rapidly providing mass data query service, which can query updated data in real time, return required data according to requirements and solve the problem that a plurality of redundant data appear before the update is triggered.

The technical scheme adopted by the invention for solving the technical problems is as follows: in a solution for providing a fast mass data query service, the improvement comprising the steps of:

s1, creating a TTL table;

s2, adding a data de-duplication updated identification field in the TTL table;

s3, triggering synchronous data increasing operation when the interval set time or the number of data reaches the set number;

s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse;

s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records;

and S6, modifying the duration of the reserved data of the TTL table according to the service requirement, and deleting the outdated data regularly according to the version number.

Further, in step S1, a TTL table of the replacegmergee engine is created according to the service requirement.

Further, in step S2, the primary key is used as the identification field.

Further, in step S2, in each group of data, the row with the largest version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data.

Further, in step S2, the custom deletion flag bit deleteflag, 0 indicates that the data is not deleted, and 1 indicates that the data is deleted.

Further, in step S4,

when data is updated, a row of new data in insert data is updated, the version number field create _ time is the latest insertion time of the system, and a modified field value is obtained by inquiring the maximum version number field create _ time;

when deleting data, a row of new data in insert data, the version number field create _ time is the latest insertion time of the system, deleteflag is changed into 1, the rest is 0, and the deleted data can be divided by inquiring the deleteflag state of the maximum version number field create _ time.

Further, in step S3, the set time is 1 min.

Further, in step S3, the number of data pieces set is 5000.

The invention has the beneficial effects that: the millisecond-level response to the query request can be realized; the updated data can be inquired in real time, the required data can be returned according to the requirement, and the problem that a plurality of redundant data appear before the update is triggered is solved; the deleted data can also be retrieved after the delete operation.

Drawings

Fig. 1 is a schematic flow chart of a solution for rapidly providing mass data query service according to the present invention.

Fig. 2 is a diagram of an embodiment of a solution for rapidly providing mass data query service according to the present invention.

Fig. 3 is a schematic flow chart of a solution for rapidly providing mass data query service according to the present invention.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

The conception, the specific structure, and the technical effects produced by the present invention will be clearly and completely described below in conjunction with the embodiments and the accompanying drawings to fully understand the objects, the features, and the effects of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and those skilled in the art can obtain other embodiments without inventive effort based on the embodiments of the present invention, and all embodiments are within the protection scope of the present invention. In addition, all the connection/connection relations referred to in the patent do not mean that the components are directly connected, but mean that a better connection structure can be formed by adding or reducing connection auxiliary components according to specific implementation conditions. All technical characteristics in the invention can be interactively combined on the premise of not conflicting with each other.

Example 1

Referring to fig. 1 and 2, the present invention discloses a solution for rapidly providing a mass data query service, and specifically, the method includes the following steps:

s1, creating a TTL table;

in step S1, a TTL table of a replacegmergee engine is created according to a service requirement;

s2, adding a data de-duplication updated identification field in the TTL table; in this embodiment, the primary key is used as the identification field; in each group of data, the row with the maximum version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data; the custom deletion flag bit deleteflag, 0 indicates no deletion, and 1 indicates deletion data.

S3, triggering synchronous data increasing operation at set time intervals; in this embodiment, the set time is 1 min;

s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse; in the step S4, in the step S,

when data is deleted, a row of new data in insert data is stored, the version number field create _ time is the latest insertion time of the system, deleteflag is changed into 1, the rest is 0, and the deleted data can be divided by inquiring the deleteflag state of the maximum version number field create _ time;

referring to fig. 1, the data is derived from databases such as Mongddb, oracle, hive, and the like, and after incremental data is synchronized by tools such as DateX or waterdrop, the operation in step S4 is performed;

in the present embodiment, this is achieved by executing the following program:

CREATE VIEW view_table_a AS

SELECT uniqueId,

argMax(column1,create_time)AS column1,

argMax(column2,create_time)AS column2,

argMax(columnN,create_time)AS columnN from table GROUP BY uniqueId HAVING deleted＝0；

Example 2

Referring to fig. 2 and fig. 3, the present invention discloses a solution for rapidly providing a mass data query service, and specifically, the method includes the following steps:

s1, creating a TTL table;

s2, adding a data de-duplication updated identification field in the TTL table; in this embodiment, the primary key is used as the identification field;

in each group of data, the row with the maximum version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data; the custom deletion flag bit deleteflag, 0 indicates no deletion, and 1 indicates deletion data.

S3, triggering synchronous data increasing operation when the number of data reaches the set number; in this embodiment, the number of the set data pieces is 5000;

in the present embodiment, this is achieved by executing the following program:

CREATE VIEW view_table_a AS

SELECT uniqueId,

argMax(column1,create_time)AS column1,

argMax(column2,create_time)AS column2,

With the above embodiment, the present invention discloses a solution for quickly providing mass data query service, and has the following advantages: firstly, a millisecond-level response inquiry request can be realized; secondly, updated data can be inquired in real time, required data can be returned according to requirements, and the problem that a plurality of redundant data appear before updating is triggered is solved; thirdly, the deleted data can be found after the deletion operation; fourthly, the problem that asynchronous data combination is continuously carried out on the bottom layer of ClickHouse line-by-line insert or small-batch insert, update and delete operation to influence query performance is solved.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A solution for rapidly providing mass data query service is characterized in that the method comprises the following steps:

s1, creating a TTL table;

s2, adding a data de-duplication updated identification field in the TTL table;

2. The method as claimed in claim 1, wherein in step S1, a TTL table of a replacegmergree engine is created according to a service requirement.

3. The method as claimed in claim 1, wherein in step S2, the primary key is used as the identification field.

4. The solution for rapidly providing query service for mass data as claimed in claim 1, wherein in step S2, in each group of data, the row with the maximum version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data.

5. The method as claimed in claim 4, wherein in step S2, the custom deletion flag bit deleteflag, 0 indicates not deleted and 1 indicates deleted data.

6. The solution for rapidly providing mass data query service as claimed in claim 5, wherein, in step S4,

7. The method as claimed in claim 1, wherein in step S3, the time is set to be 1 min.

8. The method as claimed in claim 1, wherein in step S3, the number of data is 5000.