CN112632068A - Solution for rapidly providing mass data query service - Google Patents

Solution for rapidly providing mass data query service Download PDF

Info

Publication number
CN112632068A
CN112632068A CN202011529950.5A CN202011529950A CN112632068A CN 112632068 A CN112632068 A CN 112632068A CN 202011529950 A CN202011529950 A CN 202011529950A CN 112632068 A CN112632068 A CN 112632068A
Authority
CN
China
Prior art keywords
data
time
version number
field
updated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011529950.5A
Other languages
Chinese (zh)
Inventor
余丽玲
刘跃红
管正爽
周小润
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yinsheng Payment Service Co Ltd
Original Assignee
Yinsheng Payment Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yinsheng Payment Service Co Ltd filed Critical Yinsheng Payment Service Co Ltd
Priority to CN202011529950.5A priority Critical patent/CN112632068A/en
Publication of CN112632068A publication Critical patent/CN112632068A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a solution for rapidly providing mass data query service, relating to the technical field of data query; the method comprises the following steps: s1, creating a TTL table; s2, adding a data de-duplication updated identification field in the TTL table; s3, triggering synchronous data increasing operation when the interval set time or the number of data reaches the set number; s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse; s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records; s6, modifying the duration of the TTL table reserved data according to the service requirement, and deleting the outdated data regularly according to the version number; the invention has the beneficial effects that: the method can inquire the updated data in real time, return the required data according to the requirements and solve the problem that a plurality of redundant data appear before the update is triggered.

Description

Solution for rapidly providing mass data query service
Technical Field
The invention relates to the technical field of data query, in particular to a solution for rapidly providing mass data query service.
Background
The rapid development of services enables the data volume to rapidly increase, and the rapidly increasing data gradually becomes a bottleneck for providing service efficiency. A second-level display is substantially unlikely to be possible for a ten million-level tabular relational database.
Hbase, elastic search, drive, Tidb, Mongodb and ClickHouse are common for storing massive data at present, wherein:
(1) HBase: the column type storage brings great limitation to data reading, query instantaneity is general, all queries need to depend on Key, HBase is not suitable for application scenarios with complex data structures and complex queries.
(2) Field type cannot be modified, write performance is low, high performance resource consumption is achieved, and format re-modification may be required if the elastic search is searched. When aggregation is performed, the ES cannot achieve 100% accuracy due to the adoption of an approximate algorithm.
(3) Drive does not support data updating; at present, the deduplication is non-precise and does not support the association operation.
(4) And the Tidb requires an SSD solid-state disk with high requirements on production environment hardware, and has high requirements on the use of a CPU and a memory and a network.
(5) The Mongodb and the cdh cluster data are in an isolated state, and a data isolated island is easily generated along with the development of services. Split migration of blocks can have performance impact on the system when the amount of data is large.
(6) ClickHouse, the lack of high frequency, low latency ability to modify or delete existing data; can only be used for bulk deletion or modification of data.
The advantages of the ClickHouse in the aspects of high throughput writing, disk storage, high compression ratio, multi-core parallel processing, quick response to query requests, complex data type support and the like are outstanding, and the ClickHouse is selected as the data storage to provide data query service.
However, there are still some problems: in the absence of the capability of modifying or deleting existing data with high frequency and low delay, the table engine of the MergeTree series supports data updating, but the updating triggering time only occurs at the time of partition merging, the table engine can only ensure final consistency, and a great deal of redundant data can be inquired before partition merging.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a solution for rapidly providing mass data query service, which can query updated data in real time, return required data according to requirements and solve the problem that a plurality of redundant data appear before the update is triggered.
The technical scheme adopted by the invention for solving the technical problems is as follows: in a solution for providing a fast mass data query service, the improvement comprising the steps of:
s1, creating a TTL table;
s2, adding a data de-duplication updated identification field in the TTL table;
s3, triggering synchronous data increasing operation when the interval set time or the number of data reaches the set number;
s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse;
s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records;
and S6, modifying the duration of the reserved data of the TTL table according to the service requirement, and deleting the outdated data regularly according to the version number.
Further, in step S1, a TTL table of the replacegmergee engine is created according to the service requirement.
Further, in step S2, the primary key is used as the identification field.
Further, in step S2, in each group of data, the row with the largest version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data.
Further, in step S2, the custom deletion flag bit deleteflag, 0 indicates that the data is not deleted, and 1 indicates that the data is deleted.
Further, in step S4,
when data is updated, a row of new data in insert data is updated, the version number field create _ time is the latest insertion time of the system, and a modified field value is obtained by inquiring the maximum version number field create _ time;
when deleting data, a row of new data in insert data, the version number field create _ time is the latest insertion time of the system, deleteflag is changed into 1, the rest is 0, and the deleted data can be divided by inquiring the deleteflag state of the maximum version number field create _ time.
Further, in step S3, the set time is 1 min.
Further, in step S3, the number of data pieces set is 5000.
The invention has the beneficial effects that: the millisecond-level response to the query request can be realized; the updated data can be inquired in real time, the required data can be returned according to the requirement, and the problem that a plurality of redundant data appear before the update is triggered is solved; the deleted data can also be retrieved after the delete operation.
Drawings
Fig. 1 is a schematic flow chart of a solution for rapidly providing mass data query service according to the present invention.
Fig. 2 is a diagram of an embodiment of a solution for rapidly providing mass data query service according to the present invention.
Fig. 3 is a schematic flow chart of a solution for rapidly providing mass data query service according to the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The conception, the specific structure, and the technical effects produced by the present invention will be clearly and completely described below in conjunction with the embodiments and the accompanying drawings to fully understand the objects, the features, and the effects of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and those skilled in the art can obtain other embodiments without inventive effort based on the embodiments of the present invention, and all embodiments are within the protection scope of the present invention. In addition, all the connection/connection relations referred to in the patent do not mean that the components are directly connected, but mean that a better connection structure can be formed by adding or reducing connection auxiliary components according to specific implementation conditions. All technical characteristics in the invention can be interactively combined on the premise of not conflicting with each other.
Example 1
Referring to fig. 1 and 2, the present invention discloses a solution for rapidly providing a mass data query service, and specifically, the method includes the following steps:
s1, creating a TTL table;
in step S1, a TTL table of a replacegmergee engine is created according to a service requirement;
s2, adding a data de-duplication updated identification field in the TTL table; in this embodiment, the primary key is used as the identification field; in each group of data, the row with the maximum version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data; the custom deletion flag bit deleteflag, 0 indicates no deletion, and 1 indicates deletion data.
S3, triggering synchronous data increasing operation at set time intervals; in this embodiment, the set time is 1 min;
s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse; in the step S4, in the step S,
when data is updated, a row of new data in insert data is updated, the version number field create _ time is the latest insertion time of the system, and a modified field value is obtained by inquiring the maximum version number field create _ time;
when data is deleted, a row of new data in insert data is stored, the version number field create _ time is the latest insertion time of the system, deleteflag is changed into 1, the rest is 0, and the deleted data can be divided by inquiring the deleteflag state of the maximum version number field create _ time;
referring to fig. 1, the data is derived from databases such as Mongddb, oracle, hive, and the like, and after incremental data is synchronized by tools such as DateX or waterdrop, the operation in step S4 is performed;
s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records;
in the present embodiment, this is achieved by executing the following program:
CREATE VIEW view_table_a AS
SELECT uniqueId,
argMax(column1,create_time)AS column1,
argMax(column2,create_time)AS column2,
argMax(columnN,create_time)AS columnN from table GROUP BY uniqueId HAVING deleted=0;
and S6, modifying the duration of the reserved data of the TTL table according to the service requirement, and deleting the outdated data regularly according to the version number.
Example 2
Referring to fig. 2 and fig. 3, the present invention discloses a solution for rapidly providing a mass data query service, and specifically, the method includes the following steps:
s1, creating a TTL table;
in step S1, a TTL table of a replacegmergee engine is created according to a service requirement;
s2, adding a data de-duplication updated identification field in the TTL table; in this embodiment, the primary key is used as the identification field;
in each group of data, the row with the maximum version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data; the custom deletion flag bit deleteflag, 0 indicates no deletion, and 1 indicates deletion data.
S3, triggering synchronous data increasing operation when the number of data reaches the set number; in this embodiment, the number of the set data pieces is 5000;
s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse; in the step S4, in the step S,
when data is updated, a row of new data in insert data is updated, the version number field create _ time is the latest insertion time of the system, and a modified field value is obtained by inquiring the maximum version number field create _ time;
when deleting data, a row of new data in insert data, the version number field create _ time is the latest insertion time of the system, deleteflag is changed into 1, the rest is 0, and the deleted data can be divided by inquiring the deleteflag state of the maximum version number field create _ time.
Referring to fig. 1, the data is derived from databases such as Mongddb, oracle, hive, and the like, and after incremental data is synchronized by tools such as DateX or waterdrop, the operation in step S4 is performed;
s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records;
in the present embodiment, this is achieved by executing the following program:
CREATE VIEW view_table_a AS
SELECT uniqueId,
argMax(column1,create_time)AS column1,
argMax(column2,create_time)AS column2,
argMax(columnN,create_time)AS columnN from table GROUP BY uniqueId HAVING deleted=0;
and S6, modifying the duration of the reserved data of the TTL table according to the service requirement, and deleting the outdated data regularly according to the version number.
With the above embodiment, the present invention discloses a solution for quickly providing mass data query service, and has the following advantages: firstly, a millisecond-level response inquiry request can be realized; secondly, updated data can be inquired in real time, required data can be returned according to requirements, and the problem that a plurality of redundant data appear before updating is triggered is solved; thirdly, the deleted data can be found after the deletion operation; fourthly, the problem that asynchronous data combination is continuously carried out on the bottom layer of ClickHouse line-by-line insert or small-batch insert, update and delete operation to influence query performance is solved.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A solution for rapidly providing mass data query service is characterized in that the method comprises the following steps:
s1, creating a TTL table;
s2, adding a data de-duplication updated identification field in the TTL table;
s3, triggering synchronous data increasing operation when the interval set time or the number of data reaches the set number;
s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse;
s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records;
and S6, modifying the duration of the reserved data of the TTL table according to the service requirement, and deleting the outdated data regularly according to the version number.
2. The method as claimed in claim 1, wherein in step S1, a TTL table of a replacegmergree engine is created according to a service requirement.
3. The method as claimed in claim 1, wherein in step S2, the primary key is used as the identification field.
4. The solution for rapidly providing query service for mass data as claimed in claim 1, wherein in step S2, in each group of data, the row with the maximum version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data.
5. The method as claimed in claim 4, wherein in step S2, the custom deletion flag bit deleteflag, 0 indicates not deleted and 1 indicates deleted data.
6. The solution for rapidly providing mass data query service as claimed in claim 5, wherein, in step S4,
when data is updated, a row of new data in insert data is updated, the version number field create _ time is the latest insertion time of the system, and a modified field value is obtained by inquiring the maximum version number field create _ time;
when deleting data, a row of new data in insert data, the version number field create _ time is the latest insertion time of the system, deleteflag is changed into 1, the rest is 0, and the deleted data can be divided by inquiring the deleteflag state of the maximum version number field create _ time.
7. The method as claimed in claim 1, wherein in step S3, the time is set to be 1 min.
8. The method as claimed in claim 1, wherein in step S3, the number of data is 5000.
CN202011529950.5A 2020-12-22 2020-12-22 Solution for rapidly providing mass data query service Pending CN112632068A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011529950.5A CN112632068A (en) 2020-12-22 2020-12-22 Solution for rapidly providing mass data query service

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011529950.5A CN112632068A (en) 2020-12-22 2020-12-22 Solution for rapidly providing mass data query service

Publications (1)

Publication Number Publication Date
CN112632068A true CN112632068A (en) 2021-04-09

Family

ID=75320958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011529950.5A Pending CN112632068A (en) 2020-12-22 2020-12-22 Solution for rapidly providing mass data query service

Country Status (1)

Country Link
CN (1) CN112632068A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114706836A (en) * 2022-03-29 2022-07-05 中国科学院软件研究所 Data life cycle management method based on airborne embedded database
CN116483873A (en) * 2023-06-25 2023-07-25 云筑信息科技(成都)有限公司 Method for realizing historical snapshot data query based on change data
CN117331513A (en) * 2023-12-01 2024-01-02 蒲惠智造科技股份有限公司 Data reduction method and system based on Hadoop architecture

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114706836A (en) * 2022-03-29 2022-07-05 中国科学院软件研究所 Data life cycle management method based on airborne embedded database
CN114706836B (en) * 2022-03-29 2023-01-10 中国科学院软件研究所 Data life cycle management method based on airborne embedded database
CN116483873A (en) * 2023-06-25 2023-07-25 云筑信息科技(成都)有限公司 Method for realizing historical snapshot data query based on change data
CN117331513A (en) * 2023-12-01 2024-01-02 蒲惠智造科技股份有限公司 Data reduction method and system based on Hadoop architecture
CN117331513B (en) * 2023-12-01 2024-03-19 蒲惠智造科技股份有限公司 Data reduction method and system based on Hadoop architecture

Similar Documents

Publication Publication Date Title
CN112632068A (en) Solution for rapidly providing mass data query service
CN110825748B (en) High-performance and easily-expandable key value storage method by utilizing differentiated indexing mechanism
CN109299113B (en) Range query method with storage-aware mixed index
US10331641B2 (en) Hash database configuration method and apparatus
CN102169507B (en) Implementation method of distributed real-time search engine
US9875024B2 (en) Efficient block-level space allocation for multi-version concurrency control data
CA2723731C (en) Managing storage of individually accessible data units
US7418544B2 (en) Method and system for log structured relational database objects
US9495398B2 (en) Index for hybrid database
CN103595797B (en) Caching method for distributed storage system
US10289709B2 (en) Interleaved storage of dictionary blocks in a page chain
CN111427847A (en) Indexing and query method and system for user-defined metadata
CN102968456B (en) A kind of raster data reading and processing method and device
CN111427885B (en) Database management method and device based on lookup table
US20170316042A1 (en) Index page with latch-free access
CN111857582B (en) Key value storage system
CN112269802A (en) Method and system for frequent deletion, modification and check optimization based on Clickhouse
CN116821127A (en) Method for realizing hash index of kv stored distributed database
CN113672583B (en) Big data multi-data source analysis method and system based on storage and calculation separation
WO2015129109A1 (en) Index management device
CN114895850A (en) Method for optimizing writing of data lake
CN114357104A (en) Non-index map slice polymerization method
CN116226038A (en) File system acceleration method and system based on LevelDB
AU2015258326B2 (en) Managing storage of individually accessible data units
CN117633105A (en) Time-series data storage management method and system based on time partition index

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination