CN112632068A - Solution for rapidly providing mass data query service - Google Patents
Solution for rapidly providing mass data query service Download PDFInfo
- Publication number
- CN112632068A CN112632068A CN202011529950.5A CN202011529950A CN112632068A CN 112632068 A CN112632068 A CN 112632068A CN 202011529950 A CN202011529950 A CN 202011529950A CN 112632068 A CN112632068 A CN 112632068A
- Authority
- CN
- China
- Prior art keywords
- data
- time
- version number
- field
- updated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a solution for rapidly providing mass data query service, relating to the technical field of data query; the method comprises the following steps: s1, creating a TTL table; s2, adding a data de-duplication updated identification field in the TTL table; s3, triggering synchronous data increasing operation when the interval set time or the number of data reaches the set number; s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse; s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records; s6, modifying the duration of the TTL table reserved data according to the service requirement, and deleting the outdated data regularly according to the version number; the invention has the beneficial effects that: the method can inquire the updated data in real time, return the required data according to the requirements and solve the problem that a plurality of redundant data appear before the update is triggered.
Description
Technical Field
The invention relates to the technical field of data query, in particular to a solution for rapidly providing mass data query service.
Background
The rapid development of services enables the data volume to rapidly increase, and the rapidly increasing data gradually becomes a bottleneck for providing service efficiency. A second-level display is substantially unlikely to be possible for a ten million-level tabular relational database.
Hbase, elastic search, drive, Tidb, Mongodb and ClickHouse are common for storing massive data at present, wherein:
(1) HBase: the column type storage brings great limitation to data reading, query instantaneity is general, all queries need to depend on Key, HBase is not suitable for application scenarios with complex data structures and complex queries.
(2) Field type cannot be modified, write performance is low, high performance resource consumption is achieved, and format re-modification may be required if the elastic search is searched. When aggregation is performed, the ES cannot achieve 100% accuracy due to the adoption of an approximate algorithm.
(3) Drive does not support data updating; at present, the deduplication is non-precise and does not support the association operation.
(4) And the Tidb requires an SSD solid-state disk with high requirements on production environment hardware, and has high requirements on the use of a CPU and a memory and a network.
(5) The Mongodb and the cdh cluster data are in an isolated state, and a data isolated island is easily generated along with the development of services. Split migration of blocks can have performance impact on the system when the amount of data is large.
(6) ClickHouse, the lack of high frequency, low latency ability to modify or delete existing data; can only be used for bulk deletion or modification of data.
The advantages of the ClickHouse in the aspects of high throughput writing, disk storage, high compression ratio, multi-core parallel processing, quick response to query requests, complex data type support and the like are outstanding, and the ClickHouse is selected as the data storage to provide data query service.
However, there are still some problems: in the absence of the capability of modifying or deleting existing data with high frequency and low delay, the table engine of the MergeTree series supports data updating, but the updating triggering time only occurs at the time of partition merging, the table engine can only ensure final consistency, and a great deal of redundant data can be inquired before partition merging.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a solution for rapidly providing mass data query service, which can query updated data in real time, return required data according to requirements and solve the problem that a plurality of redundant data appear before the update is triggered.
The technical scheme adopted by the invention for solving the technical problems is as follows: in a solution for providing a fast mass data query service, the improvement comprising the steps of:
s1, creating a TTL table;
s2, adding a data de-duplication updated identification field in the TTL table;
s3, triggering synchronous data increasing operation when the interval set time or the number of data reaches the set number;
s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse;
s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records;
and S6, modifying the duration of the reserved data of the TTL table according to the service requirement, and deleting the outdated data regularly according to the version number.
Further, in step S1, a TTL table of the replacegmergee engine is created according to the service requirement.
Further, in step S2, the primary key is used as the identification field.
Further, in step S2, in each group of data, the row with the largest version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data.
Further, in step S2, the custom deletion flag bit deleteflag, 0 indicates that the data is not deleted, and 1 indicates that the data is deleted.
Further, in step S4,
when data is updated, a row of new data in insert data is updated, the version number field create _ time is the latest insertion time of the system, and a modified field value is obtained by inquiring the maximum version number field create _ time;
when deleting data, a row of new data in insert data, the version number field create _ time is the latest insertion time of the system, deleteflag is changed into 1, the rest is 0, and the deleted data can be divided by inquiring the deleteflag state of the maximum version number field create _ time.
Further, in step S3, the set time is 1 min.
Further, in step S3, the number of data pieces set is 5000.
The invention has the beneficial effects that: the millisecond-level response to the query request can be realized; the updated data can be inquired in real time, the required data can be returned according to the requirement, and the problem that a plurality of redundant data appear before the update is triggered is solved; the deleted data can also be retrieved after the delete operation.
Drawings
Fig. 1 is a schematic flow chart of a solution for rapidly providing mass data query service according to the present invention.
Fig. 2 is a diagram of an embodiment of a solution for rapidly providing mass data query service according to the present invention.
Fig. 3 is a schematic flow chart of a solution for rapidly providing mass data query service according to the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The conception, the specific structure, and the technical effects produced by the present invention will be clearly and completely described below in conjunction with the embodiments and the accompanying drawings to fully understand the objects, the features, and the effects of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and those skilled in the art can obtain other embodiments without inventive effort based on the embodiments of the present invention, and all embodiments are within the protection scope of the present invention. In addition, all the connection/connection relations referred to in the patent do not mean that the components are directly connected, but mean that a better connection structure can be formed by adding or reducing connection auxiliary components according to specific implementation conditions. All technical characteristics in the invention can be interactively combined on the premise of not conflicting with each other.
Example 1
Referring to fig. 1 and 2, the present invention discloses a solution for rapidly providing a mass data query service, and specifically, the method includes the following steps:
s1, creating a TTL table;
in step S1, a TTL table of a replacegmergee engine is created according to a service requirement;
s2, adding a data de-duplication updated identification field in the TTL table; in this embodiment, the primary key is used as the identification field; in each group of data, the row with the maximum version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data; the custom deletion flag bit deleteflag, 0 indicates no deletion, and 1 indicates deletion data.
S3, triggering synchronous data increasing operation at set time intervals; in this embodiment, the set time is 1 min;
s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse; in the step S4, in the step S,
when data is updated, a row of new data in insert data is updated, the version number field create _ time is the latest insertion time of the system, and a modified field value is obtained by inquiring the maximum version number field create _ time;
when data is deleted, a row of new data in insert data is stored, the version number field create _ time is the latest insertion time of the system, deleteflag is changed into 1, the rest is 0, and the deleted data can be divided by inquiring the deleteflag state of the maximum version number field create _ time;
referring to fig. 1, the data is derived from databases such as Mongddb, oracle, hive, and the like, and after incremental data is synchronized by tools such as DateX or waterdrop, the operation in step S4 is performed;
s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records;
in the present embodiment, this is achieved by executing the following program:
CREATE VIEW view_table_a AS
SELECT uniqueId,
argMax(column1,create_time)AS column1,
argMax(column2,create_time)AS column2,
argMax(columnN,create_time)AS columnN from table GROUP BY uniqueId HAVING deleted=0;
and S6, modifying the duration of the reserved data of the TTL table according to the service requirement, and deleting the outdated data regularly according to the version number.
Example 2
Referring to fig. 2 and fig. 3, the present invention discloses a solution for rapidly providing a mass data query service, and specifically, the method includes the following steps:
s1, creating a TTL table;
in step S1, a TTL table of a replacegmergee engine is created according to a service requirement;
s2, adding a data de-duplication updated identification field in the TTL table; in this embodiment, the primary key is used as the identification field;
in each group of data, the row with the maximum version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data; the custom deletion flag bit deleteflag, 0 indicates no deletion, and 1 indicates deletion data.
S3, triggering synchronous data increasing operation when the number of data reaches the set number; in this embodiment, the number of the set data pieces is 5000;
s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse; in the step S4, in the step S,
when data is updated, a row of new data in insert data is updated, the version number field create _ time is the latest insertion time of the system, and a modified field value is obtained by inquiring the maximum version number field create _ time;
when deleting data, a row of new data in insert data, the version number field create _ time is the latest insertion time of the system, deleteflag is changed into 1, the rest is 0, and the deleted data can be divided by inquiring the deleteflag state of the maximum version number field create _ time.
Referring to fig. 1, the data is derived from databases such as Mongddb, oracle, hive, and the like, and after incremental data is synchronized by tools such as DateX or waterdrop, the operation in step S4 is performed;
s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records;
in the present embodiment, this is achieved by executing the following program:
CREATE VIEW view_table_a AS
SELECT uniqueId,
argMax(column1,create_time)AS column1,
argMax(column2,create_time)AS column2,
argMax(columnN,create_time)AS columnN from table GROUP BY uniqueId HAVING deleted=0;
and S6, modifying the duration of the reserved data of the TTL table according to the service requirement, and deleting the outdated data regularly according to the version number.
With the above embodiment, the present invention discloses a solution for quickly providing mass data query service, and has the following advantages: firstly, a millisecond-level response inquiry request can be realized; secondly, updated data can be inquired in real time, required data can be returned according to requirements, and the problem that a plurality of redundant data appear before updating is triggered is solved; thirdly, the deleted data can be found after the deletion operation; fourthly, the problem that asynchronous data combination is continuously carried out on the bottom layer of ClickHouse line-by-line insert or small-batch insert, update and delete operation to influence query performance is solved.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (8)
1. A solution for rapidly providing mass data query service is characterized in that the method comprises the following steps:
s1, creating a TTL table;
s2, adding a data de-duplication updated identification field in the TTL table;
s3, triggering synchronous data increasing operation when the interval set time or the number of data reaches the set number;
s4, converting the updated data and the deleted data into insert data, and inserting the insert data into clickhouse;
s5, creating a corresponding view, selecting a field with the maximum version number as a display field, and filtering the deleted records;
and S6, modifying the duration of the reserved data of the TTL table according to the service requirement, and deleting the outdated data regularly according to the version number.
2. The method as claimed in claim 1, wherein in step S1, a TTL table of a replacegmergree engine is created according to a service requirement.
3. The method as claimed in claim 1, wherein in step S2, the primary key is used as the identification field.
4. The solution for rapidly providing query service for mass data as claimed in claim 1, wherein in step S2, in each group of data, the row with the maximum version number field create _ time identifies the latest data, and the default value of the version number field create _ time is the insertion time of the data.
5. The method as claimed in claim 4, wherein in step S2, the custom deletion flag bit deleteflag, 0 indicates not deleted and 1 indicates deleted data.
6. The solution for rapidly providing mass data query service as claimed in claim 5, wherein, in step S4,
when data is updated, a row of new data in insert data is updated, the version number field create _ time is the latest insertion time of the system, and a modified field value is obtained by inquiring the maximum version number field create _ time;
when deleting data, a row of new data in insert data, the version number field create _ time is the latest insertion time of the system, deleteflag is changed into 1, the rest is 0, and the deleted data can be divided by inquiring the deleteflag state of the maximum version number field create _ time.
7. The method as claimed in claim 1, wherein in step S3, the time is set to be 1 min.
8. The method as claimed in claim 1, wherein in step S3, the number of data is 5000.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011529950.5A CN112632068A (en) | 2020-12-22 | 2020-12-22 | Solution for rapidly providing mass data query service |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011529950.5A CN112632068A (en) | 2020-12-22 | 2020-12-22 | Solution for rapidly providing mass data query service |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112632068A true CN112632068A (en) | 2021-04-09 |
Family
ID=75320958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011529950.5A Pending CN112632068A (en) | 2020-12-22 | 2020-12-22 | Solution for rapidly providing mass data query service |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112632068A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114706836A (en) * | 2022-03-29 | 2022-07-05 | 中国科学院软件研究所 | Data life cycle management method based on airborne embedded database |
CN116483873A (en) * | 2023-06-25 | 2023-07-25 | 云筑信息科技(成都)有限公司 | Method for realizing historical snapshot data query based on change data |
CN117331513A (en) * | 2023-12-01 | 2024-01-02 | 蒲惠智造科技股份有限公司 | Data reduction method and system based on Hadoop architecture |
-
2020
- 2020-12-22 CN CN202011529950.5A patent/CN112632068A/en active Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114706836A (en) * | 2022-03-29 | 2022-07-05 | 中国科学院软件研究所 | Data life cycle management method based on airborne embedded database |
CN114706836B (en) * | 2022-03-29 | 2023-01-10 | 中国科学院软件研究所 | Data life cycle management method based on airborne embedded database |
CN116483873A (en) * | 2023-06-25 | 2023-07-25 | 云筑信息科技(成都)有限公司 | Method for realizing historical snapshot data query based on change data |
CN117331513A (en) * | 2023-12-01 | 2024-01-02 | 蒲惠智造科技股份有限公司 | Data reduction method and system based on Hadoop architecture |
CN117331513B (en) * | 2023-12-01 | 2024-03-19 | 蒲惠智造科技股份有限公司 | Data reduction method and system based on Hadoop architecture |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112632068A (en) | Solution for rapidly providing mass data query service | |
CN110825748B (en) | High-performance and easily-expandable key value storage method by utilizing differentiated indexing mechanism | |
CN109299113B (en) | Range query method with storage-aware mixed index | |
US10331641B2 (en) | Hash database configuration method and apparatus | |
CN102169507B (en) | Implementation method of distributed real-time search engine | |
US9875024B2 (en) | Efficient block-level space allocation for multi-version concurrency control data | |
CA2723731C (en) | Managing storage of individually accessible data units | |
US7418544B2 (en) | Method and system for log structured relational database objects | |
US9495398B2 (en) | Index for hybrid database | |
CN103595797B (en) | Caching method for distributed storage system | |
US10289709B2 (en) | Interleaved storage of dictionary blocks in a page chain | |
CN111427847A (en) | Indexing and query method and system for user-defined metadata | |
CN102968456B (en) | A kind of raster data reading and processing method and device | |
CN111427885B (en) | Database management method and device based on lookup table | |
US20170316042A1 (en) | Index page with latch-free access | |
CN111857582B (en) | Key value storage system | |
CN112269802A (en) | Method and system for frequent deletion, modification and check optimization based on Clickhouse | |
CN116821127A (en) | Method for realizing hash index of kv stored distributed database | |
CN113672583B (en) | Big data multi-data source analysis method and system based on storage and calculation separation | |
WO2015129109A1 (en) | Index management device | |
CN114895850A (en) | Method for optimizing writing of data lake | |
CN114357104A (en) | Non-index map slice polymerization method | |
CN116226038A (en) | File system acceleration method and system based on LevelDB | |
AU2015258326B2 (en) | Managing storage of individually accessible data units | |
CN117633105A (en) | Time-series data storage management method and system based on time partition index |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |