CN116010668B - Quick search method and system applied to database - Google Patents

Publication number
CN116010668B
Authority
CN
China
Prior art keywords
query
predicate
data
level metadata
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310281123.6A
Other languages
Chinese (zh)
Other versions
CN116010668A (en
Inventor
简勇华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Deepexi Technology Co Ltd
Original Assignee
Beijing Deepexi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Deepexi Technology Co Ltd filed Critical Beijing Deepexi Technology Co Ltd
Priority to CN202310281123.6A priority Critical patent/CN116010668B/en
Publication of CN116010668A publication Critical patent/CN116010668A/en
Application granted granted Critical
Publication of CN116010668B publication Critical patent/CN116010668B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a quick search method and a quick search system applied to a database. The method comprises the following steps: receiving a query operation with predicates from a user; acquiring snapshot data; and outputting, based on the snapshot data and the query operation, the final row data that satisfies the predicate conditions. The snapshot data is constructed as follows: an initial snapshot is generated after the table creation operation is performed, and a further snapshot is generated each time a batch of data is written and committed. The snapshot includes a number of partition-level metadata records. Each partition-level metadata record corresponds to a partition-level metadata file, whose partition-level statistics are generated by aggregating the file-level metadata records contained in that file. Each file-level metadata record corresponds to one underlying data file. The quick search method applied to a database enables retrieval over the data of an enterprise data lake, so that the data that needs to be used can be acquired quickly.

Description

Quick search method and system applied to database
Technical Field
The invention relates to the technical field of data retrieval, in particular to a rapid retrieval method and system applied to a database.
Background
The data lake concept arose from the challenges enterprises face in handling and storing data.
Initially, each application generated and stored large amounts of data that other applications could not use, which created data silos. Data marts then emerged: the data generated by applications was stored in a centralized data repository, from which relevant data could be exported as needed and delivered to the departments or individuals within the enterprise that required it.
However, data marts solve only part of the problem. The remaining issues, including data management, data ownership, and access control, still need to be resolved as enterprises seek greater capacity to use valid data.
In order to solve the above-mentioned problems, enterprises have a strong demand for building their own data lakes, which can store not only conventional types of data but also any other types of data, and further process and analyze them thereon to produce final outputs for consumption by various programs.
A data lake is a large repository that stores a wide variety of raw data for an enterprise, where the data is available for access, processing, analysis, and transmission.
The data lake obtains raw data from multiple data sources of the enterprise, and for different purposes, the same piece of raw data may also have multiple copies of data that satisfy a particular internal model format. Thus, the data processed in the data lake may be any type of information, from structured data to completely unstructured data.
In order to realize the use of data in a data lake, enterprise users need a quick search method to realize quick acquisition of the data to be used.
Disclosure of Invention
The invention aims to provide a rapid retrieval method applied to a database that enables retrieval over the data of an enterprise data lake and rapid acquisition of the data that needs to be used.
An embodiment of the invention provides a rapid search method applied to a database, which comprises the following steps:
receiving a query operation with predicates from a user;
acquiring snapshot data;
outputting, based on the snapshot data and the query operation, the final row data that satisfies the predicate conditions;
the snapshot data is constructed through the following steps:
generating an initial snapshot after the table creation operation is performed;
generating a snapshot each time a batch of data is written and committed.
The snapshot includes a number of partition-level metadata records.
Each partition-level metadata record corresponds to a partition-level metadata file, whose partition-level statistics are generated by aggregating the file-level metadata records contained in that file.
Each file-level metadata record corresponds to one underlying data file.
Preferably, outputting the final row data that satisfies the predicate conditions based on the snapshot data and the query operation includes:
parsing the query operation to determine the query predicate;
parsing the snapshot data to determine each partition-level metadata record and the partition column predicate of its corresponding partition-level metadata;
determining a partition-level metadata record based on the query predicate and the partition column predicates of the partition-level metadata, and taking the determined partition-level metadata record as the target partition;
parsing the target partition to determine each file-level metadata record and the partition column predicate of its corresponding file-level metadata;
determining a file-level metadata record based on the query predicate and the partition column predicates of the file-level metadata, and taking the determined file-level metadata record as the target file;
generating a first residual predicate based on the partition column predicate of the partition-level metadata of the target partition, the partition column predicate of the file-level metadata corresponding to the target file, and the query predicate;
parsing the target file to determine each piece of group-level metadata and its corresponding second residual predicate;
determining group-level metadata based on the first residual predicate and the second residual predicates, and taking the determined group-level metadata as the target group;
parsing the target group to determine each row of data and its corresponding third residual predicate;
determining row data based on the third residual predicates and the first residual predicate, and outputting the determined row data as the final row data.
Preferably, acquiring snapshot data includes:
parsing the query operation to determine the query predicate;
acquiring the predicate set corresponding to each snapshot data in a snapshot library;
extracting the corresponding snapshot data when the query predicate exists in its predicate set.
Preferably, before outputting the final row data that satisfies the predicate conditions based on the snapshot data and the query operation, the method further comprises:
acquiring the user's historical query operation records;
analyzing the historical query operation records to determine a first association relation between the query predicate in each historical query record and the query predicate of the current query operation;
acquiring a second association relation between the snapshot data holding the query result of the query predicate in each historical query record and each other snapshot data in the snapshot library;
determining the priority of each acquired snapshot data based on the first association relation and the second association relation;
determining the query order of the acquired snapshot data in descending order of priority.
Preferably, analyzing the historical query operation records, determining a first association relationship between the query predicates in each historical query record and the query predicates of the current query operation, including:
based on a preset quantization template, quantizing the time difference between each historical query record and the current moment to obtain a first association parameter;
calculating first similarity between query predicates in each historical query record and query predicates of the current query operation, and taking the first similarity as a second association parameter;
and taking the first association parameter and the second association parameter as a first association relation.
Preferably, determining the priority of each acquired snapshot data based on the first association relation and the second association relation includes:
looking up a preset weight table based on the first association relation to determine the weight of the query predicate corresponding to each historical query record;
looking up a preset association value table based on the second association relation to determine the association value between the snapshot data holding the query result of each historical query record and the acquired snapshot data;
calculating the priority of the snapshot data from the weight of the query predicate corresponding to each historical query record and the association value between the snapshot data holding the query result of each historical query record and the acquired snapshot data, using the following formula:
[Formula image SMS_1; not reproduced in the extracted text]
In the formula, the symbol in image SMS_2 denotes the priority of the snapshot data; the symbol in image SMS_3 is the weight of the query predicate corresponding to the historical query record denoted by the symbol in image SMS_4; and the symbol in image SMS_5 is the association value between the snapshot data holding the query result of the historical query record denoted by the symbol in image SMS_6 and the acquired snapshot data.
Preferably, the association relation between any two snapshot data in the snapshot library is determined as follows:
quantizing, based on a preset quantization template, the difference between the generation times of the two snapshot data to obtain a third association parameter;
calculating a second similarity between the predicate sets of the two snapshot data, and taking the second similarity as a fourth association parameter;
taking the third association parameter and the fourth association parameter as the association relation between the two snapshot data.
The invention also provides a rapid retrieval system applied to a database, comprising:
a receiving module, used for receiving a query operation with predicates from a user;
an acquisition module, used for acquiring snapshot data;
an output module, used for outputting, based on the snapshot data and the query operation, the final row data that satisfies the predicate conditions;
the snapshot data is constructed through the following steps:
generating an initial snapshot after the table creation operation is performed;
generating a snapshot each time a batch of data is written and committed.
The snapshot includes a number of partition-level metadata records.
Each partition-level metadata record corresponds to a partition-level metadata file, whose partition-level statistics are generated by aggregating the file-level metadata records contained in that file.
Each file-level metadata record corresponds to one underlying data file.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a schematic diagram of a fast search method applied to a database according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating snapshot generation according to an embodiment of the present invention;
FIG. 3 is a first level block diagram of a snapshot in an embodiment of the present invention;
FIG. 4 is a first level block diagram of yet another snapshot in an embodiment of the present invention;
FIG. 5 is a schematic diagram of intra-file grouping in an embodiment of the invention;
FIG. 6 is a diagram illustrating a search step according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
An embodiment of the invention provides a quick search method applied to a database which, as shown in fig. 1, comprises the following steps:
step S1: receiving a query operation with predicates from a user;
step S2: acquiring snapshot data;
step S3: outputting, based on the snapshot data and the query operation, the final row data that satisfies the predicate conditions.
The snapshot data is constructed through the following steps:
generating an initial snapshot after the table creation operation is performed;
generating a snapshot each time a batch of data is written and committed. As shown in fig. 2, after the table creation operation is performed, an initial snapshot is generated that carries metadata such as the table structure information, table partition information, and table attribute fields. After the first batch of data is written and committed, a first snapshot of the table is generated; after the second batch is written and committed, a second snapshot is generated; and so on. Snapshot-based commits serve as the underlying storage method for multi-layer pruning retrieval.
The snapshot includes a number of partition-level metadata records. The metadata of adjacent partitions is organized into the same partition-level metadata record (one record covers several adjacent partitions), and the min, max, and null values of the partition columns are recorded and used to filter the partition-column predicates in the query predicate. As shown in fig. 3, a table t1 is created with three columns, (id, int), (age, int), and (addr, varchar), with id as the partition column. After a number of records with id = 1 to 9 are inserted, the snapshot holds three partition-level metadata records, corresponding to partitions 1 to 3, 4 to 6, and 7 to 9 respectively. For a query with a predicate such as 3 < id < 7, the min and max values of the id column in the partition-level metadata records locate the second record, which completes the partition-level pruning operation.
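The partition-level pruning of the t1 example can be sketched as follows. This is a minimal illustration under assumptions, not the patented implementation: the `PartitionRecord` class, its field names, and the `prune_partitions` helper are hypothetical, and only the min and max of the id column are modeled.

```python
from dataclasses import dataclass


@dataclass
class PartitionRecord:
    """One partition-level metadata record (hypothetical field names)."""
    name: str
    id_min: int
    id_max: int


def prune_partitions(records, low, high):
    """Keep records whose [id_min, id_max] may hold an id with low < id < high.

    The overlap test is conservative: it never drops a record that could
    match, but it may keep one that turns out to hold no matching row.
    """
    return [r for r in records if r.id_max > low and r.id_min < high]


records = [
    PartitionRecord("partitions 1-3", 1, 3),
    PartitionRecord("partitions 4-6", 4, 6),
    PartitionRecord("partitions 7-9", 7, 9),
]

# Query predicate from the example: 3 < id < 7
targets = prune_partitions(records, low=3, high=7)
print([r.name for r in targets])
```

Only the second record survives, so the partition-level metadata files of the other two records never need to be opened.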
Each partition-level metadata record corresponds to a partition-level metadata file, whose partition-level statistics are generated by aggregating the file-level metadata records contained in that file. A file-level metadata record contains statistics about an actually generated data file, including the min, max, and null values of each non-partition column of the table within that data file, and is used to filter the data-column predicates in the query predicate. As shown in fig. 4, for a query with predicates such as 3 < id < 7 and 42 < age < 45, the min and max values of the partition column id in the partition-level metadata records first locate the second partition-level metadata record. The partition-level metadata file corresponding to that record is then opened, and file-level filtering is performed using the file-level metadata in it. The file-level metadata record for file 4 shows that the (min, max) values of the non-partition column age are (41, 50), a range that overlaps the queried range. Predicate filtering on the non-partition column using the file-level metadata record thus completes the file-level pruning operation.
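A file-level pruning pass over column statistics might look like the sketch below. It assumes a hypothetical in-memory layout in which each file-level metadata record is a dictionary of per-column (min, max) pairs; the age range of file 4 comes from the example above, while file 3 and its statistics are invented purely for contrast.

```python
def may_match(stats, ranges):
    """Return False only if some range predicate cannot be satisfied.

    stats:  {column: (min, max)} taken from a file-level metadata record
    ranges: {column: (low, high)} meaning low < column < high
    """
    for column, (low, high) in ranges.items():
        if column not in stats:
            continue  # no statistics for this column, so it cannot prune
        col_min, col_max = stats[column]
        if col_max <= low or col_min >= high:
            return False
    return True


file_records = {
    "file 3": {"age": (20, 40)},  # hypothetical file, not in the source
    "file 4": {"age": (41, 50)},  # range given in the fig. 4 example
}

# Non-partition part of the query predicate: 42 < age < 45
query_ranges = {"age": (42, 45)}
target_files = [name for name, stats in file_records.items()
                if may_match(stats, query_ranges)]
print(target_files)
```

Skipping columns that have no statistics keeps the check safe: a file is excluded only when its recorded range proves that no row can match.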
Each file-level metadata record corresponds to one underlying data file. Data within a data file can be stored in groups, with statistical metadata generated for each group. After the file-level records have been filtered as described above, the corresponding data file is opened, the metadata of each group is read, and the groups are filtered according to the predicates. As shown in fig. 5, for a predicate filter such as 3 < id < 7, 42 < age < 45, and addr = jiangsu, reading the statistics of the groups in the file identifies group 1 in file 4, which completes filtering at the intra-file group level. Note that this level of pruning filters with the residual predicate (the filtering condition that remains after the partition column predicates are removed from the query predicate); the partition column predicate 3 < id < 7 is no longer needed, because the data rows in a file all belong to one partition.
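Group-level filtering with the residual predicate can be illustrated as below. The group statistics are hypothetical (the source does not list them), and representing the addr statistics as a lexicographic (min, max) pair is an assumption; note that the partition column id no longer appears in the check.

```python
def group_may_match(stats, age_range, addr):
    """Test one group's statistics against the residual predicate
    low < age < high and addr == <value>."""
    age_min, age_max = stats["age"]
    addr_min, addr_max = stats["addr"]
    low, high = age_range
    age_ok = age_max > low and age_min < high
    addr_ok = addr_min <= addr <= addr_max  # string comparison on min/max
    return age_ok and addr_ok


# Hypothetical per-group statistics for file 4
groups = {
    "group 1": {"age": (41, 45), "addr": ("anhui", "jiangsu")},
    "group 2": {"age": (46, 50), "addr": ("beijing", "shanghai")},
}

# Residual predicate: 42 < age < 45 and addr = jiangsu (id predicate removed)
target_groups = [name for name, stats in groups.items()
                 if group_may_match(stats, (42, 45), "jiangsu")]
print(target_groups)
```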
In one embodiment, outputting the final row data that satisfies the predicate conditions based on the snapshot data and the query operation includes:
parsing the query operation to determine the query predicate;
parsing the snapshot data to determine each partition-level metadata record and the partition column predicate of its corresponding partition-level metadata;
determining a partition-level metadata record based on the query predicate and the partition column predicates of the partition-level metadata, and taking the determined partition-level metadata record as the target partition;
parsing the target partition to determine each file-level metadata record and the partition column predicate of its corresponding file-level metadata;
determining a file-level metadata record based on the query predicate and the partition column predicates of the file-level metadata, and taking the determined file-level metadata record as the target file;
generating a first residual predicate based on the partition column predicate of the partition-level metadata of the target partition, the partition column predicate of the file-level metadata corresponding to the target file, and the query predicate;
parsing the target file to determine each piece of group-level metadata and its corresponding second residual predicate;
determining group-level metadata based on the first residual predicate and the second residual predicates, and taking the determined group-level metadata as the target group;
parsing the target group to determine each row of data and its corresponding third residual predicate;
determining row data based on the third residual predicates and the first residual predicate, and outputting the determined row data as the final row data.
The working principle and the beneficial effects of the technical scheme are as follows:
as shown in fig. 6, after the groups in the data file have been filtered, the target group is opened, all data rows in it are obtained, and a final row-level filter is applied according to the predicates; once this level of filtering is complete, the data finally being queried is obtained.
It should be noted that this level of pruning also filters with the residual predicate: in effect, 42 < age < 45 and addr = jiangsu is applied to filter out the rows with age = 40, 41, 42, 45 in group 1 of file 4.
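The final row-level step amounts to evaluating the residual predicate on every row of the target group. The sketch below uses invented rows: the ages 40, 41, 42, and 45 echo the values named above, while ages 43 and 44 and all id and addr values are hypothetical additions so that some rows survive.

```python
# Hypothetical rows of the target group; only the listed ages 40, 41, 42, 45
# come from the text, the rest is made up for illustration.
rows = [{"id": 4, "age": age, "addr": "jiangsu"}
        for age in (40, 41, 42, 43, 44, 45)]


def row_matches(row):
    """Residual predicate: 42 < age < 45 and addr = jiangsu."""
    return 42 < row["age"] < 45 and row["addr"] == "jiangsu"


final_rows = [row for row in rows if row_matches(row)]
print([row["age"] for row in final_rows])
```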
In addition, the multi-layer metadata organization of a table may not fully match the predicate filtering conditions. For example, filtering only on the addr column performs poorly, because partition-level pruning cannot be applied to the table; the partition columns should therefore be chosen at table creation time according to the predicate filtering conditions that are used most often. Multiple data files may also contain overlapping min and max ranges for a data column. In that case several data files may have to be opened for intra-file group scanning and intra-group row scanning, which increases I/O consumption. Overlapping min and max ranges across the data files of a partition can be avoided by presetting the sort order of the data columns or by using a batch task to rewrite the files within the partition. For equality queries and similar cases, if the metadata statistics match the equality query exactly, the equality predicate can also be filtered out when the residual predicate is generated. For example, if the query condition is 3 < id < 7 and 42 < age < 45 and addr = jiangsu, removing the partition column predicate 3 < id < 7 and the equality predicate addr = jiangsu yields the final residual predicate 42 < age < 45, which makes pruning more efficient.
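Residual predicate generation for the example query can be sketched as follows. The tuple encoding of conjuncts, the `residual_predicate` helper, and the rule that an equality predicate is dropped when the recorded min and max both equal the queried value are assumptions chosen to mirror the description, not a format defined by the source.

```python
def residual_predicate(conjuncts, partition_columns, stats):
    """Drop conjuncts already guaranteed by partition pruning or by metadata.

    conjuncts: list of (column, op, value) with op in {"range", "eq"}
    stats:     {column: (min, max)} for the file or group being scanned
    """
    residual = []
    for column, op, value in conjuncts:
        if column in partition_columns:
            continue  # handled by partition-level pruning
        if op == "eq" and stats.get(column) == (value, value):
            continue  # every row already has exactly this value
        residual.append((column, op, value))
    return residual


query = [
    ("id", "range", (3, 7)),
    ("age", "range", (42, 45)),
    ("addr", "eq", "jiangsu"),
]
# Hypothetical statistics in which addr holds the single value "jiangsu"
stats = {"age": (41, 50), "addr": ("jiangsu", "jiangsu")}

residual = residual_predicate(query, {"id"}, stats)
print(residual)
```

With these statistics only the age range remains, matching the residual predicate 42 < age < 45 in the example.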
The level-by-level pruning operations above achieve stepwise filtering: a data organization based on multi-layer metadata, which facilitates data management and querying; multi-layer pruning based on metadata information, which enables efficient predicate-filtered retrieval; and the generation of residual predicates together with predicate filtering of intra-file groups and of the row records within a group.
In one embodiment, acquiring snapshot data includes:
parsing the query operation to determine the query predicate;
acquiring the predicate set corresponding to each snapshot data in a snapshot library;
extracting the corresponding snapshot data when the query predicate exists in its predicate set.
The working principle and the beneficial effects of the technical scheme are as follows:
when snapshot data is stored in the snapshot library, a predicate set is built for each snapshot; the predicate set is the set of predicates associated with the data that the snapshot covers. Whether a snapshot is extracted is decided by matching the query predicate against the predicates in its predicate set, which allows the relevant snapshot data to be retrieved accurately.
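One way to picture the predicate-set lookup is shown below. It is only a sketch under assumptions: predicates are represented by column names, the snapshot library is a plain dictionary, and a snapshot is selected when any query predicate appears in its set; the source does not fix these details.

```python
def select_snapshots(snapshot_library, query_predicates):
    """Return the ids of snapshots whose predicate set contains at least
    one of the query predicates (assumed matching rule)."""
    wanted = set(query_predicates)
    return [snapshot_id
            for snapshot_id, predicate_set in snapshot_library.items()
            if wanted & predicate_set]


# Hypothetical snapshot library: snapshot id -> predicate set
snapshot_library = {
    "snapshot-1": {"id", "age"},
    "snapshot-2": {"addr"},
    "snapshot-3": {"price"},
}

selected = select_snapshots(snapshot_library, ["age", "addr"])
print(selected)
```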
In one embodiment, before outputting the final row data that satisfies the predicate conditions based on the snapshot data and the query operation, the method further comprises:
acquiring the user's historical query operation records;
analyzing the historical query operation records to determine a first association relation between the query predicate in each historical query record and the query predicate of the current query operation;
acquiring a second association relation between the snapshot data holding the query result of the query predicate in each historical query record and each other snapshot data in the snapshot library;
determining the priority of each acquired snapshot data based on the first association relation and the second association relation;
determining the query order of the acquired snapshot data in descending order of priority.
Analyzing the historical query operation records to determine the first association relation between the query predicate in each historical query record and the query predicate of the current query operation includes:
quantizing, based on a preset quantization template, the time difference between each historical query record and the current moment to obtain a first association parameter;
calculating a first similarity between the query predicate in each historical query record and the query predicate of the current query operation, and taking the first similarity as a second association parameter;
taking the first association parameter and the second association parameter as the first association relation.
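The two association parameters could be computed along the following lines. Everything specific here is an assumption: the quantization template is modeled as a list of hour thresholds, predicates are tokenized into column names, and cosine similarity over token counts is used for the first similarity (the description names cosine similarity only as one option for similarity calculation).

```python
import math
from bisect import bisect_right
from collections import Counter

# Hypothetical quantization template: thresholds in hours (1 h, 1 day, 1 week)
TEMPLATE_HOURS = [1, 24, 168]


def quantize(hours_elapsed):
    """Map a time difference to a level: 0 = under 1 h ... 3 = 1 week or more."""
    return bisect_right(TEMPLATE_HOURS, hours_elapsed)


def cosine(a_tokens, b_tokens):
    """Cosine similarity between two token lists, using token counts."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def first_association(hours_elapsed, history_tokens, current_tokens):
    """Return (first association parameter, second association parameter)."""
    return quantize(hours_elapsed), cosine(history_tokens, current_tokens)


print(first_association(30, ["id", "age"], ["id", "age", "addr"]))
```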
Determining the priority of each acquired snapshot data based on the first association relation and the second association relation includes:
looking up a preset weight table based on the first association relation to determine the weight of the query predicate corresponding to each historical query record; in the weight table, each combination of first association parameter and second association parameter is associated one-to-one with a weight; the weight table is constructed in advance by professionals through analysis of a large amount of data;
looking up a preset association value table based on the second association relation to determine the association value between the snapshot data holding the query result of each historical query record and the acquired snapshot data; in the association value table, each combination of third association parameter and fourth association parameter is associated one-to-one with an association value; the association value table is likewise constructed in advance by professionals through analysis of a large amount of data;
calculating the priority of the snapshot data from the weight of the query predicate corresponding to each historical query record and the association value between the snapshot data holding the query result of each historical query record and the acquired snapshot data, using the following formula:
[Formula image SMS_7; not reproduced in the extracted text]
In the formula, the symbol in image SMS_8 denotes the priority of the snapshot data; the symbol in image SMS_9 is the weight of the query predicate corresponding to the historical query record denoted by the symbol in image SMS_10; and the symbol in image SMS_11 is the association value between the snapshot data holding the query result of the historical query record denoted by the symbol in image SMS_12 and the acquired snapshot data.
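Because the formula itself survives only as an image placeholder, its exact form cannot be confirmed from the text. The sketch below assumes one plausible reading, a sum over the historical query records of weight times association value; this weighted-sum form, the function name, and all numbers are assumptions for illustration only.

```python
def snapshot_priority(weights, association_values):
    """Assumed form: priority = sum over records of weight * association value.

    weights[i]            -> weight of the query predicate of record i
    association_values[i] -> association value between the snapshot holding
                             record i's query result and the candidate snapshot
    """
    if len(weights) != len(association_values):
        raise ValueError("one weight and one association value per record")
    return sum(w * a for w, a in zip(weights, association_values))


# Hypothetical values for two historical query records and two candidates
weights = [0.5, 0.2]
priorities = {
    "snapshot-1": snapshot_priority(weights, [0.9, 0.1]),
    "snapshot-2": snapshot_priority(weights, [0.3, 0.8]),
}

# Query the candidate snapshots in descending order of priority
query_order = sorted(priorities, key=priorities.get, reverse=True)
print(query_order)
```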
The association relation between any two snapshot data in the snapshot library is determined as follows:
quantizing, based on a preset quantization template, the difference between the generation times of the two snapshot data to obtain a third association parameter;
calculating a second similarity between the predicate sets of the two snapshot data, and taking the second similarity as a fourth association parameter; the similarity can be calculated using cosine similarity;
taking the third association parameter and the fourth association parameter as the association relation between the two snapshot data.
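For predicate sets, cosine similarity has a simple closed form when each set is treated as a binary vector: the size of the intersection divided by the square root of the product of the set sizes. The sketch below assumes that binary-vector encoding and represents predicates by column names; both choices are illustrative.

```python
import math


def predicate_set_cosine(set_a, set_b):
    """Cosine similarity of two predicate sets viewed as binary vectors."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


# Hypothetical predicate sets of two snapshots
second_similarity = predicate_set_cosine({"id", "age"}, {"age", "addr"})
print(second_similarity)
```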
The working principle and the beneficial effects of the technical scheme are as follows:
the priority of each snapshot data is determined by comprehensively analyzing the historical query records, so that the current query result of the user is predicted based on the historical query records, and the retrieval efficiency is further improved.
The invention also provides a rapid retrieval system applied to a database, comprising:
a receiving module, used for receiving a query operation with predicates from a user;
an acquisition module, used for acquiring snapshot data;
an output module, used for outputting, based on the snapshot data and the query operation, the final row data that satisfies the predicate conditions.
The snapshot data is constructed through the following steps:
generating an initial snapshot after the table creation operation is performed;
generating a snapshot each time a batch of data is written and committed.
The snapshot includes a number of partition-level metadata records.
Each partition-level metadata record corresponds to a partition-level metadata file, whose partition-level statistics are generated by aggregating the file-level metadata records contained in that file.
Each file-level metadata record corresponds to one underlying data file.
In one embodiment, the output module outputs the final row data meeting the predicate conditions, based on the snapshot data and the query operation, by performing the following operations:
analyzing the query operation and determining the query predicate;
analyzing the snapshot data, and determining each partition-level metadata record and the partition column predicate of the corresponding partition-level metadata;
determining a partition-level metadata record based on the query predicate and the partition column predicate of the partition-level metadata, and taking the determined partition-level metadata record as a target partition;
analyzing the target partition, and determining each file-level metadata record and the partition column predicate of the corresponding file-level metadata;
determining a file-level metadata record based on the query predicate and the partition column predicate of the file-level metadata, and taking the determined file-level metadata record as a target file;
generating a first residual predicate based on the partition column predicate of the partition-level metadata of the target partition, the partition column predicate of the file-level metadata corresponding to the target file, and the query predicate;
analyzing the target file, and determining each group-level metadata and the second residual predicate of the corresponding group-level metadata;
determining group-level metadata based on the first residual predicate and the second residual predicate, and taking the determined group-level metadata as a target group;
analyzing the target group, and determining each row of data and the corresponding third residual predicate;
determining row data based on the third residual predicate and the first residual predicate, and taking the determined row data as the final row data and outputting it.
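The level-by-level filtering above can be sketched as follows. This is one possible reading, not the patented implementation: predicates are modeled as conjunctions of inclusive integer ranges, each metadata level carries assumed min/max `stats`, and the second and third residual predicates are collapsed into direct comparisons against the first residual predicate. All names (`overlaps`, `residual`, `search`, the dictionary keys) are hypothetical.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]       # inclusive [lo, hi]
Predicate = Dict[str, Range]  # column -> allowed range, all conditions ANDed


def overlaps(a: Range, b: Range) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def contained(inner: Range, outer: Range) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def may_match(stats: Dict[str, Range], pred: Predicate) -> bool:
    """False only when the statistics prove that no row can satisfy pred."""
    return all(overlaps(stats[c], r) for c, r in pred.items() if c in stats)


def residual(pred: Predicate, stats: Dict[str, Range]) -> Predicate:
    """Keep only the conditions that the metadata does not already guarantee."""
    return {c: r for c, r in pred.items() if c not in stats or not contained(stats[c], r)}


def row_matches(row: Dict[str, int], pred: Predicate) -> bool:
    return all(lo <= row[c] <= hi for c, (lo, hi) in pred.items())


def search(snapshot: dict, query: Predicate) -> List[dict]:
    rows: List[dict] = []
    for part in snapshot["partitions"]:
        if not may_match(part["stats"], query):
            continue  # partition pruned by its partition column statistics
        for f in part["files"]:
            if not may_match(f["stats"], query):
                continue  # file pruned
            # First residual: what is still unproven after partition and file metadata.
            first = residual(query, {**part["stats"], **f["stats"]})
            for group in f["groups"]:
                if not may_match(group["stats"], first):
                    continue  # group pruned by its statistics
                rows.extend(r for r in group["rows"] if row_matches(r, first))
    return rows
```

The design point the sketch tries to capture is that each level only evaluates what the level above could not already decide, so row-level comparison is reached for as few rows as possible.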
In one embodiment, the acquisition module acquires the snapshot data by performing the following operations:
analyzing the query operation and determining the query predicate;
acquiring the predicate set corresponding to each snapshot data in a snapshot library;
when the query predicate exists in a predicate set, extracting the corresponding snapshot data.
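A minimal sketch of this selection step follows. It assumes that a predicate set can be represented as a set of strings (for example, column names) and that a snapshot is extracted when at least one query predicate appears in its set; the text does not say whether "exists" means any or all, so that choice and the function name `select_snapshots` are assumptions.

```python
from typing import Dict, List, Set


def select_snapshots(query_predicates: Set[str], library: Dict[str, Set[str]]) -> List[str]:
    """Return the ids of snapshots whose predicate set shares a predicate with the query.

    library maps a snapshot id to its predicate set (assumed representation).
    """
    return [sid for sid, preds in library.items() if query_predicates & preds]
```

Requiring every query predicate to be present instead would only need `query_predicates <= preds` in place of the intersection test.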
In one embodiment, the quick search system applied to the database further comprises a sorting module.
The sorting module performs the following operations before the output module outputs the final row data conforming to the predicate conditions based on the snapshot data and the query operation:
acquiring a historical query operation record of a user;
analyzing the historical query operation records, and determining a first association relation between query predicates in each historical query record and query predicates of the current query operation;
acquiring second association relations between snapshot data corresponding to query results corresponding to query predicates in each historical query record and other snapshot data in a snapshot library;
determining the priority of each obtained snapshot data based on the first association relationship and the second association relationship;
determining the query order of the acquired snapshot data in descending order of priority.
Analyzing the historical query operation records and determining the first association relation between the query predicate in each historical query record and the query predicate of the current query operation comprises the following steps:
based on a preset quantization template, quantizing the time difference between each historical query record and the current moment to obtain a first association parameter;
calculating first similarity between query predicates in each historical query record and query predicates of the current query operation, and taking the first similarity as a second association parameter;
and taking the first association parameter and the second association parameter as a first association relation.
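The two association parameters above can be sketched as follows. The text does not disclose the preset quantization template or the similarity measure, so the bucket boundaries, the level values, and the use of Jaccard similarity over predicate sets are all assumptions chosen for illustration.

```python
from bisect import bisect_right
from typing import Set, Tuple

# Hypothetical quantization template: age boundaries in seconds and the level
# assigned to each bucket (more recent records get a higher level).
TEMPLATE_BOUNDS = [3600, 86400, 604800]  # 1 hour, 1 day, 1 week
TEMPLATE_LEVELS = [3, 2, 1, 0]


def quantize_age(age_seconds: float) -> int:
    """First association parameter: quantized time difference to the current moment."""
    return TEMPLATE_LEVELS[bisect_right(TEMPLATE_BOUNDS, age_seconds)]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Second association parameter: similarity of two predicate sets (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def first_association(record_time: float, record_predicates: Set[str],
                      now: float, current_predicates: Set[str]) -> Tuple[int, float]:
    """Return the pair (first association parameter, second association parameter)."""
    return quantize_age(now - record_time), jaccard(record_predicates, current_predicates)
```

For example, a record that is 100 seconds old and shares one of two distinct predicates with the current query yields the pair `(3, 0.5)` under this template.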
Determining the priority of each acquired snapshot data based on the first association relationship and the second association relationship comprises the following steps:
querying a preset weight table based on the first association relation, and determining the weight of the query predicate corresponding to each historical query record;
querying a preset association value table based on the second association relation, and determining the association value between the snapshot data where the query result corresponding to each historical query record is located and the acquired snapshot data;
based on the weights of the query predicates corresponding to each historical query record and the correlation values between the snapshot data where the query results corresponding to each historical query record are located and the obtained snapshot data, the priority of the snapshot data is calculated, and the calculation formula is as follows:
[Formula image: Figure SMS_13]
where [Figure SMS_14] denotes the priority of the snapshot data; [Figure SMS_15] is the weight of the query predicate corresponding to the [Figure SMS_16]-th historical query record; and [Figure SMS_17] is the association value between the snapshot data where the query result corresponding to the [Figure SMS_18]-th historical query record is located and the acquired snapshot data.
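The formula itself survives only as an image reference in this text, so its exact form cannot be recovered here. The sketch below assumes the simplest combination consistent with the variable definitions, a weighted sum of the association values over the historical query records; that form, and the function names `priority` and `query_order`, are assumptions rather than the patent's stated formula.

```python
from typing import Dict, List, Sequence


def priority(weights: Sequence[float], assoc_values: Sequence[float]) -> float:
    """Assumed form: sum over historical records of weight_i * association_value_i."""
    if len(weights) != len(assoc_values):
        raise ValueError("one weight and one association value per historical record")
    return sum(w * a for w, a in zip(weights, assoc_values))


def query_order(weights: Sequence[float],
                assoc_by_snapshot: Dict[str, Sequence[float]]) -> List[str]:
    """Order the acquired snapshots by descending priority."""
    scores = {sid: priority(weights, vals) for sid, vals in assoc_by_snapshot.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With weights `[0.5, 1.0]` and association values `[1.0, 0.0]` for one snapshot and `[0.0, 1.0]` for another, the second snapshot scores 1.0 against 0.5 and is queried first.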
The association relation between the snapshot data in the snapshot library is determined as follows:
based on a preset quantization template, quantizing the time difference of the generation time of the two snapshot data of the association relation to be determined, and obtaining a third association parameter;
calculating second similarity of predicate sets of two snapshot data of the association relation to be determined, and taking the second similarity as a fourth association parameter;
and taking the third association parameter and the fourth association parameter as the association relationship of the two snapshot data of the association relationship to be determined.
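One way to precompute these pairwise relations for a whole snapshot library is sketched below. The quantization boundaries, the Jaccard similarity over predicate sets, and the table layout keyed by unordered id pairs are assumptions; the text only states that a preset quantization template and a similarity are used.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple


def quantize_gap(gap_seconds: float,
                 bounds: Tuple[int, ...] = (3600, 86400),
                 levels: Tuple[int, ...] = (2, 1, 0)) -> int:
    """Third association parameter: quantized gap between two generation times."""
    for bound, level in zip(bounds, levels):
        if gap_seconds < bound:
            return level
    return levels[-1]


def association_table(
    snapshots: Dict[str, Tuple[float, Set[str]]]
) -> Dict[FrozenSet[str], Tuple[int, float]]:
    """Map each unordered snapshot pair to (third parameter, fourth parameter).

    snapshots maps an id to (generation time in seconds, predicate set).
    """
    table: Dict[FrozenSet[str], Tuple[int, float]] = {}
    for (a, (ta, pa)), (b, (tb, pb)) in combinations(snapshots.items(), 2):
        union = pa | pb
        similarity = len(pa & pb) / len(union) if union else 1.0
        table[frozenset((a, b))] = (quantize_gap(abs(ta - tb)), similarity)
    return table
```

Keying the table by `frozenset` makes the relation symmetric, so a lookup does not depend on the order in which the two snapshot ids are given.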
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (6)

1. A rapid search method applied to a database, comprising:
receiving query operation with predicates of a user;
obtaining snapshot data;
based on the snapshot data and the query operation, outputting final row data conforming to predicate conditions;
the snapshot data is constructed through the following steps:
after the table building operation is carried out, an initial snapshot is generated;
generating a snapshot after each batch of data is written and submitted;
wherein the snapshot comprises: the partition-level metadata records correspond to a partition-level metadata file, the partition-level metadata file is generated by the statistical result of file-level metadata records in the partition-level metadata file, and one file-level metadata record corresponds to a data file of a bottom layer;
the snapshot data acquisition comprises the following steps:
analyzing the query operation and determining a query predicate;
acquiring predicate sets corresponding to each snapshot data in a snapshot library;
when the query predicate exists in the predicate set, extracting corresponding snapshot data;
the outputting final row data meeting predicate conditions based on the snapshot data and the query operation includes:
analyzing the query operation and determining a query predicate;
analyzing the snapshot data, and determining partition column predicates of each partition-level metadata record and corresponding partition-level metadata;
determining a partition level metadata record based on the query predicate and partition column predicates of the partition level metadata, and taking the determined partition level metadata record as a target partition;
analyzing the target partition, and determining partition column predicates of each file-level metadata record and corresponding file-level metadata;
determining a file-level metadata record based on the partition column predicate of the file-level metadata and the query predicate, and taking the determined file-level metadata record as a target file;
generating a first residual predicate based on the partition column predicate of the partition level metadata of the target partition, the partition column predicate of the file level metadata corresponding to the target file, and the query predicate;
analyzing the target file, and determining each group-level metadata and a second residual predicate of the corresponding group-level metadata;
determining group-level metadata based on the first residual predicate and the second residual predicate; taking the group-level metadata as a target group;
analyzing the target group, and determining each row of data and a corresponding third residual predicate;
determining row data based on the third residual predicate and the first residual predicate; and taking the determined row data as final row data and outputting the final row data.
2. The quick search method applied to a database according to claim 1, further comprising, before outputting final row data conforming to a predicate condition based on the snapshot data and the query operation:
acquiring a historical query operation record of a user;
analyzing the historical query operation records, and determining a first association relation between query predicates in each historical query record and query predicates of the current query operation;
acquiring a second association relationship between snapshot data corresponding to query results corresponding to query predicates in each historical query record and other snapshot data in the snapshot library;
determining the priority of each obtained snapshot data based on the first association relationship and the second association relationship;
and determining the query sequence of the obtained snapshot data based on the sequence of the priorities from large to small.
3. The quick search method for a database of claim 2, wherein parsing the history query operation records, determining a first association of a query predicate in each history query record with a query predicate of a current query operation, comprises:
based on a preset quantization template, quantizing the time difference between each historical query record and the current moment to obtain a first association parameter;
calculating first similarity between query predicates in each historical query record and query predicates of current query operation, and taking the first similarity as a second association parameter;
and taking the first association parameter and the second association parameter as the first association relation.
4. The quick search method applied to a database according to claim 2, wherein determining the priority of each of the obtained snapshot data based on the first association relationship and the second association relationship comprises:
inquiring a preset weight table based on the first association relation, and determining weights of query predicates corresponding to each historical query record;
inquiring a preset association value table based on the second association relation, and determining association values between the snapshot data where the query results corresponding to each historical query record are located and the obtained snapshot data;
based on the weights of query predicates corresponding to each historical query record and the association values between the snapshot data where the query results corresponding to each historical query record are located and the obtained snapshot data, calculating the priority of the snapshot data, wherein the calculation formula is as follows:
[Formula image: Figure QLYQS_1]
where [Figure QLYQS_2] denotes the priority of the snapshot data; [Figure QLYQS_3] is the weight of the query predicate corresponding to the [Figure QLYQS_4]-th historical query record; and [Figure QLYQS_5] is the association value between the snapshot data where the query result corresponding to the [Figure QLYQS_6]-th historical query record is located and the obtained snapshot data.
5. The quick search method for databases according to claim 2, wherein the step of determining the association relationship between each of the snapshot data in the snapshot library is as follows:
based on a preset quantization template, quantizing the time difference of the generation time of the two snapshot data of the association relation to be determined, and obtaining a third association parameter;
calculating second similarity of predicate sets of the two snapshot data of the association relation to be determined, and taking the second similarity as a fourth association parameter;
and taking the third association parameter and the fourth association parameter as association relations of the two snapshot data of the association relation to be determined.
6. A rapid retrieval system for use with a database, comprising:
the receiving module is used for receiving query operation with predicates of a user;
the acquisition module is used for acquiring snapshot data;
the output module is used for outputting final row data conforming to predicate conditions based on the snapshot data and the query operation;
the snapshot data is constructed through the following steps:
after the table building operation is carried out, an initial snapshot is generated;
generating a snapshot after each batch of data is written and submitted;
wherein the snapshot comprises: the partition-level metadata records correspond to a partition-level metadata file, the partition-level metadata file is generated by the statistical result of file-level metadata records in the partition-level metadata file, and one file-level metadata record corresponds to a data file of a bottom layer;
an acquisition module, configured to acquire snapshot data, including: analyzing the query operation and determining a query predicate;
acquiring predicate sets corresponding to each snapshot data in a snapshot library;
when the query predicate exists in the predicate set, extracting corresponding snapshot data;
the outputting final row data meeting predicate conditions based on the snapshot data and the query operation includes:
analyzing the query operation and determining a query predicate;
analyzing the snapshot data, and determining partition column predicates of each partition-level metadata record and corresponding partition-level metadata;
determining a partition level metadata record based on the query predicate and partition column predicates of the partition level metadata, and taking the determined partition level metadata record as a target partition;
analyzing the target partition, and determining partition column predicates of each file-level metadata record and corresponding file-level metadata;
determining a file-level metadata record based on the partition column predicate of the file-level metadata and the query predicate, and taking the determined file-level metadata record as a target file;
generating a first residual predicate based on the partition column predicate of the partition level metadata of the target partition, the partition column predicate of the file level metadata corresponding to the target file, and the query predicate;
analyzing the target file, and determining each group-level metadata and a second residual predicate of the corresponding group-level metadata;
determining group-level metadata based on the first residual predicate and the second residual predicate; taking the group-level metadata as a target group;
analyzing the target group, and determining each row of data and a corresponding third residual predicate;
determining row data based on the third residual predicate and the first residual predicate; and taking the determined row data as final row data and outputting the final row data.
CN202310281123.6A 2023-03-22 2023-03-22 Quick search method and system applied to database Active CN116010668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310281123.6A CN116010668B (en) 2023-03-22 2023-03-22 Quick search method and system applied to database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310281123.6A CN116010668B (en) 2023-03-22 2023-03-22 Quick search method and system applied to database

Publications (2)

Publication Number Publication Date
CN116010668A CN116010668A (en) 2023-04-25
CN116010668B true CN116010668B (en) 2023-06-20

Family

ID=86037667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310281123.6A Active CN116010668B (en) 2023-03-22 2023-03-22 Quick search method and system applied to database

Country Status (1)

Country Link
CN (1) CN116010668B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10127251B2 (en) * 2015-07-09 2018-11-13 International Business Machines Corporation Organizing on-disk layout of index structures to support historical keyword search queries over temporally evolving data
CN104933190B (en) * 2015-07-10 2018-04-17 上海新炬网络信息技术股份有限公司 A kind of SQL statement performs frequency dynamic adjusting method
US11468035B2 (en) * 2017-05-12 2022-10-11 Sap Se Constraint data statistics
US11086868B2 (en) * 2019-10-29 2021-08-10 Oracle International Corporation Materialized view rewrite technique for one-sided outer-join queries
CN113094340A (en) * 2021-04-28 2021-07-09 杭州海康威视数字技术股份有限公司 Data query method, device and equipment based on Hudi and storage medium

Also Published As

Publication number Publication date
CN116010668A (en) 2023-04-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant