CN116010668B - Quick search method and system applied to database

Info

Publication number: CN116010668B
Application number: CN202310281123.6A
Authority: CN
Inventors: 简勇华
Original assignee: Beijing Deepexi Technology Co Ltd
Current assignee: Beijing Deepexi Technology Co Ltd
Priority date: 2023-03-22
Filing date: 2023-03-22
Publication date: 2023-06-20
Anticipated expiration: 2043-03-22
Also published as: CN116010668A

Abstract

The invention provides a quick search method and system applied to a database. The method comprises: receiving a user's query operation carrying predicates; obtaining snapshot data; and, based on the snapshot data and the query operation, outputting the final row data that satisfies the predicate conditions. The snapshot data is constructed as follows: an initial snapshot is generated after the table-creation operation, and a further snapshot is generated after each batch of data is written and committed. Each snapshot includes a number of partition-level metadata records. Each partition-level metadata record corresponds to a partition-level metadata file, which is generated from statistics of the file-level metadata records it contains, and each file-level metadata record corresponds to one underlying data file. The method enables data in an enterprise data lake to be searched so that the required data can be acquired quickly.

Description

Quick search method and system applied to database

Technical Field

The invention relates to the technical field of data retrieval, in particular to a rapid retrieval method and system applied to a database.

Background

The concept of the data lake arose from challenges that enterprises face in handling and storing data.

Initially, each application generated and stored large amounts of data that other applications could not use, which produced data islands. Data marts were then developed: the data generated by applications is stored in a centralized data repository, from which relevant data can be exported as needed and sent to the departments or individuals within the enterprise that require it.

However, data marts solve only part of the problem. The remaining problems, including data management, data ownership and access control, all need to be resolved as enterprises seek greater capacity to make use of valid data.

In order to solve the above-mentioned problems, enterprises have a strong demand for building their own data lakes, which can store not only conventional types of data but also any other types of data, and further process and analyze them thereon to produce final outputs for consumption by various programs.

A data lake is a large repository that stores a wide variety of raw data for an enterprise, where the data is available for access, processing, analysis, and transmission.

The data lake obtains raw data from multiple data sources of the enterprise, and for different purposes, the same piece of raw data may also have multiple copies of data that satisfy a particular internal model format. Thus, the data processed in the data lake may be any type of information, from structured data to completely unstructured data.

To make use of the data in a data lake, enterprise users need a rapid retrieval method that quickly acquires the data they need.

Disclosure of Invention

The invention aims to provide a rapid search method applied to a database that enables data in an enterprise data lake to be searched and the required data to be acquired quickly.

The embodiment of the invention provides a rapid search method applied to a database, which comprises the following steps:

receiving query operation with predicates of a user;

obtaining snapshot data;

based on the snapshot data and the query operation, outputting final row data conforming to predicate conditions;

the snapshot data is constructed through the following steps:

after the table building operation is carried out, an initial snapshot is generated;

a snapshot is generated after each batch of data is written and committed.

The snapshot includes a number of partition-level metadata records.

Each partition-level metadata record corresponds to a partition-level metadata file, and the partition-level metadata file is generated from statistics of the file-level metadata records it contains.

One file-level metadata record corresponds to one underlying data file.
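The metadata hierarchy described above can be sketched as follows. This is only an illustrative sketch: the class names, the field names, the file names and the choice to track a single integer column are assumptions, and the aggregation shown (min of mins, max of maxes, sum of null counts) is one plausible reading of "generated from statistics of the file-level metadata records".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileRecord:
    """File-level metadata record: statistics for one underlying data file."""
    data_file: str
    col_min: int
    col_max: int
    null_count: int


@dataclass
class PartitionRecord:
    """Partition-level metadata record summarizing its file-level records."""
    files: List[FileRecord]
    col_min: int
    col_max: int
    null_count: int


def summarize(files: List[FileRecord]) -> PartitionRecord:
    # Aggregate the per-file statistics into one partition-level summary.
    return PartitionRecord(
        files=files,
        col_min=min(f.col_min for f in files),
        col_max=max(f.col_max for f in files),
        null_count=sum(f.null_count for f in files),
    )


# Hypothetical file-level records for one partition-level metadata file.
record = summarize([
    FileRecord("file_a.data", 4, 5, 0),
    FileRecord("file_b.data", 5, 6, 1),
])
```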

Preferably, based on the snapshot data and the query operation, outputting final row data conforming to predicate conditions includes:

analyzing query operation and determining query predicates;

analyzing the snapshot data, and determining partition column predicates of each partition-level metadata record and corresponding partition-level metadata;

determining a partition level metadata record based on the query predicate and the partition column predicate of the partition level metadata, and taking the determined partition level metadata record as a target partition;

analyzing the target partition, and determining partition column predicates of each file-level metadata record and corresponding file-level metadata;

determining a file-level metadata record based on the partition column predicate and the query predicate of the file-level metadata, and taking the determined file-level metadata record as a target file;

generating a first residual predicate based on the partition column predicate of the partition level metadata of the target partition, the partition column predicate of the file level metadata corresponding to the target file, and the query predicate;

analyzing the target file, and determining each group-level metadata and a second residual predicate of the corresponding group-level metadata;

determining grouping level metadata based on the first residual predicate and the second residual predicate; taking the group-level metadata as a target group;

analyzing the target group, and determining each row of data and a corresponding third residual predicate;

determining row data based on the third residual predicate and the first residual predicate; and taking the determined row data as final row data and outputting the final row data.

Preferably, obtaining snapshot data includes:

analyzing query operation and determining query predicates;

acquiring predicate sets corresponding to each snapshot data in a snapshot library;

when a query predicate exists in the predicate set, corresponding snapshot data is extracted.

Preferably, before outputting the final row data meeting the predicate condition based on the snapshot data and the query operation, the method further comprises:

acquiring a historical query operation record of a user;

analyzing the historical query operation records, and determining a first association relation between query predicates in each historical query record and query predicates of the current query operation;

acquiring second association relations between snapshot data corresponding to query results corresponding to query predicates in each historical query record and other snapshot data in a snapshot library;

determining the priority of each obtained snapshot data based on the first association relationship and the second association relationship;

determining the query order of the obtained snapshot data in descending order of priority.

Preferably, analyzing the historical query operation records, determining a first association relationship between the query predicates in each historical query record and the query predicates of the current query operation, including:

based on a preset quantization template, quantizing the time difference between each historical query record and the current moment to obtain a first association parameter;

calculating first similarity between query predicates in each historical query record and query predicates of the current query operation, and taking the first similarity as a second association parameter;

and taking the first association parameter and the second association parameter as a first association relation.
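The first association relation above can be illustrated with a short sketch. The quantization template is not specified in the text, so the hour thresholds below are hypothetical, as are the function names; the similarity is passed in as an already computed value.

```python
import bisect
from typing import List, Tuple

# Hypothetical quantization template: inclusive upper bounds, in hours,
# of each time-difference level. The real template is not given in the text.
TEMPLATE_HOURS: List[float] = [1, 24, 168]


def quantize_age(hours_since_query: float) -> int:
    """Map a time difference to a discrete level (0 = most recent)."""
    return bisect.bisect_left(TEMPLATE_HOURS, hours_since_query)


def first_association(hours_since_query: float,
                      similarity: float) -> Tuple[int, float]:
    """Pair the quantized time difference with the predicate similarity."""
    return quantize_age(hours_since_query), similarity


# A historical query issued 5 hours ago whose predicate similarity is 0.8.
relation = first_association(5.0, 0.8)
```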

Preferably, determining the priority of each obtained snapshot data based on the first association relationship and the second association relationship includes:

inquiring a preset weight table based on the first association relation, and determining the weight of the query predicates corresponding to each historical query record;

inquiring a preset association value table based on the second association relation, and determining association values between snapshot data where query results corresponding to each historical query record are located and the obtained snapshot data;

based on the weights of the query predicates corresponding to each historical query record and the correlation values between the snapshot data where the query results corresponding to each historical query record are located and the obtained snapshot data, the priority of the snapshot data is calculated, and the calculation formula is as follows:

$$P = \sum_{i=1}^{n} w_i \, r_i$$

where $P$ denotes the priority of the snapshot data; $w_i$ is the weight of the query predicate corresponding to the $i$-th historical query record; $r_i$ is the association value between the snapshot data containing the query result of the $i$-th historical query record and the obtained snapshot data; and $n$ is the number of historical query records.
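As an illustration of the priority calculation and the resulting query order, the following sketch assumes that the priority is the sum, over the historical query records, of the weight multiplied by the association value. That combination rule, the snapshot names and all numeric values are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def priority(pairs: List[Tuple[float, float]]) -> float:
    """Sum of weight * association value over the historical query records."""
    return sum(w * r for w, r in pairs)


# Hypothetical (weight, association value) pairs, one pair per historical
# query record, for each obtained snapshot.
candidates: Dict[str, List[Tuple[float, float]]] = {
    "snapshot_1": [(0.5, 1.0), (0.25, 2.0)],
    "snapshot_2": [(0.5, 4.0), (0.25, 0.0)],
    "snapshot_3": [(0.5, 0.0), (0.25, 2.0)],
}

# Query the snapshots in descending order of priority.
query_order = sorted(candidates,
                     key=lambda s: priority(candidates[s]),
                     reverse=True)
```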

Preferably, the determining step of the association relation between each snapshot data in the snapshot library is as follows:

based on a preset quantization template, quantizing the time difference of the generation time of the two snapshot data of the association relation to be determined, and obtaining a third association parameter;

calculating second similarity of predicate sets of two snapshot data of the association relation to be determined, and taking the second similarity as a fourth association parameter;

and taking the third association parameter and the fourth association parameter as the association relationship of the two snapshot data of the association relationship to be determined.

The invention also provides a rapid retrieval system applied to the database, comprising:

the receiving module is used for receiving query operation with predicates of a user;

the acquisition module is used for acquiring snapshot data;

the output module is used for outputting final row data conforming to predicate conditions based on the snapshot data and the query operation;

the snapshot data is constructed through the following steps:

a snapshot is generated after each batch of data is written and committed.

Wherein, the snapshot includes: a number of partition level metadata records.

One file-level metadata record corresponds to one underlying data file.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:

FIG. 1 is a schematic diagram of a fast search method applied to a database according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating snapshot generation according to an embodiment of the present invention;

FIG. 3 is a first level block diagram of a snapshot in an embodiment of the present invention;

FIG. 4 is a first level block diagram of yet another snapshot in an embodiment of the present invention;

FIG. 5 is a schematic diagram of intra-file grouping in an embodiment of the invention;

FIG. 6 is a diagram illustrating a search step according to an embodiment of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.

The embodiment of the invention provides a quick search method applied to a database, which is shown in fig. 1 and comprises the following steps:

step S1: receiving query operation with predicates of a user;

step S2: obtaining snapshot data;

step S3: based on the snapshot data and the query operation, final row data conforming to predicate conditions is output.

The snapshot data is constructed through the following steps:

A snapshot is generated after each batch of data is written and committed. As shown in fig. 2, after the table-creation operation is performed, an initial snapshot carrying metadata such as table structure information, table partition information and table attribute fields is generated. A first snapshot of the table is generated after the first batch of data is written and committed, a second snapshot of the table is generated after the second batch of data is written and committed, … Commit-based snapshots serve as the basic storage method on which multi-layer pruning retrieval is performed.
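The snapshot life cycle can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that each snapshot lists every data file committed so far, which the text does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: Tuple[str, ...]  # assumed: all files visible in this snapshot


@dataclass
class Table:
    snapshots: List[Snapshot] = field(default_factory=list)

    def create(self) -> Snapshot:
        # The table-creation operation produces the initial, empty snapshot.
        snap = Snapshot(0, ())
        self.snapshots.append(snap)
        return snap

    def commit(self, new_files: Iterable[str]) -> Snapshot:
        # Each committed batch produces a new snapshot on top of the last one.
        prev = self.snapshots[-1]
        snap = Snapshot(prev.snapshot_id + 1,
                        prev.data_files + tuple(new_files))
        self.snapshots.append(snap)
        return snap


t1 = Table()
t1.create()
t1.commit(["file1"])
t1.commit(["file2", "file3"])
```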

The snapshot includes a number of partition-level metadata records. The metadata of adjacent partitions is organized into the same partition-level metadata record (one record covers several adjacent partitions), and the min, max and null values of the partition column are recorded so that the partition-column predicates in the query predicate can be filtered. As shown in fig. 3, a table t1 is created with three columns, (id, int), (age, int) and (addr, varchar), with the id column as the partition column. After a number of records with id = 1 to 9 are inserted, three partition-level metadata records are recorded in the snapshot, corresponding to partitions 1 to 3, 4 to 6 and 7 to 9 respectively. For a query with a predicate such as 3 < id < 7, the min and max values of the id column in the partition-level metadata records locate the second partition-level metadata record, which completes the partition-level pruning operation.
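The partition-level pruning in the t1 example can be sketched as a range-overlap test on the recorded min and max values. The class and function names are hypothetical; the three records and the predicate 3 < id < 7 are taken from the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartitionRecord:
    name: str
    id_min: int
    id_max: int


def prune_partitions(records: List[PartitionRecord],
                     low: int, high: int) -> List[PartitionRecord]:
    """Keep records whose [id_min, id_max] may hold an id with low < id < high."""
    return [r for r in records if r.id_max > low and r.id_min < high]


records = [
    PartitionRecord("partitions 1-3", 1, 3),
    PartitionRecord("partitions 4-6", 4, 6),
    PartitionRecord("partitions 7-9", 7, 9),
]

# Predicate 3 < id < 7 keeps only the second partition-level record.
targets = prune_partitions(records, 3, 7)
```

The test is conservative: a record is kept whenever its range could contain a matching value, so pruning never discards data that satisfies the predicate.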

The partition-level metadata record corresponds to a partition-level metadata file, and the partition-level metadata file is generated from statistics of the file-level metadata records it contains. A file-level metadata record contains statistical information about an actually generated data file, including the min, max and null values of each non-partition column of the table within that data file, and is used to filter the data-column predicates in the query predicate. As shown in fig. 4, for a query with predicates such as 3 < id < 7 and 42 < age < 45, the min and max values of the partition column id in the partition-level metadata records first locate the second partition-level metadata record. The partition-level metadata file corresponding to that record is then opened, and file-level filtering is performed using the file-level metadata it contains. The file-level metadata record for file 4 shows min and max values of (41, 50) for the non-partition column age, so file 4 is retained. Predicate filtering on the non-partition column is thus performed using the file-level metadata records, which completes the file-level pruning operation.
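File-level pruning applies the same overlap test to a non-partition column. In the sketch below only the (41, 50) range of file 4 comes from the example; the other two files and their ranges are hypothetical values added so that something is pruned.

```python
from typing import Dict, List, Tuple


def may_match(col_min: int, col_max: int, low: int, high: int) -> bool:
    """True if [col_min, col_max] may hold a value v with low < v < high."""
    return col_max > low and col_min < high


# (min, max) of the age column per data file in the target partition.
# file4 is from the example; file5 and file6 are hypothetical.
age_stats: Dict[str, Tuple[int, int]] = {
    "file4": (41, 50),
    "file5": (20, 30),
    "file6": (60, 70),
}

# Predicate 42 < age < 45 keeps only the files whose range overlaps it.
target_files: List[str] = [
    name for name, (lo, hi) in age_stats.items()
    if may_match(lo, hi, 42, 45)
]
```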

One file-level metadata record corresponds to one underlying data file. Data within a data file can be stored in groups, with statistical meta-information generated for each group. After the file-level records have been filtered as described above, the corresponding data file is opened, the meta-information of each group is obtained, and group-level filtering is performed according to the predicates. As shown in fig. 5, for predicate filtering such as 3 < id < 7, 42 < age < 45 and addr = jiangsu, group 1 in file 4 can be selected by reading the statistics of the groups in the file, which completes filtering at the level of groups within a file. It should be noted that this level of pruning filters with the residual predicate (the filtering condition that remains after the partition-column predicates are removed from the query predicate); the partition-column predicate 3 < id < 7 is no longer needed, because all data rows in a file belong to one partition.
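Group-level pruning with the residual predicate 42 < age < 45 and addr = jiangsu can be sketched as follows. The per-group statistics are hypothetical, and the sketch assumes that string columns also record min and max values so that an equality predicate can be tested against the range.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroupStats:
    name: str
    age_min: int
    age_max: int
    addr_min: str
    addr_max: str


def group_may_match(g: GroupStats, age_low: int, age_high: int,
                    addr: str) -> bool:
    # Range predicate: the group's age range must overlap (age_low, age_high).
    age_ok = g.age_max > age_low and g.age_min < age_high
    # Equality predicate: the value must lie inside the group's addr range.
    addr_ok = g.addr_min <= addr <= g.addr_max
    return age_ok and addr_ok


# Hypothetical statistics for the groups stored in file 4.
groups: List[GroupStats] = [
    GroupStats("group1", 41, 45, "anhui", "jiangsu"),
    GroupStats("group2", 46, 50, "shanghai", "zhejiang"),
]

target_groups = [g.name for g in groups
                 if group_may_match(g, 42, 45, "jiangsu")]
```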

In one embodiment, outputting final row data that meets predicate conditions based on snapshot data and a query operation, includes:

analyzing query operation and determining query predicates;

The working principle and the beneficial effects of the technical scheme are as follows:

As shown in fig. 6, after the groups in the data file have been filtered, the target group is opened, all data rows in the group are obtained, and final row-level filtering is performed according to the predicates. Once this level of filtering is complete, the data finally queried is obtained.

It should be noted that this level of pruning also filters with the residual predicate: in effect, 42 < age < 45 and addr = jiangsu remove the rows with age = 40, 41, 42 and 45 from group 1 in file 4.
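The final row-level filter evaluates the residual predicate on every row of the target group. The rows below are hypothetical; they are chosen so that the rows with age = 40, 41, 42 and 45 are removed, as in the example.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[int, str]]

# Hypothetical rows of the target group (all belong to the same partition).
rows: List[Row] = [
    {"id": 4, "age": 40, "addr": "jiangsu"},
    {"id": 4, "age": 41, "addr": "jiangsu"},
    {"id": 5, "age": 42, "addr": "jiangsu"},
    {"id": 5, "age": 43, "addr": "jiangsu"},
    {"id": 6, "age": 44, "addr": "jiangsu"},
    {"id": 6, "age": 45, "addr": "jiangsu"},
]


def residual_predicate(row: Row) -> bool:
    # 42 < age < 45 and addr = jiangsu; the id predicate is already satisfied.
    return 42 < row["age"] < 45 and row["addr"] == "jiangsu"


final_rows = [r for r in rows if residual_predicate(r)]
```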

In addition, the multi-layer metadata organization of the table may not fully match the predicate filtering conditions. For example, filtering only on the addr column performs poorly because partition-level pruning cannot be applied; the partition columns should therefore be chosen at table-creation time according to the predicate filtering conditions that are used most often. Several data files may also have overlapping min and max ranges for a data column, in which case several data files must be opened for intra-file group scanning and intra-group row scanning, which increases I/O consumption. This overlap among the data files within a partition can be avoided by presetting the sort order of the data columns or by rewriting the files within the partition with a batch task. For equality queries and the like, the equality predicate can also be removed when generating the residual predicate if the metadata statistics match the equality query exactly. If the query condition is 3 < id < 7 and 42 < age < 45 and addr = jiangsu, removing the partition-column predicate 3 < id < 7 and the equality predicate addr = jiangsu yields the final residual predicate 42 < age < 45, which allows more efficient pruning.
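Residual-predicate generation for the example query can be sketched by treating the query as a list of conjuncts. The representation and the function name are hypothetical; which columns have statistics that exactly match an equality predicate is supplied by the caller.

```python
from typing import List, NamedTuple, Set


class Conjunct(NamedTuple):
    column: str
    op: str
    value: object


def residual(query: List[Conjunct], partition_columns: Set[str],
             exact_eq_columns: Set[str]) -> List[Conjunct]:
    """Drop conjuncts already decided by the metadata levels above.

    partition_columns: columns handled by partition-level pruning.
    exact_eq_columns: columns whose statistics exactly match an equality
    predicate, so that predicate need not be re-evaluated.
    """
    return [
        c for c in query
        if c.column not in partition_columns
        and not (c.op == "=" and c.column in exact_eq_columns)
    ]


# 3 < id < 7 and 42 < age < 45 and addr = jiangsu
query = [
    Conjunct("id", ">", 3), Conjunct("id", "<", 7),
    Conjunct("age", ">", 42), Conjunct("age", "<", 45),
    Conjunct("addr", "=", "jiangsu"),
]

remaining = residual(query, {"id"}, {"addr"})
```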

Through the step-by-step pruning operations above, filtering is performed level by level. This achieves a data organization based on multi-layer metadata that facilitates data management and query; multi-layer pruning based on metadata information for efficient predicate-filtered retrieval; and generation of residual predicates together with predicate filtering of the groups within a file and of the rows within a group.

In one embodiment, obtaining snapshot data includes:

analyzing query operation and determining query predicates;

when snapshot data is stored in the snapshot library, a predicate set is built for each snapshot; the predicate set is the set of predicates corresponding to the data covered by the snapshot. Whether to extract a snapshot is determined by matching the query predicate against the predicates in its predicate set, which achieves accurate retrieval of snapshot data.
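Snapshot selection by predicate set can be sketched as a membership test. The text does not define how a predicate is encoded, so the sketch assumes each predicate is identified by the column it filters on, and it assumes a snapshot is extracted when at least one query predicate appears in its predicate set. The snapshot names and sets are hypothetical.

```python
from typing import Dict, List, Set


def select_snapshots(predicate_sets: Dict[str, Set[str]],
                     query_predicates: Set[str]) -> List[str]:
    """Return snapshots whose predicate set contains any query predicate."""
    return [
        snapshot_id for snapshot_id, predicates in predicate_sets.items()
        if predicates & query_predicates
    ]


# Hypothetical snapshot library: snapshot id -> predicate set.
library: Dict[str, Set[str]] = {
    "snapshot_1": {"id", "age"},
    "snapshot_2": {"addr"},
    "snapshot_3": {"age", "addr"},
}

selected = select_snapshots(library, {"age"})
```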

In one embodiment, before outputting final row data meeting predicate conditions based on the snapshot data and the query operation, further comprising:

acquiring a historical query operation record of a user;

The method comprises the steps of analyzing historical query operation records, determining a first association relation between query predicates in each historical query record and query predicates of current query operation, and comprising the following steps:

Determining the priority of each obtained snapshot data based on the first association relationship and the second association relationship includes:

inquiring a preset weight table based on the first association relation, and determining the weight of the query predicates corresponding to each historical query record; the first association parameter and the second association parameter in the weight table are associated with the weights in a one-to-one correspondence manner; the weight table is constructed by a professional through a large amount of data analysis in advance;

inquiring a preset association value table based on the second association relation, and determining association values between snapshot data where query results corresponding to each historical query record are located and the obtained snapshot data; the third association parameter, the fourth association parameter and the association value in the association value table are associated in one-to-one correspondence; the association value table is also constructed by a professional through a large amount of data analysis in advance;

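$$P = \sum_{i=1}^{n} w_i \, r_i$$

where $P$ denotes the priority of the snapshot data; $w_i$ is the weight of the query predicate corresponding to the $i$-th historical query record; $r_i$ is the association value between the snapshot data containing the query result of the $i$-th historical query record and the obtained snapshot data; and $n$ is the number of historical query records.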

The determining step of the association relation between each snapshot data in the snapshot library is as follows:

calculating second similarity of predicate sets of two snapshot data of the association relation to be determined, and taking the second similarity as a fourth association parameter; the similarity can be calculated by adopting a cosine similarity calculation method;

the priority of each snapshot data is determined by comprehensively analyzing the historical query records, so that the current query result of the user is predicted based on the historical query records, and the retrieval efficiency is further improved.
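The cosine similarity mentioned above for predicate sets can be sketched by treating each set as a binary vector, in which case the similarity reduces to the size of the intersection divided by the square root of the product of the set sizes. The encoding of predicates as strings is an assumption.

```python
import math
from typing import Set


def cosine_similarity(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity of two sets viewed as binary indicator vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical predicate sets of two snapshots.
second_similarity = cosine_similarity({"id", "age"}, {"id"})
```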

the acquisition module is used for acquiring snapshot data;

and the output module is used for outputting final row data conforming to predicate conditions based on the snapshot data and the query operation.

The snapshot data is constructed through the following steps:

a snapshot is generated after each batch of data is written and committed.

Wherein, the snapshot includes: a number of partition level metadata records.

One file-level metadata record corresponds to one underlying data file.

In one embodiment, the output module outputs final row data meeting predicate conditions based on the snapshot data and the query operation, performing the following:

analyzing query operation and determining query predicates;

In one embodiment, the acquisition module acquires snapshot data, performing the following:

analyzing query operation and determining query predicates;

In one embodiment, the rapid retrieval system applied to the database further comprises a sorting module;

the sorting module performs the following operations before the output module outputs final row data conforming to predicate conditions based on the snapshot data and the query operation:

acquiring a historical query operation record of a user;

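$$P = \sum_{i=1}^{n} w_i \, r_i$$

where $P$ denotes the priority of the snapshot data; $w_i$ is the weight of the query predicate corresponding to the $i$-th historical query record; $r_i$ is the association value between the snapshot data containing the query result of the $i$-th historical query record and the obtained snapshot data; and $n$ is the number of historical query records.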

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A rapid search method applied to a database, comprising:

receiving query operation with predicates of a user;

obtaining snapshot data;

the snapshot data is constructed through the following steps:

generating a snapshot after each batch of data is written and submitted;

wherein the snapshot comprises: the partition-level metadata records correspond to a partition-level metadata file, the partition-level metadata file is generated by the statistical result of file-level metadata records in the partition-level metadata file, and one file-level metadata record corresponds to a data file of a bottom layer;

the snapshot data acquisition comprises the following steps:

analyzing the query operation and determining a query predicate;

when the query predicate exists in the predicate set, extracting corresponding snapshot data;

the outputting final row data meeting predicate conditions based on the snapshot data and the query operation includes:

analyzing the query operation and determining a query predicate;

determining a partition level metadata record based on the query predicate and partition column predicates of the partition level metadata, and taking the determined partition level metadata record as a target partition;

determining a file-level metadata record based on the partition column predicate of the file-level metadata and the query predicate, and taking the determined file-level metadata record as a target file;

determining packet-level metadata based on the first residual predicate and the second residual predicate; taking the group-level metadata as a target group;

2. The quick search method applied to a database according to claim 1, further comprising, before outputting final row data conforming to a predicate condition based on the snapshot data and the query operation:

acquiring a historical query operation record of a user;

acquiring a second association relationship between snapshot data corresponding to query results corresponding to query predicates in each historical query record and other snapshot data in the snapshot library;

and determining the query sequence of the obtained snapshot data based on the sequence of the priorities from large to small.

3. The quick search method for a database of claim 2, wherein parsing the history query operation records, determining a first association of a query predicate in each history query record with a query predicate of a current query operation, comprises:

calculating first similarity between query predicates in each historical query record and query predicates of current query operation, and taking the first similarity as a second association parameter;

and taking the first association parameter and the second association parameter as the first association relation.

4. The quick search method applied to a database according to claim 2, wherein determining the priority of each of the obtained snapshot data based on the first association relationship and the second association, comprises:

inquiring a preset weight table based on the first association relation, and determining weights of query predicates corresponding to each historical query record;

inquiring a preset association value table based on the second association relation, and determining association values between the snapshot data where the query results corresponding to each historical query record are located and the obtained snapshot data;

based on the weights of query predicates corresponding to each historical query record and the association values between the snapshot data where the query results corresponding to each historical query record are located and the obtained snapshot data, calculating the priority of the snapshot data, wherein the calculation formula is as follows:

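$$P = \sum_{i=1}^{n} w_i \, r_i$$

wherein $P$ represents the priority of the snapshot data; $w_i$ is the weight of the query predicate corresponding to the $i$-th historical query record; $r_i$ is the association value between the snapshot data where the query result corresponding to the $i$-th historical query record is located and the obtained snapshot data; and $n$ is the number of historical query records.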

5. The quick search method for databases according to claim 2, wherein the step of determining the association relationship between each of the snapshot data in the snapshot library is as follows:

calculating second similarity of predicate sets of the two snapshot data of the association relation to be determined, and taking the second similarity as a fourth association parameter;

and taking the third association parameter and the fourth association parameter as association relations of the two snapshot data of the association relation to be determined.

6. A rapid retrieval system for use with a database, comprising:

the acquisition module is used for acquiring snapshot data;

the snapshot data is constructed through the following steps:

generating a snapshot after each batch of data is written and submitted;

an acquisition module, configured to acquire snapshot data, including: analyzing the query operation and determining a query predicate;

analyzing the query operation and determining a query predicate;