CN108416027B - Merged data fragmentation optimization method based on range query boundary set - Google Patents
- Publication number
- CN108416027B (application CN201810194425.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- cost
- query
- slice
- deviation
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a bottom-up merging data fragmentation optimization method based on a range query boundary set, comprising the following steps: 1) establishing a data access probability model under a range query load; 2) initializing a fragmentation scheme P with the range query boundary set; 3) calculating the cost deviation $F_c$ caused by merging each pair of adjacent data slices; 4) traversing the cost deviation array, finding the minimum cost deviation, and merging the two corresponding adjacent data slices; 5) recalculating the two cost deviations $F_c$ in the array that are affected by the merge in step 4); 6) jumping to step 4) and repeating the data slice merge until the optimal number of data slices is reached. The method reduces the management and maintenance cost of the data, obtains the optimal data query cost, and improves query efficiency.
Description
Technical Field
The invention relates to data fragmentation optimization for big data under skewed range query workloads, and in particular to a bottom-up merging data fragmentation optimization method based on a range query boundary set.
Background
Data items are correlated with one another. Data skew means that these correlations follow certain patterns, and discovering and exploiting such patterns is an effective approach to query optimization. Under a skewed range query workload, certain consecutive records are often hit together by range queries on some attribute of the data. From a data management perspective, records that are frequently hit together can be treated as a whole and identified by a single piece of metadata; they are then read or skipped as a whole during a query, which greatly reduces the management and maintenance cost of those records. To obtain optimal range query performance, the optimal positions at which to slice the data must lie on range query boundaries, because neighboring data that are never separated by any range query should be treated as a whole and stored in the same data slice.
Disclosure of Invention
The invention aims to provide an efficient optimized fragmentation method for a data set that addresses the deficiencies of the prior art. The method uses the range query boundary set as the initial fragmentation of the data and efficiently reaches the optimal fragmentation through bottom-up merging, thereby reducing the management and maintenance cost of the data as well as the positioning addressing cost and the transmission cost of data queries, and improving query efficiency.
The technical scheme for realizing the purpose of the invention is as follows:
A bottom-up merging data fragmentation optimization method based on a range query boundary set, which differs from the prior art, comprises the following steps:
1) establishing a data access probability model under a range query load: the set of all boundaries of the range queries defined on the data set is called the range query boundary set. Under a record-based data organization, the query cumulative probability of a data record is the number of times the record is accessed by the query load divided by the total number of queries. Under a slice-based data organization, let the k-th data slice $DS_k$ have length $l_k$ and query cumulative probability $P_k$. Because an access to any record in $DS_k$ is carried out as an access to the whole data slice $DS_k$, the value of $P_k$ is the maximum query cumulative probability of the data records contained in $DS_k$. The query cost on data slice $DS_k$ is expressed as:
query cost of $DS_k$ = positioning addressing cost + data transmission cost, that is,
$$\text{query cost of } DS_k = (\text{cost of one disk positioning addressing}) \times P_k + l_k \times (\text{transmission cost per byte of data}) \times P_k$$
After data is fragmented, queries may produce "false hits": part of the data in a slice does not belong to the query result set but is accessed anyway, which brings extra transmission overhead. This extra transmission overhead is defined as the cost deviation and is denoted by $F_c$. The coarser the slice granularity, the smaller the positioning addressing cost of a data query, but the larger the transmission cost deviation and hence the data transmission cost; conversely, the finer the slice granularity, the larger the positioning addressing cost and the smaller the data transmission cost. The positioning addressing cost and the data transmission cost are thus two mutually constraining indexes, so data fragmentation under a skewed range query workload is an optimization problem;
2) initializing a fragmentation scheme P with the range query boundary set: assuming the range query boundary set contains $B$ distinct elements, the data set is initialized into $B-1$ data slices;
3) calculating the cost deviation caused by merging each pair of adjacent data slices: $F_c(DS_1, DS_2)$, $F_c(DS_2, DS_3)$, …, $F_c(DS_{i-1}, DS_i)$, $F_c(DS_i, DS_{i+1})$, …, $F_c(DS_{B-2}, DS_{B-1})$;
4) traversing the cost deviation array, finding the minimum cost deviation, and merging the two corresponding adjacent data slices: if $F_c(DS_i, DS_{i+1})$ is the minimum among the cost deviations, data slices $DS_i$ and $DS_{i+1}$ are merged, and the data slices after merging are $DS_1, …, DS_i, DS_{i+2}, …, DS_{B-1}$;
5) updating the two values in the cost deviation array that are affected by the merge in step 4) by recalculating $F_c$: for example, when $DS_i$ and $DS_{i+1}$ are merged into a new $DS_i$, the entries $F_c(DS_{i-1}, DS_i)$ and $F_c(DS_{i+1}, DS_{i+2})$ must be recalculated using the new $DS_i$;
6) jumping to step 4) and repeating the data slice merge until the optimal number of data slices is reached: each cost deviation $F_c(DS_i, DS_{i+1})$ can be computed in constant time, i.e. within $O(1)$. The first round computes $B-1$ cost deviations, and each later round computes the 2 cost deviations adjacent to the merged data slice. $B-K$ rounds are executed until $K$ data slices remain, so the total computation cost is $(B-1)+2(B-K)$, where $B$ is the cardinality of the range query boundary set and $K$ is the number of data slices in the final fragmentation.
From step 6), the total computation cost is $(B-1)+2(B-K)$; omitting the constant terms, the time complexity of the method of this technical scheme is $O(B)$.
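The cost model of step 1) and the merge deviation of step 3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the text gives no closed form for $F_c$, so the sketch assumes it is the extra transmission cost after a merge (merged transmission cost minus the two separate transmission costs). The names `Slice`, `query_cost`, `merge`, `cost_deviation` and the constants `SEEK_COST` and `BYTE_COST` are hypothetical.

```python
from dataclasses import dataclass

SEEK_COST = 10.0   # assumed cost of one disk positioning addressing operation
BYTE_COST = 0.01   # assumed transmission cost per byte of data


@dataclass
class Slice:
    length: int    # l_k, in bytes
    prob: float    # P_k, query cumulative probability of the slice


def transmission_cost(s: Slice) -> float:
    """Data transmission part of the query cost: l_k * byte cost * P_k."""
    return s.length * BYTE_COST * s.prob


def query_cost(s: Slice) -> float:
    """Positioning addressing cost plus data transmission cost of one slice."""
    return SEEK_COST * s.prob + transmission_cost(s)


def merge(a: Slice, b: Slice) -> Slice:
    """Merged slice: lengths add up, P is the maximum of the two (step 1)."""
    return Slice(a.length + b.length, max(a.prob, b.prob))


def cost_deviation(a: Slice, b: Slice) -> float:
    """Assumed F_c: extra transmission cost introduced by merging a and b."""
    return transmission_cost(merge(a, b)) - transmission_cost(a) - transmission_cost(b)
```

Under this assumed definition, the deviation is zero when both slices have the same query cumulative probability and grows as the less frequently accessed slice gets longer, which matches the "false hit" intuition described in step 1).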
Under a skewed range query workload, the data fragmentation should adapt to the access pattern of the range queries as closely as possible in order to reduce the data transmission cost deviation. If a slicing position does not lie on a range query boundary, it introduces a needless transmission cost deviation, so the optimal slicing positions always lie on range query boundaries. Based on this conclusion, slicing positions are searched only among the boundary points of the range queries; this is the data fragmentation optimization method based on the range query boundary set. Accordingly, step 2) of the technical scheme first initializes the data slices with the range query boundary set, and adjacent data slices are then merged iteratively, each time selecting the adjacent pair whose merge produces the minimum cost deviation.
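A minimal sketch of the initialization with the range query boundary set is given below. It assumes each query is a half-open key range `(lo, hi)` and that slice length is measured in key units; the function names and the dictionary layout are hypothetical and only serve the illustration.

```python
def boundary_set(queries):
    """Distinct boundaries of all range queries, sorted in ascending order."""
    return sorted({b for lo, hi in queries for b in (lo, hi)})


def initial_slices(queries):
    """Cut the data at every query boundary: B boundaries give B-1 slices."""
    bounds = boundary_set(queries)
    slices = []
    for lo, hi in zip(bounds, bounds[1:]):
        # No query boundary falls strictly inside (lo, hi), so every record
        # of this slice is hit by exactly the same queries.
        hits = sum(1 for q_lo, q_hi in queries if q_lo <= lo and hi <= q_hi)
        slices.append({"lo": lo, "hi": hi,
                       "length": hi - lo,
                       "prob": hits / len(queries)})
    return bounds, slices


queries = [(0, 40), (10, 30), (10, 40), (20, 30)]
bounds, slices = initial_slices(queries)
print(bounds)                       # five boundaries, hence four slices
print([s["prob"] for s in slices])  # query cumulative probability per slice
```

Because no initial slice is cut by any query, its query cumulative probability equals that of each of its records, so the initial fragmentation carries no false hits.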
The method reduces the management and maintenance cost of the data, obtains the optimal data query cost, and improves the query efficiency.
Drawings
Fig. 1 is a schematic diagram of an embodiment in which the optimal slicing position of data is necessarily located on the boundary of a range query.
Detailed Description
The invention will be further illustrated, but not limited, by the following description of the embodiments with reference to the accompanying drawings.
Example:
A bottom-up merging data fragmentation optimization method based on a range query boundary set, which differs from the prior art, comprises the following steps:
1) establishing a data access probability model under a range query load: the set of all boundaries of the range queries defined on the data set is called the range query boundary set. Under a record-based data organization, the query cumulative probability of a data record is the number of times the record is accessed by the query load divided by the total number of queries. Under a slice-based data organization, let the k-th data slice $DS_k$ have length $l_k$ and query cumulative probability $P_k$. Because an access to any record in $DS_k$ is carried out as an access to the whole data slice $DS_k$, the value of $P_k$ is the maximum query cumulative probability of the data records contained in $DS_k$. The query cost on data slice $DS_k$ is expressed as:
query cost of $DS_k$ = positioning addressing cost + data transmission cost, that is,
$$\text{query cost of } DS_k = (\text{cost of one disk positioning addressing}) \times P_k + l_k \times (\text{transmission cost per byte of data}) \times P_k$$
After data is fragmented, queries may produce "false hits": part of the data in a slice does not belong to the query result set but is accessed anyway, which brings extra transmission overhead. This extra transmission overhead is defined as the cost deviation and is denoted by $F_c$. The coarser the slice granularity, the smaller the positioning addressing cost of a data query, but the larger the transmission cost deviation and hence the data transmission cost; conversely, the finer the slice granularity, the larger the positioning addressing cost and the smaller the data transmission cost. The positioning addressing cost and the data transmission cost are thus two mutually constraining indexes, so data fragmentation under a skewed range query workload is an optimization problem;
2) initializing a fragmentation scheme P with the range query boundary set: assuming the range query boundary set contains $B$ distinct elements, the data set is initialized into $B-1$ data slices;
3) calculating the cost deviation caused by merging each pair of adjacent data slices: $F_c(DS_1, DS_2)$, $F_c(DS_2, DS_3)$, …, $F_c(DS_{i-1}, DS_i)$, $F_c(DS_i, DS_{i+1})$, …, $F_c(DS_{B-2}, DS_{B-1})$;
4) traversing the cost deviation array, finding the minimum cost deviation, and merging the two corresponding adjacent data slices: if $F_c(DS_i, DS_{i+1})$ is the minimum among the cost deviations, data slices $DS_i$ and $DS_{i+1}$ are merged, and the data slices after merging are $DS_1, …, DS_i, DS_{i+2}, …, DS_{B-1}$;
5) updating the two values in the cost deviation array that are affected by the merge in step 4) by recalculating $F_c$: for example, when $DS_i$ and $DS_{i+1}$ are merged into a new $DS_i$, the entries $F_c(DS_{i-1}, DS_i)$ and $F_c(DS_{i+1}, DS_{i+2})$ must be recalculated using the new $DS_i$;
6) jumping to step 4) and repeating the data slice merge until the optimal number of data slices is reached: each cost deviation $F_c(DS_i, DS_{i+1})$ can be computed in constant time, i.e. within $O(1)$. The first round computes $B-1$ cost deviations, and each later round computes the 2 cost deviations adjacent to the merged data slice. $B-K$ rounds are executed until $K$ data slices remain, so the total computation cost is $(B-1)+2(B-K)$, where $B$ is the cardinality of the range query boundary set and $K$ is the number of data slices in the final fragmentation.
The time complexity of the method of this embodiment is evaluated as follows: from step 6), the total computation cost is $(B-1)+2(B-K)$; omitting the constant terms, the time complexity of the algorithm is $O(B)$.
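The count from step 6) can be expanded as a short worked bound. The symbol $T(B, K)$ is introduced here only for illustration and denotes the number of cost deviation evaluations; it does not cover how the minimum is located in step 4).

```latex
\begin{aligned}
T(B, K) &= (B-1) + 2(B-K) = 3B - 2K - 1, \\
T(B, K) &\le 3B - 3 \quad \text{for } K \ge 1,
\qquad \text{hence } T(B, K) = O(B).
\end{aligned}
```

For instance, with $B = 1000$ boundaries and a target of $K = 100$ data slices, the count is $999 + 2 \times 900 = 2799$ evaluations.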
After the data is divided into data slices, the query cumulative probability distribution over the data slices is a fit of the cumulative probability distribution of the range queries. The fit is imperfect, and the discrepancy, called the fitting cost deviation, increases the range query cost on the data slices. As shown in Fig. 1, the area of the shaded portion is the fitting cost deviation caused by the data slices.
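The shaded area can be illustrated numerically. The sketch below assumes unit-length records and treats the fitting cost deviation as the summed gap between each slice's query cumulative probability (the maximum over its records) and the probabilities of the individual records; the function name and the sample values are hypothetical.

```python
def fitting_cost_deviation(record_probs, cut_points):
    """Area between slice-level and record-level query cumulative probability.

    record_probs[i] is the query cumulative probability of record i.
    cut_points are the record indices at which a new slice starts
    (index 0 is implied and must not be listed).
    """
    edges = [0, *sorted(cut_points), len(record_probs)]
    area = 0.0
    for start, end in zip(edges, edges[1:]):
        chunk = record_probs[start:end]
        slice_prob = max(chunk)  # P_k is the maximum over the slice's records
        area += sum(slice_prob - p for p in chunk)
    return area


probs = [0.25, 0.25, 0.75, 0.75, 1.0, 0.5]
print(fitting_cost_deviation(probs, [2, 4, 5]))  # cuts on every probability change
print(fitting_cost_deviation(probs, [2, 4]))     # last two records share a slice
print(fitting_cost_deviation(probs, []))         # a single slice
```

In this sample the deviation is zero when every change in probability is a cut, and it grows as slices become coarser.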
To serve skewed range queries, the access pattern of the range queries must be sensed; based on it, the data is divided into data slices so that data strongly associated in the access pattern falls into the same data slice. A data slice model built on this association awareness lets data slices be hit fully, or in large proportion, when accessed by range queries, which reduces the transmission cost deviation and thus improves query efficiency.
Under a skewed range query workload, the data fragmentation should adapt to the access pattern of the range queries as closely as possible, so as to reduce the transmission cost deviation in data queries, minimize the range query cost on the data set, and obtain optimal query performance. To keep the query cumulative probability $P_k$ of each $DS_k$ low, the optimal slicing positions must fall on range query boundaries. As shown in Fig. 1, if a slicing position does not fall on a range query boundary, for example at $b'_2$, then the data in $[b'_2, b_2]$ is assigned to data slice $DS_3$, the query cumulative probability of the data in $[b'_2, b_2]$ increases, and the query cost increases. Therefore, with a fragmentation scheme whose slicing positions fall on range query boundaries, the query cumulative probability distribution of the data slices best fits the query load probability distribution on the data set.
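The effect of an off-boundary cut can be checked with a small numeric sketch. It applies the query cost model of step 1) with assumed constants, one-byte records, and made-up probabilities; a query boundary is assumed at record index 4, and the alternative cut at index 3 plays the role of $b'_2$. All names are hypothetical.

```python
SEEK_COST = 10.0  # assumed cost of one disk positioning addressing operation
BYTE_COST = 1.0   # assumed transmission cost per byte; records are one byte each


def total_query_cost(record_probs, cut_points):
    """Sum of per-slice query costs for the slices defined by cut_points."""
    edges = [0, *sorted(cut_points), len(record_probs)]
    cost = 0.0
    for start, end in zip(edges, edges[1:]):
        p = max(record_probs[start:end])  # P_k of the slice
        cost += SEEK_COST * p + (end - start) * BYTE_COST * p
    return cost


# Records 0-3 are hit by a quarter of the queries, records 4-7 by three quarters,
# so a range query boundary lies at index 4.
probs = [0.25] * 4 + [0.75] * 4

on_boundary = total_query_cost(probs, [4])
off_boundary = total_query_cost(probs, [3])  # record 3 lands in the hotter slice
print(on_boundary, off_boundary)
```

With the same number of slices, the off-boundary cut costs more because record 3 is now transmitted whenever the hotter slice is accessed.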
The method presented in this example aims to minimize the fitting cost bias to optimize the range query performance on the data set.
This embodiment first initializes the data set into a number of data slices and then iteratively merges adjacent data slices, each time merging the adjacent pair that produces the minimum cost deviation.
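The merge loop of steps 3) to 6) can be sketched as below. This is a sketch under assumptions, not the patented implementation: slices are `(length, prob)` tuples, the target count `k` is supplied by the caller, the minimum is found by a plain linear scan, and $F_c$ is assumed to be the extra transmission cost after a merge because the text does not give a closed form. The function and parameter names are hypothetical.

```python
def merge_slices(slices, k, byte_cost=1.0):
    """Greedily merge adjacent (length, prob) slices until k slices remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    slices = list(slices)

    def deviation(a, b):
        # Assumed F_c: transmission cost after the merge minus before the merge.
        merged_prob = max(a[1], b[1])
        return byte_cost * ((a[0] + b[0]) * merged_prob - a[0] * a[1] - b[0] * b[1])

    # Step 3: one cost deviation per pair of adjacent slices.
    dev = [deviation(a, b) for a, b in zip(slices, slices[1:])]

    while len(slices) > k:
        # Step 4: find the pair with the minimum cost deviation and merge it.
        i = min(range(len(dev)), key=dev.__getitem__)
        a, b = slices[i], slices[i + 1]
        slices[i:i + 2] = [(a[0] + b[0], max(a[1], b[1]))]
        del dev[i]
        # Step 5: only the two deviations touching the merged slice change.
        if i > 0:
            dev[i - 1] = deviation(slices[i - 1], slices[i])
        if i < len(dev):
            dev[i] = deviation(slices[i], slices[i + 1])
    return slices


initial = [(10, 0.25), (10, 0.75), (10, 1.0), (10, 0.5)]
print(merge_slices(initial, 2))
```

In this example the two middle slices are merged first because their probabilities are closest, and the cold first slice stays separate.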
Claims (1)
1. A bottom-up merging data slicing optimization method based on a range query boundary set is characterized by comprising the following steps:
1) establishing a data access probability model under a range query load: the set of all boundaries of the range queries defined on the data set is called the range query boundary set; under a record-based data organization, the query cumulative probability of a data record is the number of times the record is accessed by the query load divided by the total number of queries; under a slice-based data organization, the k-th data slice $DS_k$ is defined to have length $l_k$ and query cumulative probability $P_k$, where the value of $P_k$ is the maximum query cumulative probability of the data records contained in $DS_k$; the query cost on data slice $DS_k$ is expressed as:
query cost of $DS_k$ = positioning addressing cost + data transmission cost, that is,
$$\text{query cost of } DS_k = (\text{cost of one disk positioning addressing}) \times P_k + l_k \times (\text{transmission cost per byte of data}) \times P_k$$
after data fragmentation, query "false hits" occur, that is, part of the data in a slice does not belong to the query result set but is nevertheless accessed, which brings extra transmission overhead; this extra transmission overhead is defined as the cost deviation and is denoted by $F_c$;
2) initializing a fragmentation scheme P with the range query boundary set: assuming the range query boundary set contains $B$ distinct elements, the data set is initialized into $B-1$ data slices;
3) calculating the cost deviation caused by merging each pair of adjacent data slices: $F_c(DS_1, DS_2)$, $F_c(DS_2, DS_3)$, …, $F_c(DS_{i-1}, DS_i)$, $F_c(DS_i, DS_{i+1})$, …, $F_c(DS_{B-2}, DS_{B-1})$;
4) traversing the cost deviation array, finding the minimum cost deviation, and merging the two corresponding adjacent data slices: if $F_c(DS_i, DS_{i+1})$ is the minimum among the cost deviations, data slices $DS_i$ and $DS_{i+1}$ are merged, and the data slices after merging are $DS_1, …, DS_i, DS_{i+2}, …, DS_{B-1}$;
5) updating the two values in the cost deviation array that are affected by the merge in step 4) by recalculating $F_c$;
6) jumping to step 4) and repeating the data slice merge until the optimal number of data slices is reached: each cost deviation $F_c(DS_i, DS_{i+1})$ is computed in constant time, i.e. within $O(1)$; the first round computes $B-1$ cost deviations, and each later round computes the 2 cost deviations adjacent to the merged data slice; $B-K$ rounds are executed until $K$ data slices remain, so the total computation cost is $(B-1)+2(B-K)$, where $B$ is the cardinality of the range query boundary set and $K$ is the number of data slices in the final fragmentation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810194425.9A CN108416027B (en) | 2018-03-09 | 2018-03-09 | Merged data fragmentation optimization method based on range query boundary set |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108416027A (en) | 2018-08-17 |
CN108416027B (en) | 2021-07-20 |
Family
ID=63130829
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810194425.9A (Expired - Fee Related), published as CN108416027B (en) | Merged data fragmentation optimization method based on range query boundary set | 2018-03-09 | 2018-03-09 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108416027B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101256579A (en) * | 2008-04-08 | 2008-09-03 | ZTE Corporation | Method for inquesting data organization in database |
CN104516906A (en) * | 2013-09-29 | 2015-04-15 | NEC (China) Co., Ltd. | Adaptive indexing method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8543579B2 (en) * | 2005-06-17 | 2013-09-24 | International Business Machines Corporation | Range query methods and apparatus |
Non-Patent Citations (2)
Title |
---|
A distributed range query framework for the internet of things; Congcong Zhang, Tingting Zhang, Mei Wang; 2015 18th International Conference on Intelligence in Next Generation Networks; 2015-04-02; full text * |
HiBase: an efficient HBase query technology and system based on hierarchical indexing; Ge Wei, Luo Shengmei, Zhou Wenhui; Chinese Journal of Computers; 2016-01-31; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN108416027A (en) | 2018-08-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210720 |