CN108416027B - Merged data fragmentation optimization method based on range query boundary set - Google Patents

Info

Publication number
CN108416027B
Authority
CN
China
Prior art keywords
data
cost
query
slice
deviation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810194425.9A
Other languages
Chinese (zh)
Other versions
CN108416027A (en)
Inventor
葛微
李先贤
王金艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University
Priority to CN201810194425.9A
Publication of CN108416027A
Application granted
Publication of CN108416027B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2453: Query optimisation

Abstract

The invention discloses a bottom-up merging data slicing optimization method based on a range query boundary set, comprising the following steps: 1) establishing a data access probability model under a range query load; 2) initializing a fragmentation scheme P using the range query boundary set; 3) calculating the cost deviation $F_c$ caused by merging each pair of adjacent data slices; 4) traversing the cost deviation array, finding the minimum cost deviation, and merging the two corresponding adjacent data slices; 5) updating the two cost deviation values in the cost deviation array that are affected by the merge in step 4) by recalculating $F_c$; 6) jumping to step 4) and repeating the data slice merging until the optimal number of data slices is reached. The method reduces the management and maintenance cost of the data, obtains the optimal data query cost, and improves query efficiency.

Description

Merged data fragmentation optimization method based on range query boundary set
Technical Field
The invention relates to data fragmentation optimization for big data under skewed range query loads, and in particular to a bottom-up merging data fragmentation optimization method based on a range query boundary set.
Background
Data items are related to one another. Data skew means that these relations follow certain patterns, and discovering and exploiting such patterns is an effective approach to query optimization. Under a skewed range query load, some consecutive records on a given attribute of the data are often hit by the same range queries. From a data management perspective, records that are frequently hit together can be treated as a whole and identified by a single piece of metadata, then read or skipped as a unit during a query, which greatly reduces the management and maintenance cost of those records. To obtain optimal range query performance, the optimal positions at which to slice the data must lie on range query boundaries, because neighboring data that are never separated by a range query should be treated as a whole and stored in the same data slice.
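As a minimal sketch of why consecutive records end up being hit together, the following Python snippet counts, for each record, the fraction of queries that touch it. The record indexing, the half-open query ranges and the function name `access_probabilities` are illustrative assumptions, not part of the patent.

```python
from typing import List, Tuple


def access_probabilities(num_records: int,
                         queries: List[Tuple[int, int]]) -> List[float]:
    """Fraction of queries touching each record.

    Assumption: records are indexed 0..num_records-1 and each query is a
    half-open index range [lo, hi).
    """
    hits = [0] * num_records
    for lo, hi in queries:
        for r in range(max(lo, 0), min(hi, num_records)):
            hits[r] += 1
    total = len(queries)
    if total == 0:
        return [0.0] * num_records
    return [h / total for h in hits]


# Hypothetical skewed load: three of the four queries ask for the same range.
probs = access_probabilities(8, [(0, 4), (0, 4), (2, 6), (0, 4)])
print(probs)
```

In this toy load the probability only changes at indices 2, 4 and 6, which are exactly the query boundaries that fall inside the data, so the records between two neighboring boundaries can be managed as one unit.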
Disclosure of Invention
The invention aims to provide an efficient optimized slicing method for a data set that addresses the shortcomings of the prior art. The method uses the range query boundary set to initialize the data slices and then reaches an optimal slicing efficiently through bottom-up merging, which reduces the management and maintenance cost of the data as well as the positioning addressing cost and transmission cost of data queries, and thereby improves query efficiency.
The technical scheme for realizing the purpose of the invention is as follows:
A bottom-up merging data slicing optimization method based on a range query boundary set, which differs from the prior art in comprising the following steps:
1) establishing a data access probability model under a range query load: the set formed by all boundaries of the range queries defined on the data set is called the range query boundary set. Under a record-based data organization, the query cumulative probability of a data record is the number of times the record is accessed by the query load divided by the total number of queries. Under a data-slice-based organization, the k-th data slice $DS_k$ has length $l_k$ and query cumulative probability $P_k$. Because access to any record in $DS_k$ is carried out as access to the whole slice $DS_k$, the value of $P_k$ is the maximum of the query cumulative probabilities of the data records contained in $DS_k$. The query cost on data slice $DS_k$ is expressed as:
query cost of $DS_k$ = positioning addressing cost + data transmission cost
= (disk addressing cost per positioning × $P_k$) + ($l_k$ × transmission cost per byte of data × $P_k$)
After the data is sliced, a query "false hit" may occur, that is, part of the data in a slice does not belong to the query result set but is accessed anyway, which brings extra transmission overhead. This extra transmission overhead is defined as the cost deviation and is denoted $F_c$. The coarser the slice granularity, the smaller the positioning addressing cost of a data query, but the larger the transmission cost deviation and hence the data transmission cost; conversely, the finer the slice granularity, the larger the positioning addressing cost and the smaller the data transmission cost. The positioning addressing cost and the data transmission cost are thus two mutually constraining indexes, so data fragmentation under a skewed range query workload is an optimization problem;
2) initializing a fragmentation scheme P with the range query boundary set: assuming there are B distinct elements in the range query boundary set, the data set is initialized into B-1 data slices;
3) calculating the cost deviations caused by merging each pair of adjacent data slices: $F_c(DS_1, DS_2)$, $F_c(DS_2, DS_3)$, …, $F_c(DS_{i-1}, DS_i)$, $F_c(DS_i, DS_{i+1})$, …, $F_c(DS_{B-2}, DS_{B-1})$;
4) traversing the cost deviation array, finding the minimum cost deviation, and merging the two corresponding adjacent data slices: supposing $F_c(DS_i, DS_{i+1})$ is the minimum among the cost deviations, data slices $DS_i$ and $DS_{i+1}$ are merged, and the data slices after merging are: $DS_1, …, DS_i, DS_{i+2}, …, DS_{B-1}$;
5) updating the two cost deviation values in the cost deviation array that are affected by the merge in step 4) by recalculating $F_c$: for example, when data slices $DS_i$ and $DS_{i+1}$ are merged into a new $DS_i$, $F_c(DS_{i-1}, DS_i)$ and $F_c(DS_{i+1}, DS_{i+2})$ need to be recalculated;
6) jumping to step 4) and repeating the data slice merging until the optimal number of data slices is reached: each cost deviation $F_c(DS_i, DS_{i+1})$ can be computed in constant time, i.e. in $O(1)$. The first round of the loop computes B-1 cost deviations, and each later round computes the 2 cost deviations adjacent to the merged data slice. B-K rounds are executed until K data slices remain, so the total computation cost is (B-1) + 2(B-K), where B is the cardinality of the range query boundary set and K is the number of data slices in the final fragmentation.
From step 6), the total computation cost is (B-1) + 2(B-K); omitting constant terms, the time complexity of the method of this technical scheme is $O(B)$.
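The cost model of step 1) can be sketched in Python as follows. The constants `SEEK_COST` and `BYTE_COST` and the helper names are hypothetical values chosen for illustration; the patent does not give concrete numbers. The sketch only shows how addressing cost and transmission cost pull in opposite directions when two slices are merged.

```python
from typing import List, Tuple

SEEK_COST = 100.0  # hypothetical disk addressing cost per positioning
BYTE_COST = 1.0    # hypothetical transmission cost per byte

Slice = Tuple[int, float]  # (length l_k in bytes, query cumulative probability P_k)


def slice_cost(length: int, prob: float) -> float:
    """Query cost of one slice: addressing cost * P_k + l_k * byte cost * P_k."""
    return SEEK_COST * prob + length * BYTE_COST * prob


def scheme_cost(slices: List[Slice]) -> float:
    """Total query cost of a fragmentation scheme."""
    return sum(slice_cost(length, prob) for length, prob in slices)


# Small slices: merging saves one addressing operation and wins overall.
fine_small = [(50, 1.0), (50, 0.5)]
coarse_small = [(100, 1.0)]  # merged slice takes the larger probability

# Large slices: the extra bytes transferred by false hits dominate.
fine_large = [(500, 1.0), (500, 0.5)]
coarse_large = [(1000, 1.0)]

print(scheme_cost(fine_small), scheme_cost(coarse_small))
print(scheme_cost(fine_large), scheme_cost(coarse_large))
```

With these assumed constants the merged scheme is cheaper for the small slices and more expensive for the large ones, which is the trade-off that makes the choice of slice granularity an optimization problem.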
Under a skewed range query workload, the data slices should fit the access pattern of the range queries as closely as possible in order to reduce the data transmission cost deviation. If a slice position does not lie on a range query boundary, it introduces needless transmission cost deviation, so the optimal slice positions of the data always lie on range query boundaries. Based on this conclusion, slice positions are searched for only among the boundary points of the range queries; this is what is meant by a data slicing optimization method based on the range query boundary set. Accordingly, in step 2) of the technical scheme the data slices are first initialized from the range query boundary set; adjacent data slices are then merged iteratively, each time selecting the pair of adjacent data slices whose merge produces the minimum cost deviation.
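The initialization from the boundary set can be sketched as follows. The representation of a query as a `(lo, hi)` pair and the function names `boundary_set` and `initial_slices` are assumptions made for this example.

```python
from typing import List, Tuple


def boundary_set(queries: List[Tuple[int, int]]) -> List[int]:
    """Sorted distinct boundaries (lower and upper ends) of all range queries."""
    return sorted({b for lo, hi in queries for b in (lo, hi)})


def initial_slices(boundaries: List[int]) -> List[Tuple[int, int]]:
    """B distinct boundaries yield B-1 initial slices between neighbors."""
    return list(zip(boundaries, boundaries[1:]))


# Hypothetical query load on some key attribute.
queries = [(0, 40), (10, 40), (10, 25), (0, 40)]
bounds = boundary_set(queries)
slices = initial_slices(bounds)
print(bounds)
print(slices)
```

Here the four queries have B = 4 distinct boundaries, so the data set starts out as B-1 = 3 slices, and no query ever cuts through the middle of one of them.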
The method reduces the management and maintenance cost of the data, obtains the optimal data query cost, and improves the query efficiency.
Drawings
Fig. 1 is a schematic diagram of an embodiment in which the optimal slicing position of data is necessarily located on the boundary of a range query.
Detailed Description
The invention will be further illustrated, but not limited, by the following description of the embodiments with reference to the accompanying drawings.
Example:
A bottom-up merging data slicing optimization method based on a range query boundary set, which differs from the prior art in comprising the following steps:
1) establishing a data access probability model under a range query load: the set formed by all boundaries of the range queries defined on the data set is called the range query boundary set. Under a record-based data organization, the query cumulative probability of a data record is the number of times the record is accessed by the query load divided by the total number of queries. Under a data-slice-based organization, the k-th data slice $DS_k$ has length $l_k$ and query cumulative probability $P_k$. Because access to any record in $DS_k$ is carried out as access to the whole slice $DS_k$, the value of $P_k$ is the maximum of the query cumulative probabilities of the data records contained in $DS_k$. The query cost on data slice $DS_k$ is expressed as:
query cost of $DS_k$ = positioning addressing cost + data transmission cost
= (disk addressing cost per positioning × $P_k$) + ($l_k$ × transmission cost per byte of data × $P_k$)
After the data is sliced, a query "false hit" may occur, that is, part of the data in a slice does not belong to the query result set but is accessed anyway, which brings extra transmission overhead. This extra transmission overhead is defined as the cost deviation and is denoted $F_c$. The coarser the slice granularity, the smaller the positioning addressing cost of a data query, but the larger the transmission cost deviation and hence the data transmission cost; conversely, the finer the slice granularity, the larger the positioning addressing cost and the smaller the data transmission cost. The positioning addressing cost and the data transmission cost are thus two mutually constraining indexes, so data fragmentation under a skewed range query workload is an optimization problem;
2) initializing a fragmentation scheme P with the range query boundary set: assuming there are B distinct elements in the range query boundary set, the data set is initialized into B-1 data slices;
3) calculating the cost deviations caused by merging each pair of adjacent data slices: $F_c(DS_1, DS_2)$, $F_c(DS_2, DS_3)$, …, $F_c(DS_{i-1}, DS_i)$, $F_c(DS_i, DS_{i+1})$, …, $F_c(DS_{B-2}, DS_{B-1})$;
4) traversing the cost deviation array, finding the minimum cost deviation, and merging the two corresponding adjacent data slices: supposing $F_c(DS_i, DS_{i+1})$ is the minimum among the cost deviations, data slices $DS_i$ and $DS_{i+1}$ are merged, and the data slices after merging are: $DS_1, …, DS_i, DS_{i+2}, …, DS_{B-1}$;
5) updating the two cost deviation values in the cost deviation array that are affected by the merge in step 4) by recalculating $F_c$: for example, when data slices $DS_i$ and $DS_{i+1}$ are merged into a new $DS_i$, $F_c(DS_{i-1}, DS_i)$ and $F_c(DS_{i+1}, DS_{i+2})$ need to be recalculated;
6) jumping to step 4) and repeating the data slice merging until the optimal number of data slices is reached: each cost deviation $F_c(DS_i, DS_{i+1})$ can be computed in constant time, i.e. in $O(1)$. The first round of the loop computes B-1 cost deviations, and each later round computes the 2 cost deviations adjacent to the merged data slice. B-K rounds are executed until K data slices remain, so the total computation cost is (B-1) + 2(B-K), where B is the cardinality of the range query boundary set and K is the number of data slices in the final fragmentation.
The time complexity of the method of the present embodiment is evaluated as follows: from step 6), the total computation cost is (B-1) + 2(B-K); omitting constant terms, the time complexity of the algorithm is $O(B)$.
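The count of cost deviation evaluations in step 6) can be expanded as a short worked derivation. It assumes $1 \le K \le B$ and, like the count in step 6), it covers only the evaluations of $F_c$, each taken to cost $O(1)$.

```latex
\begin{aligned}
T(B, K) &= \underbrace{(B - 1)}_{\text{first round}}
         + \underbrace{2\,(B - K)}_{\text{2 updates in each of } B-K \text{ rounds}} \\
        &= 3B - 2K - 1 \\
        &\le 3B - 3 \qquad (\text{since } K \ge 1),
\end{aligned}
\qquad\text{hence}\qquad T(B, K) = O(B).
```

The bound grows linearly in the number of query boundaries B and does not depend on the number of records in the data set.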
After the data is divided into data slices, the query cumulative probability distribution over the data slices is a fit to the query cumulative probability distribution of the range queries. This fit has a deviation, called the fitting cost deviation, which increases the range query cost on the data slices. As shown in Fig. 1, the area of the shaded portion is the fitting cost deviation caused by the data slices.
To serve skewed range queries, the access pattern of the range queries needs to be perceived; based on this, the data is divided into data slices so that data strongly associated in the access pattern falls in the same data slice. A data slice model based on such association awareness lets data slices be hit fully, or in large proportion, when accessed by range queries, and reduces the transmission cost deviation of the data, thereby improving query efficiency.
Under a skewed range query workload, the data slices should fit the access pattern of the range queries as closely as possible, so as to reduce the transmission cost deviation in data queries, minimize the range query cost on the data set, and obtain optimal query performance. To keep the query cumulative probability $P_k$ of $DS_k$ low, the optimal slice positions must fall on range query boundaries. As shown in Fig. 1, if a slice position does not fall on a range query boundary, for example at $b'_2$, then the data in $[b'_2, b_2]$ is assigned to data slice $DS_3$, the query cumulative probability of that data increases, and the query cost increases. Therefore, for a slicing scheme whose slice positions fall on range query boundaries, the query cumulative probability distribution of the data slices best fits the query load probability distribution on the data set.
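The effect of a slice position that misses a query boundary can be illustrated with a small Python sketch. It assumes unit-length records and measures the fitting cost deviation as the summed gap between each slice's probability (the maximum over its records) and the records' own probabilities; the function name `fitting_deviation` and this exact measure are assumptions for illustration, since the figure defining the shaded area is not reproduced here.

```python
from typing import List


def fitting_deviation(record_probs: List[float], cuts: List[int]) -> float:
    """Sum over records of (slice probability - record probability).

    Assumptions: every record has unit length, and `cuts` lists the record
    indices at which a new slice starts.
    """
    edges = [0] + sorted(cuts) + [len(record_probs)]
    total = 0.0
    for start, end in zip(edges, edges[1:]):
        chunk = record_probs[start:end]
        if chunk:
            slice_prob = max(chunk)  # P_k is the maximum over the slice
            total += sum(slice_prob - p for p in chunk)
    return total


# Hypothetical probabilities: a query boundary lies between index 1 and 2.
probs = [0.5, 0.5, 0.25, 0.25]
on_boundary = fitting_deviation(probs, [2])   # cut exactly at the boundary
off_boundary = fitting_deviation(probs, [3])  # cut one record too late
print(on_boundary, off_boundary)
```

Cutting at the boundary leaves no deviation, while the misplaced cut drags one low-probability record into the high-probability slice and produces a deviation of 0.25 under these assumptions.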
The method presented in this example aims to minimize the fitting cost deviation in order to optimize range query performance on the data set.
The embodiment first initializes the data set into a number of data slices and then iteratively merges adjacent data slices, each time selecting the pair of adjacent data slices whose merge produces the minimum cost deviation.
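The merge loop of steps 3) to 6) can be sketched in Python as below. This is a sketch under stated assumptions, not the patented implementation: the formula in `merge_deviation` (extra bytes transferred when both slices take the larger probability) is an assumed form of $F_c$, the target slice count `k` is passed in rather than derived, and the names `Slice`, `merge_deviation` and `greedy_merge` are hypothetical.

```python
from typing import List, Tuple

Slice = Tuple[int, float]  # (length l_k, query cumulative probability P_k)


def merge_deviation(a: Slice, b: Slice) -> float:
    """Assumed form of F_c: extra transfer caused by raising both slices
    to the larger of the two probabilities (per-byte cost taken as 1)."""
    p = max(a[1], b[1])
    return (p - a[1]) * a[0] + (p - b[1]) * b[0]


def greedy_merge(slices: List[Slice], k: int) -> List[Slice]:
    """Merge the cheapest adjacent pair until at most k slices remain."""
    slices = list(slices)
    # Step 3: one cost deviation per adjacent pair.
    dev = [merge_deviation(slices[i], slices[i + 1])
           for i in range(len(slices) - 1)]
    while len(slices) > max(k, 1):
        # Step 4: traverse the array, find the minimum, merge that pair.
        i = min(range(len(dev)), key=dev.__getitem__)
        a, b = slices[i], slices[i + 1]
        slices[i:i + 2] = [(a[0] + b[0], max(a[1], b[1]))]
        del dev[i]
        # Step 5: recompute only the entries next to the merged slice.
        if i > 0:
            dev[i - 1] = merge_deviation(slices[i - 1], slices[i])
        if i < len(slices) - 1:
            dev[i] = merge_deviation(slices[i], slices[i + 1])
    return slices


# Hypothetical initial slices taken from a boundary set.
start = [(10, 0.9), (5, 0.85), (20, 0.1), (8, 0.12)]
result = greedy_merge(start, 2)
print(result)
```

In this example the two hot slices are merged first and the two cold slices second, so hot and cold data stay in separate slices. The sketch follows the source in scanning the array for the minimum on every round; a priority queue would be a different design choice and is not what the text describes.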

Claims (1)

1. A bottom-up merging data slicing optimization method based on a range query boundary set, characterized by comprising the following steps:
1) establishing a data access probability model under a range query load: the set formed by all boundaries of the range queries defined on the data set is called the range query boundary set; under a record-based data organization, the query cumulative probability of a data record is the number of times the record is accessed by the query load divided by the total number of queries; under a data-slice-based organization, the k-th data slice $DS_k$ has length $l_k$ and query cumulative probability $P_k$, the value of $P_k$ being the maximum of the query cumulative probabilities of the data records contained in $DS_k$; the query cost on data slice $DS_k$ is expressed as:
query cost of $DS_k$ = positioning addressing cost + data transmission cost
= (disk addressing cost per positioning × $P_k$) + ($l_k$ × transmission cost per byte of data × $P_k$); after the data is sliced, a query "false hit" occurs, that is, part of the data in a slice does not belong to the query result set but is accessed anyway, which brings extra transmission overhead; this extra transmission overhead is defined as the cost deviation and is denoted $F_c$;
2) initializing a fragmentation scheme P with the range query boundary set: assuming there are B distinct elements in the range query boundary set, the data set is initialized into B-1 data slices;
3) calculating the cost deviations caused by merging each pair of adjacent data slices: $F_c(DS_1, DS_2)$, $F_c(DS_2, DS_3)$, …, $F_c(DS_{i-1}, DS_i)$, $F_c(DS_i, DS_{i+1})$, …, $F_c(DS_{B-2}, DS_{B-1})$;
4) traversing the cost deviation array, finding the minimum cost deviation, and merging the two corresponding adjacent data slices: supposing $F_c(DS_i, DS_{i+1})$ is the minimum among the cost deviations, data slices $DS_i$ and $DS_{i+1}$ are merged, and the data slices after merging are: $DS_1, …, DS_i, DS_{i+2}, …, DS_{B-1}$;
5) updating the two cost deviation values in the cost deviation array that are affected by the merge in step 4) by recalculating $F_c$;
6) jumping to step 4) and repeating the data slice merging until the optimal number of data slices is reached: each cost deviation $F_c(DS_i, DS_{i+1})$ is computed in constant time, i.e. in $O(1)$; the first round of the loop computes B-1 cost deviations, and each later round computes the 2 cost deviations adjacent to the merged data slice; B-K rounds are executed until K data slices remain, so the total computation cost is (B-1) + 2(B-K), where B is the cardinality of the range query boundary set and K is the number of data slices in the final fragmentation.
CN201810194425.9A 2018-03-09 2018-03-09 Merged data fragmentation optimization method based on range query boundary set Expired - Fee Related CN108416027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810194425.9A CN108416027B (en) 2018-03-09 2018-03-09 Merged data fragmentation optimization method based on range query boundary set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810194425.9A CN108416027B (en) 2018-03-09 2018-03-09 Merged data fragmentation optimization method based on range query boundary set

Publications (2)

Publication Number Publication Date
CN108416027A CN108416027A (en) 2018-08-17
CN108416027B CN108416027B (en) 2021-07-20

Family

ID=63130829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810194425.9A Expired - Fee Related CN108416027B (en) 2018-03-09 2018-03-09 Merged data fragmentation optimization method based on range query boundary set

Country Status (1)

Country Link
CN (1) CN108416027B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101256579A (en) * 2008-04-08 2008-09-03 中兴通讯股份有限公司 Method for inquesting data organization in database
CN104516906A (en) * 2013-09-29 2015-04-15 日电(中国)有限公司 Adaptive indexing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8543579B2 (en) * 2005-06-17 2013-09-24 International Business Machines Corporation Range query methods and apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A distributed range query framework for the internet of things; Congcong Zhang; Tingting Zhang; Mei Wang; 《2015 18th International Conference on Intelligence in Next Generation Networks》; 20150402; full text *
HiBase: an efficient HBase query technique and system based on hierarchical indexing; 葛微, 罗圣美, 周文辉; 《计算机学报》 (Chinese Journal of Computers); 20160131; full text *

Also Published As

Publication number Publication date
CN108416027A (en) 2018-08-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20210720)