CN108416027B

CN108416027B - Merged data fragmentation optimization method based on range query boundary set

Info

Publication number: CN108416027B
Application number: CN201810194425.9A
Authority: CN
Inventors: 葛微; 李先贤; 王金艳
Original assignee: Guangxi Normal University
Current assignee: Guangxi Normal University
Priority date: 2018-03-09
Filing date: 2018-03-09
Publication date: 2021-07-20
Anticipated expiration: 2038-03-09
Also published as: CN108416027A

Abstract

The invention discloses a bottom-up merging data slicing optimization method based on a range query boundary set, comprising the following steps: 1) establishing a data access probability model under a range query load; 2) initializing a slicing scheme P with the range query boundary set; 3) calculating the cost deviation F_c caused by merging each pair of adjacent data slices; 4) traversing the cost deviation array, finding the minimum cost deviation, and merging the two corresponding adjacent data slices; 5) updating the two cost deviation values in the array that are affected by the merge in step 4) by recalculating F_c; 6) jumping to step 4) and repeating the data slice merging until the optimal number of data slices is reached. The method reduces the management and maintenance cost of the data, obtains the optimal data query cost, and improves query efficiency.

Description

Merged data fragmentation optimization method based on range query boundary set

Technical Field

The invention relates to data fragmentation optimization for big data under skewed range query loads, and in particular to a bottom-up merging data fragmentation optimization method based on a range query boundary set.

Background

Data items are associated with one another. Data skew means that these associations follow a certain pattern, and discovering and exploiting such patterns is an effective means of query optimization. Under a skewed range query load, certain consecutive records are often hit by the same range query on some attribute of the data. From a data management perspective, records that are frequently hit together can be treated as a whole and identified by a single piece of metadata, then read or skipped as a unit during a query, which greatly reduces the management and maintenance cost of those records. To obtain optimal range query performance, the optimal slicing positions must lie on range query boundaries, because neighboring data that are never separated by a range query should be treated as a whole and stored in the same data slice.

Disclosure of Invention

The invention aims to provide an efficient optimized slicing method for a data set that addresses the shortcomings of the prior art. The method uses the range query boundary set to initialize the data slices and then reaches the optimal slicing efficiently by bottom-up merging, which reduces the management and maintenance cost of the data as well as the positioning addressing cost and the transmission cost of data queries, and thereby improves query efficiency.

The technical scheme for realizing the purpose of the invention is as follows:

A bottom-up merging data slicing optimization method based on a range query boundary set which, unlike the prior art, comprises the following steps:

1) establishing a data access probability model under a range query load: the set formed by all boundaries of the range queries defined on the data set is called the range query boundary set. Under a record-based data organization, the query cumulative probability of a data record is the number of times the record is accessed by the query load divided by the total number of queries. Under a data-slice-based organization, the k-th data slice DS_k has length l_k and query cumulative probability P_k. Because an access to any record in DS_k is carried out as an access to the whole data slice DS_k, P_k is the maximum of the query cumulative probabilities of the data records contained in DS_k. The query cost on data slice DS_k is expressed as:

query cost of DS_k = positioning addressing cost + data transmission cost
= (disk addressing cost per positioning) × P_k + l_k × (transmission cost per byte of data) × P_k

After the data are sliced, a query may produce "false hits": part of the data in a slice does not belong to the query result set but is accessed anyway, which brings extra transmission overhead. This extra transmission overhead is defined as the cost deviation and is denoted by F_c. The coarser the slice granularity, the smaller the positioning addressing cost of a data query, but the larger the transmission cost deviation and hence the larger the data transmission cost; conversely, the finer the slice granularity, the larger the positioning addressing cost and the smaller the data transmission cost. Positioning addressing cost and data transmission cost are therefore two mutually constraining indicators, and data slicing is an optimization problem under a skewed range query workload;

2) initializing a slicing scheme P with the range query boundary set: assuming the range query boundary set contains B distinct elements, the data set is initialized into B-1 data slices;

3) calculating the cost deviation caused by merging each pair of adjacent data slices: F_c(DS_1, DS_2), F_c(DS_2, DS_3), …, F_c(DS_{i-1}, DS_i), F_c(DS_i, DS_{i+1}), …, F_c(DS_{B-2}, DS_{B-1});

4) traversing the cost deviation array, finding the minimum cost deviation, and merging the two corresponding adjacent data slices: supposing F_c(DS_i, DS_{i+1}) is the minimum in the cost deviation array, data slices DS_i and DS_{i+1} are merged, and the data slices after merging are: DS_1, …, DS_i, DS_{i+2}, …, DS_{B-1};

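5) updating the two cost deviation values in the array that are affected by the merge in step 4) by recalculating F_c: for example, when data slices DS_i and DS_{i+1} are merged into a new DS_i, the deviations on either side of the new slice must be recalculated, namely F_c(DS_{i-1}, DS_i) and the entry that was F_c(DS_{i+1}, DS_{i+2}) before the merge, which now pairs the new DS_i with DS_{i+2};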

6) jumping to step 4) and repeating the data slice merging until the optimal number of data slices is reached: each cost deviation F_c(DS_i, DS_{i+1}) can be computed in constant time, i.e. O(1). B-1 cost deviations are computed in the first round, and in each later round the 2 cost deviations adjacent to the merged data slice are recomputed; B-K rounds are executed until K data slices remain, so the total calculation cost is (B-1) + 2(B-K), where B is the cardinality of the range query boundary set and K is the number of data slices in the final slicing scheme.

From step 6), the total calculation cost is (B-1) + 2(B-K); omitting constant terms and factors, the time complexity of the method of this technical scheme, counted in cost deviation calculations, is O(B).
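As an illustration only, steps 2) to 6) above can be sketched in Python. The text does not give a closed-form expression for F_c, so the sketch assumes that the cost deviation of a merge is the extra transmission cost it causes, with the merged slice taking the larger of the two query cumulative probabilities as in step 1). The names merge_deviation, bottom_up_merge and byte_cost are hypothetical.

```python
from typing import List, Tuple

Slice = Tuple[int, float]  # (length l_k in bytes, query cumulative probability P_k)


def merge_deviation(a: Slice, b: Slice, byte_cost: float = 1.0) -> float:
    """Assumed form of F_c: extra transmission cost caused by merging a and b."""
    (la, pa), (lb, pb) = a, b
    merged_p = max(pa, pb)  # slice probability is the maximum over its records
    return byte_cost * ((la + lb) * merged_p - (la * pa + lb * pb))


def bottom_up_merge(slices: List[Slice], k: int) -> List[Slice]:
    """Greedily merge adjacent slices until only k slices remain."""
    slices = list(slices)
    # Step 3: one cost deviation per adjacent pair.
    dev = [merge_deviation(slices[i], slices[i + 1])
           for i in range(len(slices) - 1)]
    while len(slices) > k and dev:
        # Step 4: traverse the array, pick the minimum deviation, merge that pair.
        i = min(range(len(dev)), key=dev.__getitem__)
        (la, pa), (lb, pb) = slices[i], slices[i + 1]
        slices[i:i + 2] = [(la + lb, max(pa, pb))]
        del dev[i]
        # Step 5: recompute the (at most two) deviations next to the merged slice.
        if i > 0:
            dev[i - 1] = merge_deviation(slices[i - 1], slices[i])
        if i < len(dev):
            dev[i] = merge_deviation(slices[i], slices[i + 1])
    return slices


# Hypothetical input: four 10-byte slices, merged down to K = 2.
result = bottom_up_merge([(10, 0.5), (10, 0.5), (10, 0.1), (10, 0.4)], k=2)
print(result)
```

In this sample the two slices that share probability 0.5 are merged first, because that merge adds no false-hit transmission. The sketch keeps the plain array traversal of step 4).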

Under a skewed range query workload, the data slices should follow the access pattern of the range queries as closely as possible to reduce the transmission cost deviation. A slice position that does not lie on a range query boundary introduces needless transmission cost deviation, so the optimal slice positions always lie on range query boundaries. Based on this conclusion, slice positions are searched only among the boundary points of the range queries, which is what makes this a data slicing optimization method based on the range query boundary set. This is the basis for step 2) of the technical scheme, where the data slices are first initialized with the range query boundary set; adjacent data slices are then merged iteratively, each time choosing the adjacent pair whose merge produces the minimum cost deviation.
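A minimal sketch of the initialization in step 2), assuming each range query is given as a half-open key interval [lo, hi) and that the data of interest lie between the smallest and the largest boundary. The function name initial_slices and the tuple layout are hypothetical.

```python
from typing import List, Tuple


def initial_slices(queries: List[Tuple[int, int]]) -> List[Tuple[int, int, float]]:
    """Return (start, end, probability) for each slice between adjacent boundaries.

    Assumption: every query is a half-open key range [lo, hi).
    """
    # Range query boundary set: all distinct query endpoints, in key order.
    boundaries = sorted({b for q in queries for b in q})
    total = len(queries)
    slices = []
    for start, end in zip(boundaries, boundaries[1:]):
        # No boundary lies strictly inside [start, end), so every record in it
        # is hit by exactly the queries that cover the whole interval.
        hits = sum(1 for lo, hi in queries if lo <= start and end <= hi)
        slices.append((start, end, hits / total))
    return slices


# Hypothetical workload of three range queries.
workload = [(0, 40), (10, 30), (10, 20)]
for s in initial_slices(workload):
    print(s)
```

The workload has B = 5 distinct boundaries (0, 10, 20, 30, 40), so the sketch yields B-1 = 4 initial slices, each with a uniform access probability.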

The method reduces the management and maintenance cost of the data, obtains the optimal data query cost, and improves the query efficiency.
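The trade-off described in step 1) can be made concrete with a small sketch of the query cost model. The parameter names seek_cost (disk addressing cost per positioning) and byte_cost (transmission cost per byte), and all numeric values, are assumptions chosen for illustration.

```python
from typing import List, Tuple

Slice = Tuple[int, float]  # (length l_k in bytes, query cumulative probability P_k)


def scheme_cost(slices: List[Slice], seek_cost: float,
                byte_cost: float) -> Tuple[float, float]:
    """Return (positioning addressing cost, data transmission cost) of a scheme."""
    addressing = sum(seek_cost * p for _, p in slices)
    transmission = sum(length * byte_cost * p for length, p in slices)
    return addressing, transmission


fine = [(100, 0.5), (100, 0.1)]  # two slices
coarse = [(200, 0.5)]            # the same data as one slice, P = max(0.5, 0.1)

print(scheme_cost(fine, seek_cost=10.0, byte_cost=0.1))
print(scheme_cost(coarse, seek_cost=10.0, byte_cost=0.1))
```

With these sample values the coarse scheme has the lower addressing cost and the higher transmission cost, which is the tension the merging procedure balances.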

Drawings

Fig. 1 is a schematic diagram, for the embodiment, showing that the optimal slicing position of the data necessarily lies on a range query boundary.

Detailed Description

The invention will be further illustrated, but not limited, by the following description of the embodiments with reference to the accompanying drawings.

Example:

1) establishing a data access probability model under a range query load: the set formed by all boundaries of the range queries defined on the data set is defined as the range query boundary set. Under a record-based data organization, the query cumulative probability of a data record is the number of times the record is accessed by the query load divided by the total number of queries. Under a data-slice-based organization, the k-th data slice DS_k has length l_k and query cumulative probability P_k. Because an access to any record in DS_k is carried out as an access to the whole data slice DS_k, P_k is the maximum of the query cumulative probabilities of the data records contained in DS_k. The query cost on data slice DS_k is expressed as:

query cost of DS_k = (disk addressing cost per positioning) × P_k + l_k × (transmission cost per byte of data) × P_k

After the data are sliced, a query may produce "false hits": part of the data in a slice does not belong to the query result set but is accessed anyway, which brings extra transmission overhead. This extra transmission overhead is defined as the cost deviation and is denoted by F_c. The coarser the slice granularity, the smaller the positioning addressing cost of a data query, but the larger the transmission cost deviation and hence the larger the data transmission cost; conversely, the finer the slice granularity, the larger the positioning addressing cost and the smaller the data transmission cost. Positioning addressing cost and data transmission cost are therefore two mutually constraining indicators, and data slicing is an optimization problem under a skewed range query workload;

In the merging loop of steps 2) to 6) described above, each cost deviation F_c(DS_i, DS_{i+1}) can be computed in constant time, i.e. O(1). B-1 cost deviations are computed in the first round, and in each later round the 2 cost deviations adjacent to the merged data slice are recomputed; B-K rounds are executed until K data slices remain, so the total calculation cost is (B-1) + 2(B-K), where B is the cardinality of the range query boundary set and K is the number of data slices in the final slicing scheme.

The time complexity of the method of this embodiment is evaluated as follows: from step 6), the total calculation cost is (B-1) + 2(B-K); omitting constant terms and factors, the time complexity of the algorithm, counted in cost deviation calculations, is O(B).
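As a quick arithmetic check of the count (B-1) + 2(B-K) quoted from step 6), the following sketch evaluates it for sample values. The function name and the values B = 10, K = 4 are hypothetical; the formula itself is taken from the text as stated.

```python
def deviation_computations(b: int, k: int) -> int:
    """Number of F_c calculations according to the formula (B-1) + 2(B-K)."""
    if not 1 <= k <= b - 1:
        raise ValueError("expected 1 <= K <= B-1")
    first_round = b - 1        # one deviation per adjacent pair, as stated
    later_rounds = 2 * (b - k)  # two recalculations per merge round
    return first_round + later_rounds


print(deviation_computations(10, 4))  # 9 + 12 = 21
```

Since K is at least 1, the count never exceeds 3B, which is why it is linear in B.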

After the data are divided into data slices, the query cumulative probability distribution over the data slices is a fit of the range query cumulative probability distribution. The fit is not exact, and its error, called the fitting cost deviation, increases the range query cost on the data slices. As shown in Fig. 1, the area of the shaded portion is the fitting cost deviation caused by the data slices.

To serve skewed range queries well, the access pattern of the range queries must be sensed and the data divided into data slices accordingly, so that data strongly associated in the access pattern fall into the same data slice. Such an association-aware data slice model allows data slices to be hit entirely, or in large proportion, when accessed by range queries, which reduces the transmission cost deviation and thereby improves query efficiency.

Under a skewed range query workload, the data slices should follow the access pattern of the range queries as closely as possible, so as to reduce the transmission cost deviation in data queries, minimize the range query cost on the data set, and obtain optimal query performance. To keep the query cumulative probability P_k of each DS_k as low as possible, the optimal slice positions must fall on range query boundaries. As shown in Fig. 1, if a slice position does not fall on a range query boundary, for example at b'_2, then the data in [b'_2, b_2] are assigned to data slice DS_3, the query cumulative probability of the data in [b'_2, b_2] increases, and the query cost increases. A slicing scheme whose slice positions fall on range query boundaries therefore gives the best fit between the query cumulative probability distribution of the data slices and the query load probability distribution on the data set.
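The effect of moving a cut off a query boundary can be checked with a small sketch that applies the cost model of step 1) to two slicing schemes with the same number of slices. The record probabilities, the default unit costs, and the function name total_query_cost are assumptions for illustration; index 4 plays the role of the boundary b_2 and index 3 the role of b'_2.

```python
from typing import List


def total_query_cost(record_probs: List[float], cuts: List[int],
                     record_bytes: int = 1, seek_cost: float = 1.0,
                     byte_cost: float = 1.0) -> float:
    """Sum of addressing and transmission cost over all slices of a scheme."""
    edges = [0, *cuts, len(record_probs)]
    cost = 0.0
    for start, end in zip(edges, edges[1:]):
        chunk = record_probs[start:end]
        if not chunk:
            continue
        p = max(chunk)                      # P_k of the slice
        length = len(chunk) * record_bytes  # l_k of the slice
        cost += seek_cost * p + length * byte_cost * p
    return cost


# Records 0-3 are hit with probability 0.25, records 4-7 with 0.75,
# so the only range query boundary inside the data is at index 4.
probs = [0.25] * 4 + [0.75] * 4
on_boundary = total_query_cost(probs, cuts=[4])
off_boundary = total_query_cost(probs, cuts=[3])
print(on_boundary, off_boundary)
```

Cutting at index 3 pushes one low-probability record into the high-probability slice, so that record is transferred with probability 0.75 instead of 0.25 and the total cost rises.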

The method presented in this example aims to minimize the fitting cost deviation in order to optimize range query performance on the data set.

The embodiment first initializes the data set into a number of data slices and then iteratively merges adjacent data slices, each time choosing the adjacent pair whose merge produces the minimum cost deviation.

Claims

1. A bottom-up merging data slicing optimization method based on a range query boundary set is characterized by comprising the following steps:

1) establishing a data access probability model under a range query load: the set formed by all boundaries of the range queries defined on the data set is called the range query boundary set; under a record-based data organization, the query cumulative probability of a data record is the number of times the record is accessed by the query load divided by the total number of queries; under a data-slice-based organization, the k-th data slice DS_k is defined to have length l_k and query cumulative probability P_k, where P_k is the maximum of the query cumulative probabilities of the data records contained in DS_k; the query cost on data slice DS_k is expressed as:

query cost of DS_k = (disk addressing cost per positioning) × P_k + l_k × (transmission cost per byte of data) × P_k

after the data are sliced, query "false hits" occur, that is, part of the data in a slice does not belong to the query result set but is accessed anyway, which brings extra transmission overhead; this extra transmission overhead is defined as the cost deviation and is denoted by F_c;

3) calculating the cost deviation caused by merging each pair of adjacent data slices: F_c(DS_1, DS_2), F_c(DS_2, DS_3), …, F_c(DS_{i-1}, DS_i), F_c(DS_i, DS_{i+1}), …, F_c(DS_{B-2}, DS_{B-1});

4) traversing the cost deviation array, finding the minimum cost deviation, and merging the two corresponding adjacent data slices: supposing F_c(DS_i, DS_{i+1}) is the minimum in the cost deviation array, data slices DS_i and DS_{i+1} are merged, and the data slices after merging are: DS_1, …, DS_i, DS_{i+2}, …, DS_{B-1};

5) updating the two cost deviation values in the cost deviation array that are affected by the merge in step 4) by recalculating F_c;

6) jumping to step 4) and repeating the data slice merging until the optimal number of data slices is reached: the cost deviation F_c(DS_i, DS_{i+1}) is computed within constant time, i.e.