CN108304449B

CN108304449B - Big data Top-k query method based on self-adaptive data set partitioning mode

Info

Publication number: CN108304449B
Application number: CN201711305053.4A
Authority: CN
Inventors: 徐维祥; 赵博
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2017-12-11
Filing date: 2017-12-11
Publication date: 2022-02-15
Anticipated expiration: 2037-12-11
Also published as: CN108304449A

Abstract

The invention discloses a big data Top-k query method based on a self-adaptive data set partitioning mode, which comprises the following steps: initializing the system, and constructing a hyperplane cluster and a data set; carrying out self-adaptive division on the data set to obtain stable k-cut points; performing Top-k ordering query on the data set; and adaptively adjusting the system data set and establishing a common data set. The invention provides a big data Top-k query method based on a self-adaptive data set division mode, which is suitable for big data Top-k query in a cloud environment.

Description

Big data Top-k query method based on self-adaptive data set partitioning mode

Technical Field

The invention relates to a Top-k query method, and more particularly to a big data Top-k query method based on an adaptive data set partitioning mode.

Background

Distributed Top-k queries are of increasing interest as the amount of data increases. In a distributed Top-k (top k items) query, a central computing node aggregates data lists distributed at different geographic positions and computes the k objects with the largest global aggregate values, together with those values. Each item in a data list is a data pair < object, object value >, and both the object and the object value in the pair contain sensitive information of the data provider. Distributed Top-k query computation is widely applied in network and system monitoring, information acquisition, sensor networks, P2P systems, data flow control systems and similar technical fields.

From the aspect of data partitioning, the Top-k problem in a distributed environment falls into two major categories, namely vertical partitioning and horizontal partitioning. In vertical partitioning, data is partitioned according to attributes, similar to the column storage mode of a relational database; this partitioning mode was mostly used in early distributed Top-k query research. Much useful research on the Top-k query problem has been carried out in recent years. However, both relational databases and traditional distributed environments have difficulty coping effectively with Top-k queries in a big data environment, mainly because both the data objects and the processing methods have changed greatly.

At present, a big data environment mainly means a cloud environment, and the basic principle of data division in the cloud environment is to divide the data as evenly as possible across the servers. This uniformity concerns not only the amount of data; more importantly, for a particular application, the partitioning should ensure that the data on each server contributes as much as possible to the end result. The representative horizontal partitioning schemes in the Top-k domain are random partitioning, grid-based, angle-based, and hyperplane-based partitioning. Big data Top-k query in a cloud environment faces new challenges. The Top-k problem has a very direct solution under the MapReduce framework: sort the data with MapReduce and then return the top k values. This scheme fits the batch-processing character of MapReduce and is easy to implement, but its biggest drawback is excessive processing time. Every time a new query arrives, all data has to be processed again, which is not advisable when the data size is large and queries are frequent.

Therefore, it is necessary to provide a big data Top-k query method based on an adaptive data set partitioning method.

Disclosure of Invention

The invention aims to provide a big data Top-k query method based on a self-adaptive data set partitioning mode.

In order to achieve the purpose, the invention adopts the following technical scheme:

A big data Top-k query method based on a self-adaptive data set partitioning mode comprises the following steps:

S1: initializing the system, and constructing a hyperplane cluster and a data set;

S2: carrying out self-adaptive division on the data set to obtain a stable k-cut point;

S3: performing a Top-k ordering query on the data set;

S4: adaptively adjusting the system data set and establishing a common data set.

Preferably, step S1 includes:

S101: let the request weight value assigned to the j-th element in the user's query request be p_j, and let P be the column vector formed by the p_j, with

P^T = [p₁, p₂, p₃, …];

S102: let the j-th dimension attribute variable be y_j, and let Y be the column vector formed by the y_j, with

Y^T = [y₁, y₂, y₃, …];

S103: constructing a hyperplane cluster F according to the query request weight vector P, with

F = Y^T*P;

S104: determining the dimensionality of the data set as N and the data of the data set as x_ij.
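As an illustration of S101 to S104, the sketch below evaluates F = Y^T*P for each point of a small data set. The function name `score`, the sample data, and the weight values are hypothetical and chosen only for the example.

```python
from typing import Sequence


def score(point: Sequence[float], weights: Sequence[float]) -> float:
    """Evaluate F = Y^T * P for one data point (Y = point, P = weights)."""
    if len(point) != len(weights):
        raise ValueError("point and weight vector must share the dimension N")
    return sum(y * p for y, p in zip(point, weights))


# Hypothetical 2-dimensional data set x_ij and weight vector P (sums to 1).
data = [(8.0, 2.0), (5.0, 5.0), (1.0, 9.0)]
P = (0.75, 0.25)
scores = [score(x, P) for x in data]
```

Each value of F identifies one member of the hyperplane cluster, namely the hyperplane with normal vector P that passes through the corresponding point.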

Further preferably, step S2 includes:

S201: obtaining the maximum value p_jmax of each dimension of the data set, determining the data set space, and mapping each dimension to the fixed interval [0,10]; the maximum point is M₀ = (x_max,1, x_max,2, x_max,3, …), and M₀ is taken as the initial point;

S202: establishing a virtual coordinate system with N coordinate axes, and placing all data in the coordinate system;

S203: defining the k-cut point M: let M = (m₁, m₂, m₃, …, m_j, …); in an N-dimensional data set, the lines through the k-cut point M parallel to each coordinate axis cut the data set space into 2^N parts; the ratio between the coordinates of M across dimensions is fixed, and 3 kinds of regions appear in the divided data set;

S204: searching for a suitable M with a variable-speed step so that the hot zone formed by the cuts along every dimension contains k data points, which ensures that at least k data points lie outside the hyperplane under any query request weight values;

S205: obtaining a stable k-cut point by the variable-speed step search method.
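The mapping in S201 can be sketched as follows. The text does not give the mapping formula, so this sketch assumes strictly positive attribute values and a simple linear scaling by the per-dimension maximum; `scale_to_interval` and the sample data are hypothetical.

```python
def scale_to_interval(data, upper=10.0):
    """Linearly map every dimension so that its maximum becomes `upper`.

    Assumes strictly positive attribute values. Returns the scaled data
    and the maximum point M0, which need not be a member of the data set.
    """
    n = len(data[0])
    maxima = [max(row[j] for row in data) for j in range(n)]
    scaled = [tuple(upper * row[j] / maxima[j] for j in range(n)) for row in data]
    m0 = tuple(upper for _ in range(n))
    return scaled, m0


# Hypothetical data with very different ranges per dimension.
scaled, m0 = scale_to_interval([(2.0, 50.0), (4.0, 25.0)])
```

In this example M₀ = (10, 10) is not itself a data point, which is permitted since the maximum point only serves as the starting point of the search.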

Further preferably, the 3 regions of the segmented data set include: hot zones, cold zones, and other zones, wherein,

any data point of the hot area is outside a space enclosed by the hyperplane cluster and the positive direction of the coordinate axis;

any data point of the cold area is in a space enclosed by the hyperplane cluster and the positive direction of the coordinate axis;

the other regions are regions of the data set excluding the cold and hot regions.

Preferably, the p_j form the column vector P with P^T = [p₁, p₂, p₃, …], and for every j

p_j ∈ (0,1), ∑p_j = 1.

If the user input weight is not in the (0,1) interval, the user input weight is mapped into the (0,1) interval.
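The text does not say how out-of-range weights are mapped. One possible mapping, shown here purely as an assumption, divides each strictly positive raw weight by the sum of all weights; `normalize_weights` is a hypothetical helper.

```python
def normalize_weights(raw):
    """Map strictly positive raw weights to values that sum to 1.

    Assumption: sum-normalization. With at least two positive weights,
    every result lies strictly inside (0, 1).
    """
    if len(raw) < 2 or any(w <= 0 for w in raw):
        raise ValueError("this sketch needs at least two strictly positive weights")
    total = sum(raw)
    return [w / total for w in raw]


P = normalize_weights([3, 1])
```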

Preferably, the variable-speed step search method includes:

(1) setting an initial step length h₀, a step change rate v, a convergence strength s ≥ 1, and an initial point M₀ = (x_max,1, x_max,2, x_max,3, …), and mapping each dimension coordinate into the (0, 100) range;

(2) letting i = 0, h_i = h₀, and M_i+1 = M_i - h_i; the data set has l data points whose every attribute value is greater than M_i+1, and this portion of the data is stored;

(3) if l > s × k, executing step (4); if k < l < s × k, the calculation is finished and a stable k-cut point is obtained; if l < k, executing step (5);

(4) letting i = i + 1, h_i = v*h_i, and M_i+1 = M_i + h_i, and returning to step (3);

(5) letting i = i + 1, h_i = v*h_i, and M_i+1 = M_i - h_i, and returning to step (3).

Preferably, the initial step length h₀ = 10 and the convergence strength s ≥ 1.
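A minimal sketch of the variable-speed step search follows, under several assumptions that are not in the text: every dimension shares the same (0, 100) range, so the k-cut point is represented by one scalar `m` meaning M = (m, m, …, m); the stopping bounds are taken as inclusive (k ≤ l ≤ s × k); and the step is multiplied by v (assumed here to be 0.5) only when the search direction reverses, rather than on every iteration, so that the search can travel arbitrarily far before narrowing. The names `count_hot` and `k_cut_search` and the `max_iter` guard are hypothetical.

```python
def count_hot(data, m):
    """Count l: the points whose every attribute value is greater than m."""
    return sum(1 for x in data if all(v > m for v in x))


def k_cut_search(data, k, h0=10.0, v=0.5, s=1.5, upper=100.0, max_iter=200):
    """Return (m, l) with k <= l <= s*k, where M = (m, ..., m)."""
    h = h0
    m = upper - h      # step (2): M_1 = M_0 - h_0
    last_dir = -1      # the first move was downward
    for _ in range(max_iter):
        l = count_hot(data, m)
        if k <= l <= s * k:                  # step (3): stable k-cut point
            return m, l
        direction = 1 if l > s * k else -1   # step (4) moves up, step (5) down
        if direction != last_dir:
            h *= v                           # assumption: shrink after overshoot
        m += direction * h
        last_dir = direction
    raise RuntimeError("no stable k-cut point found within max_iter")


# Hypothetical data on the diagonal: (1, 1), (2, 2), ..., (99, 99).
data = [(float(i), float(i)) for i in range(1, 100)]
cut, l = k_cut_search(data, k=3)
```

With this data and k = 3, the first cut at 90 leaves 9 points in the hot zone, which exceeds s × k = 4.5, so the cut moves back up by a halved step to 95, where 4 points remain.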

Preferably, step S3 includes:

S301: receiving query request information, and constructing a hyperplane cluster according to the request dimension weights P: Y^T*P = F; substituting the k-cut point to determine the hyperplane corresponding to the request, Y^T*P = F_i;

S302: calculating the evaluation scores of the data other than the cold zone data according to the weight value vector of the query request, and performing a sorting query by using a Top-k query algorithm.
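S301 and S302 can be sketched as below. The text leaves the Top-k query algorithm open, so this sketch swaps in a bounded min-heap from Python's standard `heapq` module; the function name `top_k`, the sample data, and the cut point are hypothetical.

```python
import heapq


def top_k(data, weights, cut, k):
    """Score all points outside the cold zone and keep the k best.

    A point is treated as cold when every coordinate is <= the matching
    coordinate of the k-cut point `cut`.
    """
    heap = []  # min-heap of (score, point); heap[0] is the current k-th best
    for point in data:
        if all(x <= m for x, m in zip(point, cut)):
            continue  # cold zone data is skipped without being scored
        s = sum(x * p for x, p in zip(point, weights))
        if len(heap) < k:
            heapq.heappush(heap, (s, point))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, point))
    return sorted(heap, reverse=True)


data = [(9.0, 9.0), (8.0, 2.0), (1.0, 1.0), (2.0, 8.0), (3.0, 3.0)]
result = top_k(data, weights=(0.75, 0.25), cut=(4.0, 4.0), k=2)
```

Skipping cold points is safe for positive weights when the hot zone holds at least k points: every hot point scores at least as high as the k-cut point itself, and every cold point scores at most that much.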

Further preferably, step S4 includes:

S401: for a data set whose data changes frequently, adjusting the data segmentation state and the k-cut point M as new data is added to the data set;

S402: creating a history record set of output results, storing the data points output each time and recording how many times each data point has been output; after n queries the results are close to convergence, and the history record set is then used as a Top-k common data set, so that the number of uses is reduced;

S403: recording the hyperplane coefficient vector and the corresponding k-cut point at each output.
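The bookkeeping in S402 and S403 could look like the following sketch. The class name `HistorySet`, its methods, and the `min_count` cutoff for membership in the common data set are all hypothetical; the text does not define when a data point counts as common.

```python
from collections import Counter


class HistorySet:
    """Track how often each data point appears in Top-k outputs."""

    def __init__(self):
        self.counts = Counter()  # data point -> number of times output (S402)
        self.log = []            # (weight vector, k-cut point) per output (S403)

    def record(self, weights, cut, results):
        self.counts.update(results)
        self.log.append((tuple(weights), tuple(cut)))

    def common(self, min_count):
        """Points output at least `min_count` times (assumed cutoff)."""
        return {p for p, c in self.counts.items() if c >= min_count}


history = HistorySet()
history.record((0.5, 0.5), (4, 4), [(9, 9), (8, 2)])
history.record((0.25, 0.75), (4, 4), [(9, 9), (2, 8)])
frequent = history.common(min_count=2)
```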

Further preferably, adjusting according to the addition of new data to the data set comprises:

(1) comparing the attribute data of the incoming data in each dimension with the attribute data of the k-cut point M in the same dimension:

if, for every dimension j, p_j-new ≥ p_j-M, then the data point falls within the hot zone data range;

if, for every dimension j, p_j-new ≤ p_j-M, then the data point falls within the cold zone data range;

otherwise, the data point falls within the other region data range;

(2) if the increase in the amount of hot zone data exceeds a predetermined threshold of the total amount of data in that zone, returning to step S205 and continuing the variable-speed step search until the convergence condition is satisfied.
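The comparison and the re-search trigger above can be sketched as follows. The names `classify` and `needs_research` are hypothetical, the 0.2 default mirrors the 20% figure used in the specific embodiment, and reading the threshold as "newly added hot points relative to the previous hot zone size" is an assumption.

```python
def classify(point, cut):
    """Assign an incoming point to the hot, cold, or other region.

    A point equal to the k-cut point satisfies both tests; this sketch
    checks the hot zone first, which is an arbitrary tie-break.
    """
    if all(x >= m for x, m in zip(point, cut)):
        return "hot"
    if all(x <= m for x, m in zip(point, cut)):
        return "cold"
    return "other"


def needs_research(hot_before, hot_added, threshold=0.2):
    """True when the hot zone grew by more than `threshold` of its size."""
    return hot_added > threshold * hot_before


regions = [classify(p, (5, 5)) for p in [(9, 8), (1, 2), (9, 1)]]
```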

The invention has the following beneficial effects:

The invention provides a big data Top-k query method based on a self-adaptive data set division mode, which is suitable for big data Top-k query in a cloud environment.

Drawings

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

FIG. 1 shows a flowchart of the Top-k sorting query method based on the adaptive data set partitioning mode.

FIG. 2 illustrates the definition of the k-cut point in the two-dimensional case.

FIG. 3 shows a flowchart of the variable-speed step search method.

Detailed Description

In order to more clearly illustrate the invention, the invention is further described below with reference to preferred embodiments and the accompanying drawings. Similar parts in the figures are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.

The invention provides a big data Top-k query method based on a self-adaptive data set partitioning mode, which comprises the following steps:

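S1: initializing the system, and constructing a hyperplane cluster and a data set.

S101: let the request weight value assigned to the j-th element in the user query request be p_j, and let P be the column vector formed by the p_j, with P^T = [p₁, p₂, p₃, …];

S102: let the j-th dimension attribute variable be y_j, and let Y be the column vector formed by the y_j, with

Y^T = [y₁, y₂, y₃, …];

S103: constructing a hyperplane cluster F according to the query request weight vector P, with

F = Y^T*P.

S2: carrying out self-adaptive division on the data set to obtain a stable k-cut point.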

In the present invention, the 3 kinds of regions that appear in the segmented data set are a hot zone, a cold zone, and other regions: any data point of the hot zone is outside the space enclosed by the hyperplane cluster and the positive directions of the coordinate axes; any data point of the cold zone is inside that space; the other regions are the regions of the data set excluding the cold and hot zones.

The p_j form the column vector P with P^T = [p₁, p₂, p₃, …], and for every j

p_j ∈ (0,1), ∑p_j = 1.

The variable-speed step search method comprises the following steps:

(4) letting i = i + 1, h_i = v*h_i, and M_i+1 = M_i + h_i, and returning to step (3);

(5) letting i = i + 1, h_i = v*h_i, and M_i+1 = M_i - h_i, and returning to step (3);

wherein the initial step length h₀ = 10 and the convergence strength s ≥ 1.

S3: performing a Top-k sorting query on the data set.

S301: receiving query request information, and constructing a hyperplane cluster according to the request dimension weights P: Y^T*P = F; substituting the k-cut point to determine the hyperplane corresponding to the request, Y^T*P = F_i;

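Adjusting according to the addition of new data to the data set includes:

if, for every dimension j, p_j-new ≥ p_j-M, the data point falls within the hot zone data range;

if, for every dimension j, p_j-new ≤ p_j-M, the data point falls within the cold zone data range;

otherwise, the data point falls within the other region data range.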

The following description is given with reference to a specific embodiment.

In most Top-k user requests the request weight values are all greater than 0, and the attribute values in the data set are positive. The method mainly addresses big data Top-k queries under these conditions; cases with negative values can be converted to this form by a specific method.

1. Let p_j denote the request weight value assigned to the j-th element in a user query request; the p_j form the column vector P with P^T = [p₁, p₂, p₃, …], and for every j

p_j ∈ (0,1), ∑p_j = 1.

In practical application, if the user input weight is not in the (0,1) interval, it first needs to be mapped into the [0,1] interval; most Top-k user requests have request weight values greater than 0.

2. Let the j-th dimension attribute variable be y_j, and let Y be the column vector formed by the y_j, Y^T = [y₁, y₂, y₃, …].

3. Let Y^T*P = F be the hyperplane cluster constructed according to the weight vector P of the query request, where F is an unknown parameter; substituting any data point gives a value of F and determines the hyperplane expression under that query request, Y^T*P = F_request-i.

4. Determine the dimensionality of the data set as N, and denote the data of the data set as x_ij.

5. As shown in FIG. 2, obtain the maximum value p_jmax of each dimension of the data set, determine the data set space (DataSet Space), and map each dimension to the fixed interval [0,10]; let M₀ = (x_max,1, x_max,2, x_max,3, …) be the maximum point, which does not necessarily exist in the data set, and take M₀ as the initial point.

6. Establish a virtual coordinate system with N coordinate axes, and place all data in the coordinate system.

7. Define the k-cut point M: let M = (m₁, m₂, m₃, …, m_j, …); in an N-dimensional data set, the lines through the k-cut point M parallel to each coordinate axis cut the data set space into 2^N parts, and the ratio between the coordinates of M across dimensions is fixed; that is, M can be regarded as a point moving along the line from the origin O to the maximum point M₀. The hyperplanes passing through the k-cut point form a hyperplane cluster, which is limited by the query weight values: for every j

p_j ∈ (0,1), ∑p_j = 1.

The segmented data set then has 3 kinds of regions:

(1) the part in which every data point lies outside the space enclosed by the hyperplane cluster and the positive directions of the coordinate axes is called the "hot zone"; hot zone data has a great influence on all Top-k queries based on the current k-cut point M;

(2) the part in which every data point lies inside the space enclosed by the hyperplane cluster and the positive directions of the coordinate axes is called the "cold zone"; cold zone data has little influence on all Top-k queries based on the current k-cut point M, and this data hardly ever enters the Top-k ranking range;

(3) the remaining data, in neither the hot zone nor the cold zone, forms the other regions.

8. Search for a suitable M with a variable-speed step so that the hot zone formed by the cuts along every dimension contains k data points, which ensures that, under any query request weight values, at least k data points outside the hyperplane are available to the Top-k query method.

9. As shown in FIG. 3, the variable-speed step search includes:

(1) setting an initial step length h₀, usually 10, a step change rate v, and a convergence strength s ≥ 1, with the value generally in the (1.5,2) interval, taking the initial point M₀ = (x_max,1, x_max,2, x_max,3, …), and mapping each dimension coordinate into the (0, 100) range;

(2) i = 0, h_i = h₀, M_i+1 = M_i - h_i; the data set has l data points whose every attribute value is greater than M_i+1, and this portion of the data is stored;

(3) judging: if l > s × k, go to step (4); if k < l < s × k, the algorithm finishes and a stable k-cut point is obtained; if l < k, go to step (5);

(4) i = i + 1, h_i = v*h_i, M_i+1 = M_i + h_i, and return to step (3);

(5) i = i + 1, h_i = v*h_i, M_i+1 = M_i - h_i, and return to step (3).

10. For a data set whose data changes frequently, the data segmentation state and the k-cut point M need to be adjusted as new data is added to the data set. The process is as follows:

(1) the attribute data of the incoming data in each dimension is compared with the attribute data of the k-cut point M in the same dimension:

a. if, for every dimension j, p_j-new ≥ p_j-M, then the data point falls within the "hot zone" data range;

b. if, for every dimension j, p_j-new ≤ p_j-M, then the data point falls within the "cold zone" data range;

c. if neither condition holds, the data point belongs to the other data regions;

(2) across the three cases above, when the amount of hot zone data has increased by more than 20% of the total amount of data in that zone, return to the variable-speed step search and continue until the convergence condition is satisfied.

11. Remove the "cold zone" data from the divided data, and then perform the Top-k sorting.

12. Receive the query request information, and construct a hyperplane cluster according to the request dimension weights P:

Y^T*P = F.

13. Substitute the k-cut point to determine the hyperplane corresponding to the request, Y^T*P = F_i.

14. Create a k × N+1 list, where N is the total number of dimensions. Calculate the score of each data item in area A under the weight vector P; after each calculation, put the data into the list one by one, keeping the list in ascending order of score. When the number of data items in the list reaches k, compare each newly calculated score with the score of the first item in the list. If the new score is higher, compare it with the following entries in turn until an entry with a higher score is encountered or the last entry in the list is reached; if the new score is lower than that of the first item, discard the data. Continue until all data has been calculated, at which point the Top-k method stops.

15. All node results are sent together to the summarizing task distribution node, where the scores of the Top-k results are compared uniformly to obtain the final Top-k result, which is sent to the user.

16. Create a history record set of output results, store the data points output each time, and record how many times each data point has been output. After n queries the results are close to convergence, and the history record set is then used as a Top-k common data set, so that the number of uses is reduced.

17. Record the hyperplane coefficient vector and the corresponding k-cut point at each output.
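The summarizing step in item 15 can be sketched as a merge of per-node partial results. This is an illustration only: `merge_node_results` and the sample node outputs are hypothetical, and `heapq.nlargest` stands in for whatever comparison the summarizing node actually performs.

```python
import heapq


def merge_node_results(node_results, k):
    """Combine per-node Top-k lists of (score, point) into a global Top-k."""
    candidates = [pair for result in node_results for pair in result]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])


# Hypothetical partial results from two nodes, each already a local Top-2.
node_a = [(9.0, "a1"), (6.5, "a2")]
node_b = [(8.0, "b1"), (7.0, "b2")]
final = merge_node_results([node_a, node_b], k=2)
```

Because each node returns its own k best items, the global Top-k is always contained in the union of the local results.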

It should be understood that the above-mentioned embodiments of the present invention are only examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention, and it will be obvious to those skilled in the art that other variations or modifications may be made on the basis of the above description, and all embodiments may not be exhaustive, and all obvious variations or modifications may be included within the scope of the present invention.

Claims

1. A big data Top-k query method based on a self-adaptive data set partitioning mode is characterized by comprising the following steps:

the step S1 includes:

Y^T＝[y₁,y₂,y₃,…]；

F＝Y^T*P；

S104: determining the dimensionality of a data set as N and the data of the data set as x_ij；

the step S2 includes:

the 3 regions of occurrence of the segmented data set include: hot zones, cold zones, and other zones, wherein,

the other areas are areas except the cold area and the hot area in the data set;

S205: obtaining a stable k-cut point by a variable-speed step search method;

said p_j denotes the request weight value assigned to the j-th element in a user query request; the p_j form the column vector P with P^T = [p₁, p₂, p₃, …], and for every j

p_j ∈ (0,1), ∑p_j = 1;

if the user input weight is not in the (0,1) interval, mapping the user input weight into the (0,1) interval;

S3: performing a Top-k ordering query on the data set;

the step S3 includes:

S302: calculating the evaluation scores of the data other than the cold zone data according to the query request weight value vector, and performing a sorting query by using a Top-k query algorithm;

2. The big data Top-k query method according to claim 1, wherein the variable speed step search method comprises:

(2) letting i = 0, h_i = h₀, and M_i+1 = M_i - h_i; the data set has l data points whose every attribute value is greater than M_i+1, and this portion of the data is stored;

(3) if l > s × k, executing step (4); if k < l < s × k, finishing the calculation and obtaining a stable k-cut point; if l < k, executing step (5);

(4) letting i = i + 1, h_i = v*h_i, and M_i+1 = M_i + h_i, and returning to step (3);

(5) letting i = i + 1, h_i = v*h_i, and M_i+1 = M_i - h_i, and returning to step (3).

3. The big data Top-k query method according to claim 2, wherein the initial step length h₀ = 10 and the convergence strength s ≥ 1.

4. The big data Top-k query method according to claim 1, wherein the step S4 includes:

S401: for a data set whose data changes frequently, adjusting the data segmentation state and the k-cut point M as new data is added to the data set;

5. The big-data Top-k query method according to claim 4, wherein the adjusting according to the increase of new data in the data set comprises:

if, for every dimension j, p_j-new ≥ p_j-M, then the data point falls within the hot zone data range;

if, for every dimension j, p_j-new ≤ p_j-M, then the data point falls within the cold zone data range, where p_j-new represents the request weight value assigned to the j-new element in the user query request, and p_j-M represents the request weight value assigned to the j-M element in the user query request;

otherwise, the data point falls within the other region data range;