CN108304449B - Big data Top-k query method based on self-adaptive data set partitioning mode


Publication number
CN108304449B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201711305053.4A
Other languages
Chinese (zh)
Other versions
CN108304449A (en)
Inventor
徐维祥
赵博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Application filed by Beijing Jiaotong University
Priority to CN201711305053.4A
Publication of CN108304449A
Application granted
Publication of CN108304449B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big data Top-k query method based on a self-adaptive data set partitioning mode, which comprises the following steps: initializing the system, and constructing a hyperplane cluster and a data set; adaptively partitioning the data set to obtain a stable k-cut point; performing a Top-k sorting query on the data set; and adaptively adjusting the system data set and establishing a common data set. The method is suitable for big data Top-k queries in a cloud environment.

Description

Big data Top-k query method based on self-adaptive data set partitioning mode
Technical Field
The invention relates to a Top-k query method, and more particularly to a big data Top-k query method based on an adaptive data set partitioning mode.
Background
Distributed Top-k queries are attracting increasing interest as data volumes grow. In a distributed Top-k (top k items) query, a central computing node aggregates data lists distributed across different geographic locations and computes the k objects with the largest global aggregate values, together with those values. Each item in a data list is a data pair <object, object value>, and both the object and the object value in the pair contain sensitive information of the data provider. Distributed Top-k query computation is widely applied in network and system monitoring, information acquisition, sensor networks, P2P systems, data flow control systems and similar technical fields.
From the perspective of data partitioning, the Top-k problem in a distributed environment falls into two major categories: vertical partitioning and horizontal partitioning. In vertical partitioning, data is partitioned by attribute, similar to the column storage mode of a relational database; this partitioning mode was mostly used in early distributed Top-k query research. Many useful research efforts on the Top-k query problem have been undertaken in recent years. However, both relational databases and traditional distributed environments struggle to cope effectively with Top-k queries in a big data environment, mainly because the data objects and processing methods have changed greatly.
At present, a big data environment mainly means a cloud environment, and the basic principle of data partitioning in the cloud environment is that the data is divided as evenly as possible across the servers. This uniformity concerns not only the amount of data; more importantly, for a particular application, the partitioning should ensure that the data on each server contributes as much as possible to the final result. Representative horizontal partitioning modes in the Top-k domain are random partitioning, grid-based partitioning, angle-based partitioning, and hyperplane-based partitioning. Big data Top-k queries in a cloud environment face new challenges. The Top-k problem has a very direct solution under the MapReduce framework: sort the data with MapReduce and then return the top k values. This scheme fits the batch-processing character of MapReduce and is easy to implement, but its biggest defect is an overlong processing time. Every time a new query arrives, all data must be processed again, which is not advisable when the data size is large and queries are frequent.
Therefore, it is necessary to provide a big data Top-k query method based on an adaptive data set partitioning method.
Disclosure of Invention
The invention aims to provide a big data Top-k query method based on a self-adaptive data set partitioning mode.
In order to achieve the purpose, the invention adopts the following technical scheme:
A big data Top-k query method based on a self-adaptive data set partitioning mode comprises the following steps:
S1: initializing the system, and constructing a hyperplane cluster and a data set;
S2: adaptively partitioning the data set to obtain a stable k-cut point;
S3: performing a Top-k sorting query on the data set;
S4: adaptively adjusting the system data set and establishing a common data set.
Preferably, step S1 includes:
S101: setting the request weight value assigned to the j-th element in the user query request as $p_j$, the column vector formed by all $p_j$ being $P$, with
$P^T = [p_1, p_2, p_3, \dots]$;
S102: setting the j-th dimension attribute variable as $y_j$, the column vector formed by all $y_j$ being $Y$, with
$Y^T = [y_1, y_2, y_3, \dots]$;
S103: constructing a hyperplane cluster $F$ according to the query request weight vector $P$, with
$F = Y^T P$;
S104: determining the dimensionality of the data set as $N$ and denoting the data of the data set by $x_{ij}$.
Further preferably, step S2 includes:
S201: obtaining the maximum value $p_{j\max}$ of each dimension of the data set, determining the data set space, and mapping each dimension onto the fixed interval $[0, 10]$; the maximum point is denoted $M_0$, and $M_0 = (x_{\max,1}, x_{\max,2}, x_{\max,3}, \dots)$ is taken as the initial point;
S202: establishing a virtual coordinate system with $N$ coordinate axes, and placing all data in the coordinate system;
S203: defining the k-cut point $M$: let $M = (m_1, m_2, m_3, \dots, m_j, \dots)$; in an $N$-dimensional data set, lines drawn through the k-cut point $M$ parallel to each coordinate axis cut the data set space into $2^N$ parts; the proportion between the coordinates of the k-cut point $M$ in the individual dimensions is fixed, and 3 kinds of regions appear in the partitioned data set;
S204: searching for a suitable $M$ with a variable speed step length, so that the hot area data formed by the cutting lines of all dimensions contains k data points, which ensures that at least k data points lie outside the hyperplane for any query request weight values;
S205: obtaining a stable k-cut point by the variable speed step search method.
Further preferably, the 3 regions of the segmented data set include: hot zones, cold zones, and other zones, wherein,
any data point of the hot area is outside a space enclosed by the hyperplane cluster and the positive direction of the coordinate axis;
any data point of the cold area is in a space enclosed by the hyperplane cluster and the positive direction of the coordinate axis;
the other regions are regions of the data set excluding the cold and hot regions.
Preferably, the $p_j$ form the column vector $P$ with $P^T = [p_1, p_2, p_3, \dots]$, and every $p_j$ satisfies
$p_j \in (0, 1)$, $\sum p_j = 1$,
Figure BDA0001501824060000033
If the user input weight is not in the (0,1) interval, the user input weight is mapped into the (0,1) interval.
Preferably, the variable speed step search method includes:
(1) setting an initial step length $h_0$, a step change rate $v$, a convergence strength $s \ge 1$, and the initial point $M_0 = (x_{\max,1}, x_{\max,2}, x_{\max,3}, \dots)$, and mapping each dimension coordinate into the range (0, 100);
(2) letting $i = 0$, $h_i = h_0$, $M_{i+1} = M_i - h_i$; the data set has $l$ data points whose every attribute value is greater than that of $M_{i+1}$, and this portion of the data is stored;
(3) if $l > s \times k$, executing step (4); if $k < l < s \times k$, the calculation is finished and a stable k-cut point is obtained; if $l < k$, executing step (5);
(4) letting $i = i + 1$, $h_i = v \cdot h_i$, $M_{i+1} = M_i + h_i$, and returning to step (3);
(5) letting $i = i + 1$, $h_i = v \cdot h_i$, $M_{i+1} = M_i - h_i$, and returning to step (3).
Preferably, the initial step length $h_0 = 10$ and the convergence strength $s \ge 1$.
Preferably, step S3 includes:
S301: receiving query request information, and constructing a hyperplane cluster according to the request dimension weight $P$: $Y^T P = F$; substituting the k-cut point to determine the hyperplane $Y^T P = F_i$ corresponding to the request;
S302: calculating the evaluation scores of the data other than the cold area data according to the query request weight vector, and performing the sorting query with a Top-k query algorithm.
Further preferably, step S4 includes:
S401: for a data set whose data changes frequently, adjusting the data partitioning state and the k-cut point $M$ as new data is added to the data set;
S402: creating a history record set of output results, storing the data points output each time and recording how many times each data point has been output; after n queries the results are close to convergence, and the history record set then serves as a Top-k common data set, which reduces the number of uses;
S403: recording the hyperplane coefficient vector and the corresponding k-cut point at each output.
Further preferably, adjusting according to the addition of new data to the data set comprises:
(1) comparing each dimension attribute of the incoming data with the corresponding dimension attribute of the k-cut point $M$:
if, for every dimension $j$, the data satisfies $p_{j\text{-new}} \ge p_{j\text{-}M}$, then the data point falls within the hot zone data range;
if, for every dimension $j$, the data satisfies $p_{j\text{-new}} \le p_{j\text{-}M}$, then the data point falls within the cold zone data range;
otherwise, the data point falls within the data range of the other regions;
(2) if the amount of hot zone data increases beyond a predetermined threshold of the total amount of data in that zone, returning to step S205 and continuing the variable speed step search until the convergence condition is satisfied.
The invention has the following beneficial effects:
the invention provides a big data Top-k query method based on a self-adaptive data set division mode, which is suitable for big data Top-k query in a cloud environment.
Drawings
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
FIG. 1 shows a flowchart of the Top-k sorting query method based on the adaptive data set partitioning mode.
FIG. 2 shows a graphical representation of the definition of the k-cut point in the two-dimensional case.
FIG. 3 shows a flowchart of the variable speed step search method.
Detailed Description
In order to more clearly illustrate the invention, the invention is further described below with reference to preferred embodiments and the accompanying drawings. Similar parts in the figures are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.
The invention provides a big data Top-k query method based on a self-adaptive data set partitioning mode, which comprises the following steps:
S1: initializing the system, and constructing a hyperplane cluster and a data set.
S101: setting the request weight value assigned to the j-th element in the user query request as $p_j$, the column vector formed by all $p_j$ being $P$, with
$P^T = [p_1, p_2, p_3, \dots]$;
S102: setting the j-th dimension attribute variable as $y_j$, the column vector formed by all $y_j$ being $Y$, with
$Y^T = [y_1, y_2, y_3, \dots]$;
S103: constructing a hyperplane cluster $F$ according to the query request weight vector $P$, with
$F = Y^T P$;
S104: determining the dimensionality of the data set as $N$ and denoting the data of the data set by $x_{ij}$.
S2: adaptively partitioning the data set to obtain a stable k-cut point.
S201: obtaining the maximum value $p_{j\max}$ of each dimension of the data set, determining the data set space, and mapping each dimension onto the fixed interval $[0, 10]$; the maximum point is denoted $M_0$, and $M_0 = (x_{\max,1}, x_{\max,2}, x_{\max,3}, \dots)$ is taken as the initial point;
S202: establishing a virtual coordinate system with $N$ coordinate axes, and placing all data in the coordinate system;
S203: defining the k-cut point $M$: let $M = (m_1, m_2, m_3, \dots, m_j, \dots)$; in an $N$-dimensional data set, lines drawn through the k-cut point $M$ parallel to each coordinate axis cut the data set space into $2^N$ parts; the proportion between the coordinates of the k-cut point $M$ in the individual dimensions is fixed, and 3 kinds of regions appear in the partitioned data set;
S204: searching for a suitable $M$ with a variable speed step length, so that the hot area data formed by the cutting lines of all dimensions contains k data points, which ensures that at least k data points lie outside the hyperplane for any query request weight values;
S205: obtaining a stable k-cut point by the variable speed step search method.
In the present invention, the 3 kinds of regions that appear in the partitioned data set are the hot area, the cold area, and the other areas. Any data point of the hot area lies outside the space enclosed by the hyperplane cluster and the positive directions of the coordinate axes; any data point of the cold area lies inside the space enclosed by the hyperplane cluster and the positive directions of the coordinate axes; the other areas are the parts of the data set that belong to neither the cold area nor the hot area.
The $p_j$ form the column vector $P$ with $P^T = [p_1, p_2, p_3, \dots]$, and every $p_j$ satisfies
$p_j \in (0, 1)$, $\sum p_j = 1$,
Figure BDA0001501824060000053
If the user input weight is not in the (0,1) interval, the user input weight is mapped into the (0,1) interval.
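The weight constraints and the score $F = Y^T P$ can be illustrated with a short sketch. The text does not say how out-of-range weights are mapped into (0, 1); dividing each positive weight by the sum of all weights is only one possible choice and is an assumption here, as are the function names `normalize_weights` and `score`.

```python
def normalize_weights(raw_weights):
    """Rescale positive weights so that each lies in (0, 1) and they sum to 1.

    Assumption: division by the total is used as the mapping; the source
    text does not prescribe a particular mapping.
    """
    if len(raw_weights) < 2:
        raise ValueError("need at least two weights to stay strictly inside (0, 1)")
    if any(w <= 0 for w in raw_weights):
        raise ValueError("weights must be positive")
    total = sum(raw_weights)
    return [w / total for w in raw_weights]


def score(point, weights):
    """Evaluate F = Y^T * P for one data point (weighted sum of attributes)."""
    return sum(y * p for y, p in zip(point, weights))


weights = normalize_weights([2, 3, 5])
print(weights, score((1, 2, 3), weights))
```

With normalized weights, every query defines one hyperplane of the cluster, and the score of a data point is the value of $F$ for the hyperplane passing through it.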
The variable speed step search method comprises the following steps:
(1) setting an initial step length $h_0$, a step change rate $v$, a convergence strength $s \ge 1$, and the initial point $M_0 = (x_{\max,1}, x_{\max,2}, x_{\max,3}, \dots)$, and mapping each dimension coordinate into the range (0, 100);
(2) letting $i = 0$, $h_i = h_0$, $M_{i+1} = M_i - h_i$; the data set has $l$ data points whose every attribute value is greater than that of $M_{i+1}$, and this portion of the data is stored;
(3) if $l > s \times k$, executing step (4); if $k < l < s \times k$, the calculation is finished and a stable k-cut point is obtained; if $l < k$, executing step (5);
(4) letting $i = i + 1$, $h_i = v \cdot h_i$, $M_{i+1} = M_i + h_i$, and returning to step (3);
(5) letting $i = i + 1$, $h_i = v \cdot h_i$, $M_{i+1} = M_i - h_i$, and returning to step (3).
Here the initial step length $h_0 = 10$ and the convergence strength $s \ge 1$.
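Steps (1) to (5) can be sketched as follows. This is a minimal illustration under several assumptions that the text does not state: all dimensions are already mapped to a common range, so the same scalar step can be subtracted from every coordinate of $M$; the step is multiplied by $v < 1$ only when the search direction reverses; the boundary cases $l = k$ and $l = s \times k$ count as stable; and an iteration cap guards against data for which no stable point exists. The names `count_hot` and `find_k_cut` are hypothetical.

```python
def count_hot(points, cut):
    """Number of points whose every coordinate exceeds that of the cut point."""
    return sum(all(x > c for x, c in zip(p, cut)) for p in points)


def find_k_cut(points, k, h0=10.0, v=0.5, s=1.8, max_iter=1000):
    """Search along the line through M0 for a cut point with k <= l <= s*k."""
    dims = len(points[0])
    # Initial point M0: per-dimension maximum (need not be a real data point).
    m0 = [max(p[j] for p in points) for j in range(dims)]
    offset = h0      # how far M currently sits below M0 on every axis
    step = h0
    last_dir = -1    # -1: M moved toward the origin, +1: M moved toward M0
    for _ in range(max_iter):
        cut = [m - offset for m in m0]
        l = count_hot(points, cut)
        if k <= l <= s * k:
            return cut, l                      # stable k-cut point
        direction = 1 if l > s * k else -1     # too many hot points: move M up
        if direction != last_dir:
            step *= v                          # assumption: shrink on reversal only
        offset -= direction * step
        last_dir = direction
    raise RuntimeError("no stable k-cut point found within max_iter iterations")


pts = [(i, i) for i in range(1, 101)]
cut, l = find_k_cut(pts, k=3, s=1.5)
print(cut, l)
```

A larger convergence strength $s$ widens the accepted band of hot-zone sizes, so the search stops sooner at the cost of scoring more candidate points per query.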
S3: performing a Top-k sorting query on the data set.
S301: receiving query request information, and constructing a hyperplane cluster according to the request dimension weight $P$: $Y^T P = F$; substituting the k-cut point to determine the hyperplane $Y^T P = F_i$ corresponding to the request;
S302: calculating the evaluation scores of the data other than the cold area data according to the query request weight vector, and performing the sorting query with a Top-k query algorithm.
S4: adaptively adjusting the system data set and establishing a common data set.
S401: for a data set whose data changes frequently, adjusting the data partitioning state and the k-cut point $M$ as new data is added to the data set;
S402: creating a history record set of output results, storing the data points output each time and recording how many times each data point has been output; after n queries the results are close to convergence, and the history record set then serves as a Top-k common data set, which reduces the number of uses;
S403: recording the hyperplane coefficient vector and the corresponding k-cut point at each output.
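The bookkeeping of S402 and S403 might look like the sketch below. The class `QueryHistory`, its method names, and the use of string identifiers for data points are all hypothetical; the text only states that output points, their output counts, the hyperplane coefficient vector and the k-cut point are recorded.

```python
from collections import Counter


class QueryHistory:
    """History record set: output counts per data point plus a per-query log."""

    def __init__(self):
        self.output_counts = Counter()  # data point id -> times it was returned
        self.log = []                   # (weight vector, k-cut point) per query

    def record(self, weights, k_cut, result_ids):
        """Store one query's coefficient vector, k-cut point and output points."""
        self.output_counts.update(result_ids)
        self.log.append((tuple(weights), tuple(k_cut)))

    def common_set(self, size):
        """Most frequently returned data points, usable as a common data set."""
        return [pid for pid, _ in self.output_counts.most_common(size)]


history = QueryHistory()
history.record([0.5, 0.5], [9, 9], ["a", "b"])
history.record([0.3, 0.7], [9, 9], ["b", "c"])
print(history.common_set(1))
```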
Adjusting according to the addition of new data to the data set includes:
(1) comparing each dimension attribute of the incoming data with the corresponding dimension attribute of the k-cut point $M$:
if, for every dimension $j$, the data satisfies $p_{j\text{-new}} \ge p_{j\text{-}M}$, then the data point falls within the hot zone data range;
if, for every dimension $j$, the data satisfies $p_{j\text{-new}} \le p_{j\text{-}M}$, then the data point falls within the cold zone data range;
otherwise, the data point falls within the data range of the other regions;
(2) if the amount of hot zone data increases beyond a predetermined threshold of the total amount of data in that zone, returning to step S205 and continuing the variable speed step search until the convergence condition is satisfied.
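The comparison against the k-cut point can be sketched as below. The function names are hypothetical. Two points are assumptions: a point equal to $M$ in every dimension satisfies both conditions and is treated as hot here, and the threshold is read as the ratio of newly added hot points to the hot-zone size before the additions (the 20% default mirrors the value given in the embodiment).

```python
def classify(point, k_cut):
    """Assign an incoming point to the hot, cold or other region."""
    if all(x >= m for x, m in zip(point, k_cut)):
        return "hot"    # at least as large as M in every dimension
    if all(x <= m for x, m in zip(point, k_cut)):
        return "cold"   # at most as large as M in every dimension
    return "other"


def needs_new_search(hot_added, hot_total, threshold=0.2):
    """True when hot-zone growth exceeds the threshold and S205 should rerun."""
    return hot_total > 0 and hot_added / hot_total > threshold


print(classify((8, 9), (5, 5)), classify((1, 2), (5, 5)), classify((8, 2), (5, 5)))
```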
The following description is given with reference to a specific embodiment.
In most Top-k user requests the request weight values are all greater than 0 and the attribute values in the data set are positive. The method mainly addresses Top-k big data queries under these conditions; negative values can be converted through a specific method.
1. Let $p_j$ denote the request weight value assigned to the j-th element in a user query request. The $p_j$ form the column vector $P$ with $P^T = [p_1, p_2, p_3, \dots]$, and every $p_j$ satisfies
$p_j \in (0, 1)$, $\sum p_j = 1$,
Figure BDA0001501824060000064
In practical application, if a user-supplied weight is not in the interval (0, 1), it must first be mapped into the interval [0, 1]; most Top-k user requests have request weight values greater than 0.
2. Let the j-th dimension attribute variable be $y_j$; $Y$ is the column vector formed by all $y_j$, with $Y^T = [y_1, y_2, y_3, \dots]$.
3. Let $Y^T P = F$ be the hyperplane cluster constructed according to the query request weight vector $P$, where $F$ is an unknown parameter; substituting any data point gives the value of $F$ and determines the hyperplane expression $Y^T P = F_{\text{request-}i}$ under the query request.
4. Determine the dimensionality of the data set as $N$, and denote the data of the data set by $x_{ij}$.
5. As shown in FIG. 2, obtain the maximum value $p_{j\max}$ of each dimension of the data set, determine the data set space (DataSet Space), and map each dimension onto the fixed interval $[0, 10]$. Let $M_0 = (x_{\max,1}, x_{\max,2}, x_{\max,3}, \dots)$ be the maximum point; this point does not necessarily exist in the data set. $M_0$ is taken as the initial point.
6. Establish a virtual coordinate system with $N$ coordinate axes, and place all data in the coordinate system.
7. Define the k-cut point $M$: let $M = (m_1, m_2, m_3, \dots, m_j, \dots)$. In an $N$-dimensional data set, lines drawn through the k-cut point $M$ parallel to each coordinate axis cut the data set space into $2^N$ parts. The proportion between the coordinates of the k-cut point $M$ in the individual dimensions is fixed, that is, $M$ can be regarded as a point moving on the line connecting the origin $O$ and the maximum point $M_0$. The hyperplanes passing through the k-cut point form a hyperplane cluster, which is constrained by the query weight values: every $p_j$ satisfies
$p_j \in (0, 1)$, $\sum p_j = 1$,
Figure BDA0001501824060000072
The partitioned data set has 3 kinds of regions:
(1) the part in which every data point lies outside the space enclosed by the hyperplane cluster and the positive directions of the coordinate axes is called the "hot zone"; hot zone data has a great influence in all Top-k queries based on the current k-cut point $M$;
(2) the part in which every data point lies inside the space enclosed by the hyperplane cluster and the positive directions of the coordinate axes is called the "cold zone"; cold zone data has little influence in all Top-k queries based on the current k-cut point $M$ and hardly ever enters the Top-k sorting data range;
(3) the remaining data, belonging to neither the hot zone nor the cold zone, forms the other regions.
8. Search for a suitable $M$ with a variable speed step length, so that the hot zone data formed by the cutting lines of all dimensions contains k data points, which ensures that at least k data points lie outside the hyperplane and are available to the Top-k query method for any query request weight values.
9. As shown in FIG. 3, the variable speed step search includes:
(1) set an initial step length $h_0$, usually 10, a step change rate $v$, and a convergence strength $s \ge 1$, whose value generally lies in the interval (1.5, 2); take $M_0 = (x_{\max,1}, x_{\max,2}, x_{\max,3}, \dots)$ as the initial point and map each dimension coordinate into the range (0, 100);
(2) $i = 0$, $h_i = h_0$, $M_{i+1} = M_i - h_i$; the data set has $l$ data points whose every attribute value is greater than that of $M_{i+1}$, and this portion of the data is stored;
(3) judge: if $l > s \times k$, go to step (4); if $k < l < s \times k$, the algorithm finishes and a stable k-cut point is obtained; if $l < k$, go to step (5);
(4) $i = i + 1$, $h_i = v \cdot h_i$, $M_{i+1} = M_i + h_i$, and return to step (3);
(5) $i = i + 1$, $h_i = v \cdot h_i$, $M_{i+1} = M_i - h_i$, and return to step (3).
10. For a data set whose data changes frequently, the data partitioning state and the k-cut point $M$ need to be adjusted as new data is added to the data set. The process is as follows:
(1) each dimension attribute of the incoming data is compared with the corresponding dimension attribute of the k-cut point $M$:
a. if, for every dimension $j$, the data satisfies $p_{j\text{-new}} \ge p_{j\text{-}M}$, then the data point falls within the "hot zone" data range;
b. if, for every dimension $j$, the data satisfies $p_{j\text{-new}} \le p_{j\text{-}M}$, then the data point falls within the "cold zone" data range;
c. if neither of the two conditions holds, the data point belongs to the other data regions;
(2) across the above three cases, when the amount of hot zone data increases by more than 20% of the total amount of data in that zone, return to the variable speed step search and continue until the convergence condition is satisfied.
11. Remove the "cold zone" data from the partitioned data, and then perform the Top-k sorting.
12. Receive the query request information, and construct a hyperplane cluster according to the request dimension weight $P$:
$Y^T P = F$.
13. Substitute the k-cut point to determine the hyperplane $Y^T P = F_i$ corresponding to the request.
14. Create a k × N + 1 list, where N is the total number of dimensions. Calculate the score of each data item in area A under the weight vector $P$; after each calculation, put the data item into the list, keeping the list in ascending order of score. Once the list holds k items, compare each newly calculated score with the score of the first item in the list: if the new score is higher, compare it with the following items in turn until an item with a higher score is encountered or the last item in the list is reached; if the new score is lower than that of the first item, discard the data item. Continue until all data has been calculated, at which point the Top-k method stops.
15. All node results are sent together to the summarizing task distribution node, where the scores of the Top-k results are compared uniformly to obtain the final Top-k result, which is sent to the user.
16. Create a history record set of output results, storing the data points output each time and recording how many times each data point has been output. After n queries the results are close to convergence, and the history record set then serves as a Top-k common data set, which reduces the number of uses.
17. Record the hyperplane coefficient vector and the corresponding k-cut point at each output.
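Steps 11, 14 and 15 of the embodiment can be sketched as follows. The embodiment keeps an ascending list and inserts by linear comparison; the sketch swaps in a binary min-heap from Python's `heapq` module, which keeps the same k best items but is not the list procedure described in the text. The function names, the tuple layout `(score, point)`, and the reading of "cold zone" as "no coordinate exceeds the k-cut point" are assumptions for illustration.

```python
import heapq


def local_top_k(points, weights, k_cut, k):
    """Score every non-cold point on one node and keep the k best."""
    heap = []  # min-heap of (score, point); heap[0] is the weakest kept item
    for p in points:
        if all(x <= m for x, m in zip(p, k_cut)):
            continue  # cold zone data is removed before sorting (step 11)
        sc = sum(x * w for x, w in zip(p, weights))
        if len(heap) < k:
            heapq.heappush(heap, (sc, p))
        elif sc > heap[0][0]:
            heapq.heapreplace(heap, (sc, p))  # drop the weakest, keep the new item
    return sorted(heap, reverse=True)


def merge_top_k(node_results, k):
    """Combine the per-node lists into the final Top-k result (step 15)."""
    return heapq.nlargest(k, (item for res in node_results for item in res))


w, cut = (0.5, 0.5), (5, 5)
node_a = local_top_k([(9, 9), (8, 1), (1, 1), (7, 7), (2, 9)], w, cut, k=2)
node_b = local_top_k([(10, 10), (6, 6)], w, cut, k=2)
print(merge_top_k([node_a, node_b], k=2))
```

Because each node returns at most k items, the summarizing node only has to compare k items per node rather than the full data set.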
It should be understood that the above embodiments of the present invention are only examples given to illustrate the invention clearly and are not intended to limit its embodiments. Those skilled in the art can make other variations or modifications on the basis of the above description; it is not possible to list all embodiments exhaustively, and all obvious variations or modifications derived from the technical solution of the invention remain within its scope of protection.

Claims (5)

1. A big data Top-k query method based on a self-adaptive data set partitioning mode, characterized by comprising the following steps:
S1: initializing the system, and constructing a hyperplane cluster and a data set;
the step S1 includes:
S101: setting the request weight value assigned to the j-th element in the user query request as $p_j$, the column vector formed by all $p_j$ being $P$, with
$P^T = [p_1, p_2, p_3, \dots]$;
S102: setting the j-th dimension attribute variable as $y_j$, the column vector formed by all $y_j$ being $Y$, with
$Y^T = [y_1, y_2, y_3, \dots]$;
S103: constructing a hyperplane cluster $F$ according to the query request weight vector $P$, with
$F = Y^T P$;
S104: determining the dimensionality of the data set as $N$ and denoting the data of the data set by $x_{ij}$;
S2: adaptively partitioning the data set to obtain a stable k-cut point;
the step S2 includes:
S201: obtaining the maximum value $p_{j\max}$ of each dimension of the data set, determining the data set space, and mapping each dimension onto the fixed interval $[0, 10]$; the maximum point is denoted $M_0$, and $M_0 = (x_{\max,1}, x_{\max,2}, x_{\max,3}, \dots)$ is taken as the initial point;
S202: establishing a virtual coordinate system with $N$ coordinate axes, and placing all data in the coordinate system;
S203: defining the k-cut point $M$: let $M = (m_1, m_2, m_3, \dots, m_j, \dots)$; in an $N$-dimensional data set, lines drawn through the k-cut point $M$ parallel to each coordinate axis cut the data set space into $2^N$ parts; the proportion between the coordinates of the k-cut point $M$ in the individual dimensions is fixed, and 3 kinds of regions appear in the partitioned data set;
the 3 kinds of regions of the partitioned data set include hot areas, cold areas, and other areas, wherein
any data point of the hot area is outside the space enclosed by the hyperplane cluster and the positive directions of the coordinate axes;
any data point of the cold area is inside the space enclosed by the hyperplane cluster and the positive directions of the coordinate axes;
the other areas are the areas of the data set other than the cold area and the hot area;
S204: searching for a suitable $M$ with a variable speed step length, so that the hot area data formed by the cutting lines of all dimensions contains k data points, which ensures that at least k data points lie outside the hyperplane for any query request weight values;
S205: obtaining a stable k-cut point by the variable speed step search method;
said $p_j$ denotes the request weight value assigned to the j-th element in a user query request, the $p_j$ form the column vector $P$ with $P^T = [p_1, p_2, p_3, \dots]$, and every $p_j$ satisfies
$p_j \in (0, 1)$, $\sum p_j = 1$,
Figure FDA0003274285560000022
If the user input weight is not in the (0,1) interval, mapping the user input weight into the (0,1) interval;
S3: performing a Top-k sorting query on the data set;
the step S3 includes:
S301: receiving query request information, and constructing a hyperplane cluster according to the request dimension weight $P$: $Y^T P = F$; substituting the k-cut point to determine the hyperplane $Y^T P = F_i$ corresponding to the request;
S302: calculating the evaluation scores of the data other than the cold area data according to the query request weight vector, and performing the sorting query with a Top-k query algorithm;
S4: adaptively adjusting the system data set and establishing a common data set.
2. The big data Top-k query method according to claim 1, wherein the variable speed step search method comprises:
(1) setting an initial step length $h_0$, a step change rate $v$, a convergence strength $s \ge 1$, and the initial point $M_0 = (x_{\max,1}, x_{\max,2}, x_{\max,3}, \dots)$, and mapping each dimension coordinate into the range (0, 100);
(2) letting $i = 0$, $h_i = h_0$, $M_{i+1} = M_i - h_i$; the data set has $l$ data points whose every attribute value is greater than that of $M_{i+1}$, and said data is stored;
(3) if $l > s \times k$, executing step (4); if $k < l < s \times k$, finishing the calculation and obtaining a stable k-cut point; if $l < k$, executing step (5);
(4) letting $i = i + 1$, $h_i = v \cdot h_i$, $M_{i+1} = M_i + h_i$, and returning to step (3);
(5) letting $i = i + 1$, $h_i = v \cdot h_i$, $M_{i+1} = M_i - h_i$, and returning to step (3).
3. The big data Top-k query method according to claim 2, wherein the initial step length $h_0 = 10$ and the convergence strength $s \ge 1$.
4. The big data Top-k query method according to claim 1, wherein the step S4 includes:
S401: for a data set whose data changes frequently, adjusting the data partitioning state and the k-cut point $M$ as new data is added to the data set;
S402: creating a history record set of output results, storing the data points output each time and recording how many times each data point has been output; after n queries the results are close to convergence, and the history record set then serves as a Top-k common data set, which reduces the number of uses;
S403: recording the hyperplane coefficient vector and the corresponding k-cut point at each output.
5. The big data Top-k query method according to claim 4, wherein the adjusting according to the addition of new data to the data set comprises:
(1) comparing each dimension attribute of the incoming data with the corresponding dimension attribute of the k-cut point $M$:
if, for every dimension $j$, the data satisfies $p_{j\text{-new}} \ge p_{j\text{-}M}$, then the data point falls within the hot zone data range;
if, for every dimension $j$, the data satisfies $p_{j\text{-new}} \le p_{j\text{-}M}$, then the data point falls within the cold zone data range, where $p_{j\text{-new}}$ represents the request weight value assigned to the j-new element in the user query request, and $p_{j\text{-}M}$ represents the request weight value assigned to the j-M element in the user query request;
otherwise, the data point falls within the data range of the other regions;
(2) if the amount of hot zone data increases beyond a predetermined threshold of the total amount of data in that zone, returning to step S205 and continuing the variable speed step search until the convergence condition is satisfied.
CN201711305053.4A 2017-12-11 2017-12-11 Big data Top-k query method based on self-adaptive data set partitioning mode Expired - Fee Related CN108304449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711305053.4A CN108304449B (en) 2017-12-11 2017-12-11 Big data Top-k query method based on self-adaptive data set partitioning mode

Publications (2)

Publication Number Publication Date
CN108304449A CN108304449A (en) 2018-07-20
CN108304449B true CN108304449B (en) 2022-02-15

Family

ID=62870459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711305053.4A Expired - Fee Related CN108304449B (en) 2017-12-11 2017-12-11 Big data Top-k query method based on self-adaptive data set partitioning mode

Country Status (1)

Country Link
CN (1) CN108304449B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831660B (en) * 2020-07-16 2021-03-30 深圳大学 Method and device for evaluating metric space division mode, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1659509A2 (en) * 2004-11-22 2006-05-24 AT&T Corp. Adaptive processing of top-k queries in nested-structure arbitrary mark-up language such as XML
CN105117497A (en) * 2015-09-28 2015-12-02 上海海洋大学 Ocean big data master-slave index system and method based on Spark cloud network
CN106296343A (en) * 2016-08-01 2017-01-04 王四春 A kind of e-commerce transaction monitoring method based on the Internet and big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
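Efficient Keyword-Based Search for Top-K Cells in Text Cube; Bolin Ding et al.; IEEE Transactions on Knowledge and Data Engineering; 20110210; full text *
A big data Top-K query method in a cloud environment (一种云环境下的大数据Top-K查询方法); 慈祥 et al.; Journal of Software (软件学报); 20141231; full text *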

Also Published As

Publication number Publication date
CN108304449A (en) 2018-07-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220215