CN104809210B - A weighted top-k query method for massive data based on a distributed computing framework

Info

Publication number: CN104809210B
Application number: CN201510209691.0A
Authority: CN
Inventors: 何洁月; 罗浩
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2015-04-28
Filing date: 2015-04-28
Publication date: 2017-12-26
Anticipated expiration: 2035-04-28
Also published as: CN104809210A

Abstract

The invention discloses a top-k query optimization method for massive data under the Spark distributed computing framework. The massive data set is partitioned in advance, mainly with a grid-like data partitioning method, so that the raw data set is split into different sub-data sets. Then, according to the weights the user assigns to each attribute of the data objects and the query value k, a small number of suitable sub-data sets are selected and queried in place of the whole data set. Experimental results show that the proposed method queries quickly and scales well. Compared with the traditional top-k query method and with the angle-and-distance-based data partitioning method, it improves query speed and can return the information the user needs within a short time.

Description

A weighted top-k query method for massive data based on a distributed computing framework

Technical field

The present invention relates to a data query method, in particular to a weighted top-k query method over massive data sets.

Background art

A top-k query, also called a rank-aware query, is one of the most basic operations in a database and an important tool for data analysis. In business analysis in particular, one often needs to look only at the most useful data rather than the whole data set.

A top-k query is defined as follows. Let $D = \{T_1, T_2, \ldots, T_n\}$ denote the set of all data objects, where $T_i$ is the i-th data object; each data object has d dimensions and is a point in space. For a top-k query Q(f, k), f is the scoring function and k is the number of results to return. f is a weighted-sum function: for a data object $T(t_1, t_2, \ldots, t_d)$ in the sample, the user assigns a weight vector $W(w_1, w_2, \ldots, w_d)$ to the attributes of the data object, and the score of each data object is the weighted sum of its attribute values, i.e. the scoring function is:

$$f_W(T) = \sum_{j=1}^{d} w_j t_j$$
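As a point of reference, the definition above can be sketched as a brute-force query that scores every object with the weighted sum and keeps the k highest. The function names `score` and `top_k_bruteforce` and the sample data are illustrative only and do not come from the patent.

```python
import heapq
from typing import Sequence

def score(obj: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted-sum score f_W(T) = sum_j w_j * t_j."""
    return sum(w * t for w, t in zip(weights, obj))

def top_k_bruteforce(data, weights, k):
    """Return the k objects with the highest score, best first."""
    return heapq.nlargest(k, data, key=lambda obj: score(obj, weights))

# Hypothetical 2-attribute objects; every object is scored on every query.
data = [(2, 9), (5, 5), (8, 1), (7, 6)]
print(top_k_bruteforce(data, (0.5, 0.5), 2))
```

This baseline touches the whole data set on every query, which is the cost the partitioning method described below tries to avoid.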

Because a top-k query only needs a result set of k elements, it can be answered by sorting far less data than the input data set, without processing all the data. With the explosive growth of data in recent years, massive data sets pose great challenges for data storage, management and analysis. As a basic operation in data analysis, a top-k query must return results quickly. For example, among the massive number of products on Taobao, a user assigns different weights to product attributes according to personal preference, and the system then quickly returns the top k products that match the user's request.

However, top-k queries over massive data face two major challenges: first, the data reaches TB or PB scale, so traditional centralized data processing no longer applies; second, query results for massive data sets must be obtained both quickly and accurately.

Top-k queries in traditional centralized data systems hit a performance bottleneck on massive data and are therefore unsuited to massive data sets. In traditional distributed environments, some research improves query efficiency by caching query results, but this does not itself solve the massive-data top-k query problem. Other work processes data with Skyline queries and proposes DiTo, a complete top-k processing framework, but this too targets traditional distributed environments.

In recent years, the most basic solution to the top-k problem in cloud environments has been to sort all the data and return the first k results. Every query then has to process the raw data set, which causes redundant work and long query times, so this approach is not advisable. The RanKloud work computes a threshold for terminating a query early from statistics gathered at run time under the MapReduce framework, but this method cannot guarantee k exact results. Other research uses a caching mechanism: a new query is compared for similarity with the queries already cached, and if the similarity is high no new query is run. This speeds up queries, but the results are inaccurate. Top-k queries based on angle-and-distance data partitioning have also been proposed, but in the angle-based partitioning scheme the coordinate conversion is complex and time-consuming, so it is likewise unsuited to massive data sets.

Summary of the invention

Object of the invention: to overcome the deficiencies of the prior art, the present invention provides a weighted top-k query method for massive data based on a distributed computing framework, which solves the technical problem that existing top-k queries cannot obtain query results quickly, accurately and easily when processing massive data.

Technical solution: to achieve the above object, the present invention adopts the following technical solution.

The following 4 reasonable assumptions are made first:

(1) Every attribute value of a data object is non-negative; even negative values can be made non-negative by normalizing the data.

(2) The data set is relatively stable, i.e. the rate of data updates is negligible relative to the whole data set within a certain period. For example, although the product data on JD.com is updated constantly, given the huge number of products it can be regarded as changing little within a given period. Accordingly, the method of the invention does not apply to stream data processing.

(3) The data is distributed roughly uniformly in space; for massive data sets this assumption holds in many scenarios.

(4) An input weight vector W satisfies $\sum_{j=1}^{d} w_j = 1$; if it does not, it can be made to by normalization.

On the basis of these 4 assumptions, a weighted top-k query method for massive data based on a distributed computing framework is proposed, comprising the following steps performed in order:

Step 1, establish data space

First, the attribute values of all data objects, each with d attributes, are converted to non-negative values and normalized. A d-dimensional coordinate system is established whose axes correspond one-to-one to the attributes of the data objects, and all data objects are placed in this coordinate system to form the data space;
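A minimal sketch of this preparation step follows. The patent does not say which normalization is used; min-max scaling per attribute is an assumption made here, and the helper names `min_max_normalize` and `normalize_weights` are hypothetical.

```python
def min_max_normalize(data):
    """Scale every attribute (column) to [0, 1]; negative values become non-negative.

    Min-max scaling is an assumption; the source only says the values are
    converted to non-negative values and normalized.
    """
    dims = range(len(data[0]))
    lo = [min(row[j] for row in data) for j in dims]
    hi = [max(row[j] for row in data) for j in dims]
    return [
        tuple((row[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
              for j in dims)
        for row in data
    ]

def normalize_weights(weights):
    """Rescale non-negative weights so that they sum to 1 (assumption 4)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]

print(min_max_normalize([(-2, 10), (0, 20), (2, 30)]))
print(normalize_weights([2, 1, 1]))
```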

Step 2, data division

Starting from the origin of the coordinate system, the whole coordinate system is divided into m regions from the inside outward. m should not be too large, otherwise the amount of computation increases; at current data scales m is typically 3 to 5, a range that balances data scale against computation. As data scales grow in future, m can be increased appropriately to reduce the total amount of data in each region. The regions are numbered 1, 2, ..., m from the outside inward, and the boundary of region 1 together with the coordinate axes encloses all data objects. Within any one region the maximum value of every attribute is the same, and every point on the outer boundary of a region has at least one coordinate equal to that maximum. Taking the maximum attribute value of region 1 to be $a_1$, the maximum attribute value of the i-th region is

$$a_i = a_1 \left(\frac{m-i+1}{m}\right)^{1/d}, \quad i = 1, 2, \ldots, m.$$

With the regions divided in this way, assumption (3) above implies that every region holds the same amount of data.
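The division can be sketched as follows, assuming the regions are nested hypercubes of equal volume, which is what equal data volume per region under the uniformity assumption suggests: $a_i = a_1((m-i+1)/m)^{1/d}$. The helper names `region_bounds` and `region_of` are hypothetical.

```python
def region_bounds(a1, m, d):
    """Upper bounds [a_1, ..., a_m] of m equal-volume regions in d dimensions.

    Assumes nested hypercubes [0, a_i]^d, so that a_i^d - a_(i+1)^d = a_1^d / m.
    """
    return [a1 * ((m - i + 1) / m) ** (1.0 / d) for i in range(1, m + 1)]

def region_of(point, bounds):
    """Region number (1 = outermost) of a point, decided by its largest attribute."""
    top = max(point)
    for i in range(len(bounds), 0, -1):   # try the innermost region first
        if top <= bounds[i - 1]:          # bounds[i - 1] is a_i
            return i
    raise ValueError("point lies outside region 1")

bounds = region_bounds(a1=1.0, m=3, d=2)
print(bounds)
print(region_of((0.5, 0.5), bounds), region_of((0.7, 0.2), bounds), region_of((0.9, 0.1), bounds))
```

A point belongs to region i when its largest attribute lies in the interval $(a_{i+1}, a_i]$, so assigning a region needs only one pass over the attributes of the point.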

For every region except the outermost one, the point that belongs to the region and at which every attribute takes its maximum value is taken as the base point. The part of the coordinate system in which every attribute value is greater than or equal to the corresponding attribute value of the base point is marked off. These newly marked-off areas are numbered 1, 2, ..., M from the outside inward, where M = m - 1, and serve as judgment areas.

According to the Skyline principle, for any two points, point 1 and point 2, if all attribute values of point 1 are less than those of point 2, then point 1 dominates point 2. By the same reasoning, given two data objects $T_1$ and $T_2$, if for all $i \in \{1, \ldots, d\}$ the attribute value of $T_1$ is greater than or equal to the corresponding attribute value of $T_2$, i.e. $T_1.t_i \ge T_2.t_i$, where $t_i$ is the value of the i-th attribute, then for any input weight vector $W(w_1, w_2, \ldots, w_d)$ the score of $T_1$ is at least the score of $T_2$, i.e. $f_W(T_1) \ge f_W(T_2)$.
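The step from attribute-wise comparison to score order can be written out in one line. Non-negative weights are assumed here; the source does not state this explicitly, but the argument needs it.

```latex
f_W(T_1) - f_W(T_2) \;=\; \sum_{j=1}^{d} w_j \,\bigl(T_1.t_j - T_2.t_j\bigr) \;\ge\; 0,
\qquad \text{since } w_j \ge 0 \text{ and } T_1.t_j \ge T_2.t_j \text{ for every } j .
```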

From this analysis, the score of every data object in a judgment area is at least as large as the score of every data object in the regions lying inside the region to which that judgment area belongs. The algorithm returns the k highest-scoring objects, where k is the number of data objects in the returned result set. So once the number of data objects in some judgment area is greater than or equal to k, the k objects need not be taken from the regions inside the region to which that judgment area belongs. The judgment areas are therefore tested as follows:

In ascending order of number, test in turn whether $N_i \ge k$ holds, where $N_i$ is the number of data objects in judgment area i and k is the number of data objects in the result set the algorithm returns. As soon as judgment area i satisfies the inequality, the test stops, and the i outermost regions are taken as the search region.
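The test above can be sketched as a simple loop. The list `bounds` is assumed to hold the region maxima $[a_1, \ldots, a_m]$, so judgment area i consists of the points whose attributes are all at least $a_{i+1}$. Falling back to all m regions when no judgment area holds k objects is an assumption, since the source does not cover that case; the name `choose_search_depth` is hypothetical.

```python
def choose_search_depth(data, bounds, k):
    """Return i such that the i outermost regions form the search region."""
    m = len(bounds)
    for i in range(1, m):                 # judgment areas 1 .. M = m - 1
        base = bounds[i]                  # a_(i+1): coordinate of the base point
        n_i = sum(1 for p in data if min(p) >= base)
        if n_i >= k:
            return i                      # first area holding at least k objects
    return m                              # assumed fallback: search every region

bounds = [1.0, 0.8, 0.5]                  # illustrative values for a_1, a_2, a_3
data = [(0.9, 0.85), (0.95, 0.9), (0.6, 0.7), (0.55, 0.9), (0.1, 0.2)]
print(choose_search_depth(data, bounds, k=2))
```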

Further, in the present invention, the innermost region of the search region, whose number is i, is subdivided as follows:

The region is divided into d + 1 blocks, of which d blocks are candidate retrieval areas; the remaining part of the region is the mandatory retrieval area;

The candidate retrieval areas are numbered n = 1, 2, ..., d. For any data point $T_{nj}(t_{n1}, t_{n2}, \ldots, t_{nd})$ in the n-th candidate retrieval area, where $t_{nj}$ is the attribute value of the data point on axis j, $t_{nj}$ satisfies the following 2 inequalities:

$0 \le t_{nj} \le 2a_{i+1} - a_i$, for $1 \le j \le d$ and $j \ne n$ (1)

$a_i - a_{i+1} \le t_{nj} \le a_i$, for $j = n$ (2)

In the n-th candidate retrieval area, the data point whose attribute value on one axis is $a_i$ and whose attribute values on all remaining axes are $2a_{i+1} - a_i$ is taken as the maximum boundary point of that candidate retrieval area;

For the maximum boundary point of each candidate retrieval area, check whether any of the user-specified attribute weights $w_j$ satisfies the following condition:

$$w_j > \frac{1}{2}$$

If an attribute weight $w_j$ satisfying the above condition exists for a maximum boundary point, the search range is narrowed to the i-1 outermost regions, the mandatory retrieval area of the i-th region, and the candidate retrieval area to which that maximum boundary point belongs;

If no attribute weight $w_j$ satisfying the above condition exists, the search range is narrowed to the i-1 outermost regions and the mandatory retrieval area of the i-th region.

Provided that $N_i \ge k$ holds for judgment area i, the subdivision splits the innermost region of the search region into a mandatory retrieval area and candidate retrieval areas, and the judgment rule then selects which candidate retrieval area, if any, is actually searched, further narrowing the search range. By the argument above, the score at the maximum boundary point of each candidate retrieval area is at least the score of every other data point in that area. Therefore, if the score at the maximum boundary point of a candidate retrieval area is less than the score at the base point of judgment area $N_i$, that candidate area need not be searched; otherwise it must be searched.

For illustration, take one candidate retrieval area whose maximum boundary point is $T(a_i, 2a_{i+1}-a_i, 2a_{i+1}-a_i, \ldots, 2a_{i+1}-a_i)$; the base point of the corresponding judgment area $N_i$ is $T(a_{i+1}, a_{i+1}, \ldots, a_{i+1})$. Substitute these coordinates into the scoring function. If $(a_i, 2a_{i+1}-a_i, \ldots, 2a_{i+1}-a_i) \cdot W > (a_{i+1}, a_{i+1}, \ldots, a_{i+1}) \cdot W$ with $W = (w_1, w_2, \ldots, w_d)$, then, using $\sum_j w_j = 1$, the inequality can be rewritten as $a_i w_1 + (2a_{i+1}-a_i)(1-w_1) > a_{i+1}$, which gives $w_1 > \frac{1}{2}$, and the candidate retrieval area corresponding to this maximum boundary point must be searched. Judging all candidate retrieval areas in the same way yields the unified expression $w_j > \frac{1}{2}$.

It further follows that if a candidate retrieval area satisfies the condition, the attribute weight $w_j$ for which the inequality holds is necessarily the weight of the attribute whose value at that area's maximum boundary point is $a_i$. During the traversal it is therefore enough to substitute, for each candidate area, the weight of the attribute whose value at the maximum boundary point is $a_i$ into $w_j > \frac{1}{2}$; the area must be searched if the inequality holds. Once one attribute weight satisfying the inequality is found, the remaining candidate areas need not be examined: because the attribute weights are normalized, two or more weights cannot satisfy the inequality at the same time. So when selecting a candidate area, one can quickly check from the user's input whether the largest attribute weight satisfies $w_j > \frac{1}{2}$; if it does, the candidate area whose maximum boundary point has $a_i$ as its j-th attribute is the one to search.
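The selection rule can be sketched in a few lines, together with a direct numeric check of the score comparison it is derived from. The rule $w_j > 1/2$ follows from comparing the score at the maximum boundary point with the score at the base point when the weights sum to 1. The function names and the values of $a_i$ and $a_{i+1}$ below are illustrative assumptions.

```python
def candidate_to_search(weights):
    """Index j of the only candidate area worth searching (w_j > 1/2), or None."""
    for j, w in enumerate(weights):
        if w > 0.5:
            return j
    return None

def boundary_beats_base(weights, j, a_i, a_next):
    """Direct check: score at candidate j's maximum boundary point > score at the base point."""
    low = 2 * a_next - a_i                       # value on every axis except j
    boundary = sum(w * (a_i if n == j else low) for n, w in enumerate(weights))
    base = a_next * sum(weights)
    return boundary > base

w = (0.1, 0.7, 0.2)                              # hypothetical normalized weights
print(candidate_to_search(w))
print([boundary_beats_base(w, j, a_i=1.0, a_next=0.9) for j in range(len(w))])
```

With normalized weights at most one weight can exceed 1/2, so the loop returns on the first match without missing another candidate.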

Beneficial effect：

The weighted top-k query method for massive data based on a distributed computing framework provided by the invention proposes a new data partitioning scheme similar to grid partitioning. It first determines a preliminary search region by comparing the amount of data in a judgment area with the number k of data objects in the result set, which greatly reduces the search range. It then further narrows the search range within the innermost region of the search region, so that the final search region is smaller and query efficiency and speed improve.

According to statistics, when a data space holding 1 billion records is divided into m = 3 regions, the amount of data in the outermost judgment area varies with the dimension d (the number of attributes per data object) as shown in the table below:

Table 1

As the table shows, when the dimension d is less than 8, the outermost judgment area still holds 18 data objects. In practice the required result set is usually small, for example 10 results, so in most cases it suffices to query the data in the outermost region. The search range is thereby cut by at least 2/3 and a large amount of irrelevant data is filtered out. The method of the invention therefore significantly improves top-k query performance on massive data and speeds up top-k queries over massive, higher-dimensional data sets.

Brief description of the drawings

Fig. 1 is a schematic diagram of the data partitioning method of the present invention;

Fig. 2 is a schematic diagram of the data subdivision method of the present invention;

Fig. 3 compares query time across data dimensions for three data partitioning modes, where DistImprove is the method of the invention, AngleDistTop_k is the angle-and-distance-based data partitioning method, and BasicTop_k queries the raw data set without partitioning;

Fig. 4 compares query time for different user input weights at dimension 4;

Fig. 5 shows the speed-up ratio of the method of the invention on different numbers of cluster nodes;

Fig. 6 is a flow chart of the top-k query process in the specific implementation of the present invention.

Embodiment

The present invention is further described below in conjunction with the accompanying drawings.

The experiments were run on a 7-node Spark cluster. Spark is deployed on Hadoop and uses Hadoop's YARN resource manager and the HDFS file storage system. Of the 7 nodes, the master node serves as both the Driver node and a worker node, and the remaining 6 nodes are worker nodes. The basic configuration of the experimental environment is given in Table 2 below:

Table 2

A uniformly distributed data set is used. Each record has 8 attributes, each an integer in the range [0, 1000]; 1 billion records are generated in total, about 40 GB of data. 4-dimensional and 6-dimensional data sets are generated in the same way as the 8-dimensional one; all data sets are randomly generated and each contains 1 billion records.

Unless otherwise stated, the experiments below use equal weights and k = 100, and each reported query time is the average of 10 runs. Because data preprocessing is done only once and is not repeated for each query, the query times below exclude the substantial preprocessing time. For the 8-dimensional data, preprocessing with the method of the invention takes about 42 minutes.

As shown in Fig. 6, this embodiment is divided into two major steps:

Step 1: data preprocessing. The raw data set is partitioned into different Blocks according to the data partitioning method proposed here; each Block is labeled and then stored on HDFS. In Spark, HDFS storage consists mainly of the disks of the worker nodes. Queries then follow the data partitioning scheme and the judgment rule for selecting query data set out in the claims. The partitioning proceeds as follows:

First step: overall partitioning from the inside outward

Following the equal-area principle, the whole data space is divided from the inside outward into m = 3 large partitions. Fig. 1 illustrates the partitioning in two dimensions: the horizontal x-axis corresponds to attribute 1 and the vertical y-axis to attribute 2, and the solid lines divide the space into 3 equal parts, numbered 1, 2, 3 from the outside inward. For every region except the outermost one, the point that belongs to the region and at which every attribute takes its maximum value is taken as the base point, and the part of the coordinate system in which every attribute value is greater than or equal to the corresponding attribute value of the base point is marked off; these areas are labeled A and B from the outside inward. Test in turn whether $n_A \ge k$ or $n_B \ge k$ holds. If the amount of data $n_A$ in square A satisfies $n_A \ge k$, only the data in region 1 needs to be queried; otherwise check whether the amount of data $n_B$ in square B satisfies $n_B \ge k$. For queries over massive data, the amount of data in square A is generally far greater than k, so querying the data in region 1 alone yields the top-k results. To reduce the amount of data queried further, each large partition is subdivided.

Second step: subdivision of each large region

Large partitions 1 and 2 are each divided further. Large partition 1, for example, can be divided as shown in Fig. 2 into three parts, (ABC), D and E, where the three regions A, B and C have equal area. A, B and C form the mandatory retrieval area, and D and E are the candidate retrieval areas.

The following fact also holds: for a data object $T_1$ with score $f_W(T_1) = \sum_{j=1}^{d} w_j t_j$, $f_W(T_1)$ is proportional to the length of the projection of the spatial data point $T_1$ onto the straight line through the origin with direction vector $W = (w_1, \ldots, w_d)$, so the projection length can be used to measure the score.

Thus, assuming the number of data objects in square A is at least k, whether D and E need to be searched can be checked as follows:

If the user's input weights satisfy $w_1 + w_2 = 1$ and $w_1 = w_2 = 0.5$, compare, in the figure, the distance from the origin to the projection of the maximum boundary point d of region D onto the straight line through the origin with direction W (here the line $y = x$) with the distance from the origin to the projection of the base point a of square A onto the same line: the two are equal. Similarly, the distance from the origin to the projection of the maximum boundary point e of region E onto that line equals the distance from the origin to the projection of base point a. Therefore D and E need not be queried; querying regions A, B and C alone yields the top-k results. By the same reasoning, the top-k results found in A, B and C necessarily lie in the part to the upper right of the dashed line in Fig. 2;

By the same principle, if $w_1 > w_2$, region D need not be queried; if $w_1 < w_2$, region E need not be queried. The proofs are omitted here;

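To sum up, the regions to query are:

$$\text{query regions} = \begin{cases} A \cup B \cup C, & w_1 = w_2 \\ A \cup B \cup C \cup E, & w_1 > w_2 \\ A \cup B \cup C \cup D, & w_1 < w_2 \end{cases}$$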

Generalizing to a d-dimensional data space, let $S_i$, $1 \le i \le 3$, denote one of the 3 large partitions; let $S_{ij}$, $1 \le j \le d$, denote the j-th sub-partition of $S_i$ analogous to D or E; and let $S_{i(d+1)}$ denote the sub-partition of $S_i$ analogous to A, B, C. When the weight of the j-th attribute satisfies $w_j > \frac{1}{2}$, only 2 sub-partitions of $S_i$ need to be queried, namely $S_{i(d+1)}$ and $S_{ij}$; otherwise only the one sub-partition $S_{i(d+1)}$ needs to be queried.

Step 2: query processing. For a user query f(W, k), the Driver node selects part of the data set to query according to the query input. This embodiment takes k = 100 with equal weights on all attributes, in which case only the data in $S_{1(d+1)}$ needs to be queried.
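The two steps can be simulated end to end on a single machine and checked against a brute-force scan. This is a plain-Python sketch, not the Spark implementation of the embodiment. It assumes equal-volume regions with $a_i = a_1((m-i+1)/m)^{1/d}$, weights that are non-negative and sum to 1, and points in $[0, a_1]^d$; the name `pruned_top_k` and the synthetic data are hypothetical.

```python
import heapq
import random

def score(p, w):
    """Weighted-sum score of point p under weights w."""
    return sum(wj * tj for wj, tj in zip(w, p))

def pruned_top_k(data, w, k, m=3, a1=1.0):
    """Top-k over the selected blocks only; returns (results, points scanned)."""
    d = len(w)
    # a[i] holds a_(i+1) of the text (a[0] = a_1), assuming equal-volume regions.
    a = [a1 * ((m - i) / m) ** (1.0 / d) for i in range(m)]
    depth = m
    for i in range(1, m):                      # judgment areas 1 .. m - 1
        if sum(1 for p in data if min(p) >= a[i]) >= k:
            depth = i
            break
    if depth == m:                             # assumed fallback: scan everything
        selected = list(data)
    else:
        outer, inner = a[depth - 1], a[depth]  # a_i and a_(i+1)
        low = 2 * inner - outer                # 2 * a_(i+1) - a_i
        keep = next((j for j, wj in enumerate(w) if wj > 0.5), None)
        selected = []
        for p in data:
            top = max(p)
            if top <= inner:
                continue                       # regions inside the search region
            if top <= outer:                   # p lies in the innermost searched region
                j = p.index(top)
                in_candidate = all(t <= low for n, t in enumerate(p) if n != j)
                if in_candidate and j != keep:
                    continue                   # candidate area that need not be searched
            selected.append(p)
    return heapq.nlargest(k, selected, key=lambda p: score(p, w)), len(selected)

random.seed(0)
points = [tuple(random.random() for _ in range(3)) for _ in range(20000)]
weights = (0.2, 0.6, 0.2)
result, scanned = pruned_top_k(points, weights, k=10)
print(scanned, "of", len(points), "points scanned")
```

In the embodiment the blocks are labeled once during preprocessing and stored on HDFS, so a query reads only the chosen blocks; the sketch recomputes block membership on every call purely to stay self-contained.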

Fig. 3 compares query time across data dimensions for the three data partitioning modes, where DistImprove is the method of the invention, AngleDistTop_k is the angle-and-distance-based data partitioning method, and BasicTop_k queries the raw data set without partitioning. The experiment shows that the method of the invention is faster than the angle-and-distance-based partitioning method, improving query speed by about 15%, and that query time grows steadily as the dimension increases, without large fluctuations.

Because the user's weight input $W = (w_1, w_2, \ldots, w_d)$ affects the size of the data range queried, Fig. 4 shows the query times for different weight values in 4 dimensions. The first class is strongly biased toward one attribute weight and comprises $W_2$ = (0.06, 0.06, 0.07, 0.8) and $W_4$ = (0.56, 0.14, 0.25, 0.03); the second class, $W_1$ = (0.25, 0.25, 0.25, 0.25), uses equal weights; the third class, $W_3$ = (0.16, 0.32, 0.34, 0.18), has no pronounced bias. In the figure, the query times for $W_1$ and $W_3$ are roughly equal, as are those for $W_2$ and $W_4$, and the times for $W_1$ and $W_3$ are shorter than those for $W_2$ and $W_4$. The main reason is that in low-dimensional data sets different weights require different data blocks to be queried: $W_2$ and $W_4$ are strongly biased toward one attribute weight, so more data blocks must be queried and the query time exceeds that of $W_1$ and $W_3$.

The scalability of the method of the invention is shown in Fig. 5, which plots the speed-up ratio on the 8-dimensional data set for different numbers of nodes. The speed-up ratio is close to the ideal: as the number of processors (workers) doubles, execution speed roughly doubles as well, i.e. the data partitioning method of the invention scales well.

The above is only the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the invention.

Claims

1. A weighted top-k query method for massive data based on a distributed computing framework, characterized by comprising the following steps performed in order:

Step 1, establish data space

First, the attribute values of all data objects, each with d attributes, are converted to non-negative values and normalized; a d-dimensional coordinate system is established whose axes correspond one-to-one to the attributes of the data objects, and all data objects are placed in the coordinate system to form the data space;

Step 2, data division

Starting from the origin of the coordinate system, the whole coordinate system is divided into m regions from the inside outward; the regions are numbered 1, 2, ..., m from the outside inward, and the boundary of region 1 together with the coordinate axes encloses all data objects. Within any one region the maximum value of every attribute is the same, and every point on the outer boundary of a region has at least one coordinate equal to that maximum. Taking the maximum attribute value of region 1 to be $a_1$, the maximum attribute value of the i-th region is $a_i = a_1 \left(\frac{m-i+1}{m}\right)^{1/d}$, $i = 1, 2, \ldots, m$. For every region except the outermost one, the point that belongs to the region and at which every attribute takes its maximum value is taken as the base point, and the part of the coordinate system in which every attribute value is greater than or equal to the corresponding attribute value of the base point is marked off; these newly marked-off areas are numbered 1, 2, ..., M from the outside inward, where M = m - 1, and serve as judgment areas, which are tested as follows:

In ascending order of number, test in turn whether $N_i \ge k$ holds, where $N_i$ is the number of data objects in judgment area i and k is the number of data objects in the result set the algorithm returns; as soon as judgment area i satisfies the inequality, the test stops, and the i outermost regions are taken as the search region.

2. The weighted top-k query method for massive data based on a distributed computing framework according to claim 1, characterized in that the innermost region of the search region, whose number is i, is subdivided as follows:

The region is divided into d + 1 blocks, of which d blocks are candidate retrieval areas; the remaining part of the region is the mandatory retrieval area;

The candidate retrieval areas are numbered n = 1, 2, ..., d. For any data point $T_{nj}(t_{n1}, t_{n2}, \ldots, t_{nd})$ in the n-th candidate retrieval area, where $t_{nj}$ is the attribute value of the data point on axis j, $t_{nj}$ satisfies the following 2 inequalities:

$0 \le t_{nj} \le 2a_{i+1} - a_i$, for $1 \le j \le d$ and $j \ne n$ (1)

$a_i - a_{i+1} \le t_{nj} \le a_i$, for $j = n$ (2)

In the n-th candidate retrieval area, the data point whose attribute value on one axis is $a_i$ and whose attribute values on all remaining axes are $2a_{i+1} - a_i$ is taken as the maximum boundary point of that candidate retrieval area;

For the maximum boundary point of each candidate retrieval area, check whether any of the user-specified attribute weights $w_j$ satisfies the following condition: $w_j > \frac{1}{2}$;

If an attribute weight $w_j$ satisfying the above condition exists for a maximum boundary point, the search range is narrowed to the i-1 outermost regions, the mandatory retrieval area of the i-th region, and the candidate retrieval area to which that maximum boundary point belongs;

If no attribute weight $w_j$ satisfying the above condition exists, the search range is narrowed to the i-1 outermost regions and the mandatory retrieval area of the i-th region.