CN104809210B - Weighted top-k query method for massive data under a distributed computing framework - Google Patents
- Publication number
- CN104809210B (application number CN201510209691.0A)
- Authority
- CN
- China
- Prior art keywords
- region
- data
- attribute
- point
- maximum
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
- G06F16/24578—Query processing with adaptation to user needs using ranking
Abstract
The invention discloses a top-k query optimization method for massive data under the Spark distributed computing framework. The massive data set is partitioned in advance, mainly with a grid-like data partitioning method, so that the raw data set is divided into different sub-data sets. Then, according to the weights the user assigns to each attribute of the data objects and the query value k, a small number of suitable sub-data sets are selected and queried instead of the whole data set. Experimental results show that the proposed method queries quickly and scales well. Compared with the traditional top-k query method and with the data partitioning method based on angle and distance, it improves query speed and can return the information the user needs within a short time.
Description
Technical field
The present invention relates to a data query method, in particular to a weighted top-k query method for massive data sets.
Background art
A top-k query, also called a rank-aware query, is one of the most basic operations in a database and an important tool for data analysis. In business analysis in particular, often only the most useful data need attention rather than the whole data set.
A top-k query is defined as follows. Let $D=\{T_1,T_2,\dots,T_n\}$ denote the set of all data objects, where $T_i$ is the i-th data object; each data object has d dimensions and is a point in space. For a top-k query $Q(f,k)$, f is the score function and k is the number of results to return that satisfy the query. f is a weighted-sum function: for a data object $T(t_1,t_2,\dots,t_d)$ in the sample, the user assigns a weight vector $W(w_1,w_2,\dots,w_d)$ to the attributes of the data object, and the score of each data object is the weighted sum of its attribute values, i.e. the scoring function is:
$$f_W(T)=\sum_{i=1}^{d} w_i\,t_i$$
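The weighted-sum score and the brute-force query it implies can be illustrated with a short Python sketch. The function names `score` and `top_k` are hypothetical and chosen only for illustration; the sketch sorts nothing in advance and scans every object, which is exactly the baseline the rest of the description tries to avoid.

```python
import heapq
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def score(point: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of attribute values: sum of w_i * t_i over all attributes."""
    if len(point) != len(weights):
        raise ValueError("point and weights must have the same dimension")
    return sum(w * t for w, t in zip(weights, point))

def top_k(points: List[Point], weights: Sequence[float], k: int) -> List[Point]:
    """Brute-force baseline: score every object and keep the k best, best first."""
    return heapq.nlargest(k, points, key=lambda p: score(p, weights))

if __name__ == "__main__":
    data = [(1.0, 2.0), (3.0, 1.0), (0.0, 5.0)]
    print(top_k(data, (0.5, 0.5), 2))
```

`heapq.nlargest` keeps only k candidates in memory at a time, which matches the observation that the result set has just k elements.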
Because a top-k query only needs a result set of k elements, it suffices to sort a set of data much smaller than the input data set; the global data need not be processed. With the explosive growth of data scale in recent years, massive data volumes pose great challenges to data storage, management and analysis. As a basic operation in data analysis, a top-k query must return its result quickly. For example, among the massive number of commodities on Taobao, a user assigns different weights to commodity attributes according to personal preference, and the system then quickly returns the top k commodities that meet the user's request.
However, top-k queries over massive data face two major challenges. First, the data scale reaches the TB or PB level, so traditional centralized data processing methods no longer apply. Second, query results must be obtained quickly and accurately for massive data sets.
Top-k queries in traditional centralized data systems hit a performance bottleneck on massive data and are therefore unsuitable for processing massive data sets. In traditional distributed environments, some studies improve query efficiency by caching query results, but this does not in itself solve the top-k query problem for massive data. Others process data with the Skyline query method and propose a complete top-k processing framework, DiTo, but this too targets a traditional distributed environment.
In recent years, the most basic solution to the top-k problem in cloud environments has been to sort all the data and return the first k results. This approach processes the raw data set on every query, which causes redundant work and long query times, so it is not advisable. RanKloud computes, under the MapReduce framework, a threshold for terminating the query early from statistics gathered at run time, but this method cannot guarantee k exact results. Other work uses a caching mechanism that compares a new query with the cached queries for similarity and, if they are similar enough, skips the new query; this speeds up querying but the results are inexact. A top-k query based on angle- and distance-based data partitioning has also been proposed, but in the angle-based partitioning scheme the coordinate conversion is complex and time-consuming, so it is also unsuitable for massive data sets.
Summary of the invention
Object of the invention: To overcome the deficiencies of the prior art, the present invention provides a weighted top-k query method for massive data based on a distributed computing framework, to solve the technical problem that existing top-k queries cannot obtain query results quickly, accurately and easily when processing massive data.
Technical solution: To achieve the above object, the technical solution adopted by the present invention is as follows.
The following four reasonable assumptions are made first:
(1) Every attribute value of a data object is non-negative; even a negative value can be made non-negative by normalizing the data.
(2) The data set is relatively fixed, or its update rate is negligible relative to the whole data set over a certain period. For example, although the commodity data of Jingdong (JD.com) is updated constantly, given the huge commodity base it can be regarded as changing little within a given period. The method of the invention therefore does not apply to stream data processing.
(3) The data are approximately uniformly distributed in space; for massive data sets this assumption holds in many scenarios.
(4) An input weight vector W satisfies $\sum_{j=1}^{d} w_j = 1$; if it does not, it can be made to by normalization.
On the basis of these four assumptions, a weighted top-k query method for massive data under a distributed computing framework is proposed, comprising the following steps performed in order:
Step 1, establish the data space
First convert the attribute values of all data objects, each comprising d attributes, to non-negative values and normalize them; then establish a d-dimensional coordinate system whose axes correspond one-to-one with the attributes of the data objects, and place all data objects in the coordinate system to form the data space;
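Step 1 can be sketched as follows. The text does not say which normalization is used, so this example assumes per-attribute min-max scaling to [0, 1]; the function name `normalize` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]

def normalize(points: List[Point]) -> List[Point]:
    """Min-max scale each attribute to [0, 1] so every value is non-negative.

    Assumption: min-max scaling is only one possible normalization; the
    source text merely requires non-negative, normalized attribute values.
    """
    d = len(points[0])
    lows = [min(p[j] for p in points) for j in range(d)]
    highs = [max(p[j] for p in points) for j in range(d)]
    scaled = []
    for p in points:
        scaled.append(tuple(
            # A constant attribute carries no ranking information; map it to 0.
            (p[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(d)
        ))
    return scaled

if __name__ == "__main__":
    print(normalize([(-2.0, 10.0), (0.0, 20.0), (2.0, 30.0)]))
```

After this step every data object is a point in the unit hypercube, so the outermost region bound is 1.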
Step 2, data partitioning
Taking the origin of the coordinate system as the starting point, divide the whole coordinate system into m regions from the inside out. The value of m must not be too large, otherwise the amount of computation increases; at the current data scale m is typically 3 to 5, a range that balances data scale against computation. As data scale grows further, m can be increased appropriately to reduce the amount of data in each resulting region. Number the regions 1, 2, ..., m from the outside in, so that the boundary of region 1 together with the coordinate axes encloses all data objects. Within any one region the maximum value of every attribute is the same, and every point on the outer boundary of a region has at least one coordinate equal to that region's maximum attribute value. Let the maximum attribute value of region 1 be $a_1$; then the maximum attribute value of the i-th region is
$$a_i = a_1\left(\frac{m-i+1}{m}\right)^{1/d},\quad i=1,2,\dots,m$$
With the regions divided in this way, assumption (3) implies that every region holds the same amount of data.
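A minimal sketch of the region boundaries follows. It assumes the equal-volume reading of the text: nested hypercubes with side $a_i$ whose shells all have volume $a_1^d/m$, which gives $a_i = a_1((m-i+1)/m)^{1/d}$. The helper names `region_bounds` and `region_of`, and the convention that a point lying exactly on a boundary belongs to the inner region, are assumptions made for illustration.

```python
from typing import List, Sequence

def region_bounds(a1: float, m: int, d: int) -> List[float]:
    """Return [a_1, ..., a_m], the maximum attribute value of each region.

    Region 1 is the outermost. Equal volume per region is assumed, which is
    what makes the data volume equal under a uniform distribution.
    """
    return [a1 * ((m - i + 1) / m) ** (1.0 / d) for i in range(1, m + 1)]

def region_of(point: Sequence[float], bounds: Sequence[float]) -> int:
    """1-based region number: the innermost region whose bound covers the point."""
    peak = max(point)  # a region is determined by the largest coordinate
    region = 1
    for i, a in enumerate(bounds, start=1):
        if peak <= a:
            region = i
    return region

if __name__ == "__main__":
    b = region_bounds(1.0, 3, 2)
    print(b, region_of((0.9, 0.1), b), region_of((0.5, 0.5), b))
```

Only the largest coordinate of a point matters for the region number, so assigning a record to a region costs one pass over its d attributes.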
Except for the outermost region, for each remaining region take as the basic point the point that belongs to that region and whose attribute on every axis is the region's maximum. Mark off the part of the whole coordinate system in which every attribute value is greater than or equal to the corresponding attribute value at the basic point. Number these newly marked-off areas 1, 2, ..., M from the outside in, where M = m-1, and use them as judgment areas.
According to the Skyline principle, for any two points 1 and 2, if every attribute value of point 1 is smaller than that of point 2, then point 1 dominates point 2. On this basis, given two data objects $T_1$ and $T_2$, if $T_1.t_i \ge T_2.t_i$ for all $i$, $1\le i\le d$, where $t_i$ denotes the value of the i-th attribute, then for any input weight vector $W(w_1,w_2,\dots,w_d)$ the score of $T_1$ is at least the score of $T_2$, i.e. $f_W(T_1)\ge f_W(T_2)$.
From this analysis, the score of any data object in a judgment area is necessarily at least the score of every data object in the regions lying inside the region to which that judgment area belongs. The algorithm returns the k highest-scoring data objects, where k is the number of data objects in the result set. So once the number of data objects in some judgment area is greater than or equal to k, the k results never need to be taken from the regions inside the region to which that judgment area belongs. The judgment areas are therefore tested as follows:
In ascending order of number, test in turn whether $N_i \ge k$ holds, where $N_i$ is the number of data objects in judgment area i and k is the number of data objects in the result set returned by the algorithm. As soon as judgment area i satisfies the inequality, stop testing and take the i outermost regions as the search region.
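The judgment-area test can be sketched as below. The function name `choose_search_regions` is hypothetical, the bounds are passed in as a list `[a_1, ..., a_m]`, and falling back to all m regions when no judgment area holds k objects is an assumption that the text does not spell out. In a real deployment the counts $N_i$ would be computed once during preprocessing rather than by scanning the points per query.

```python
from typing import List, Sequence

def choose_search_regions(points: Sequence[Sequence[float]],
                          bounds: Sequence[float], k: int) -> List[int]:
    """Return the region numbers (outermost first) that must be searched."""
    m = len(bounds)
    for i in range(1, m):            # judgment areas 1..M with M = m - 1
        base = bounds[i]             # basic point is (a_{i+1}, ..., a_{i+1})
        # N_i: objects whose every attribute is at least the basic point's value.
        n_i = sum(1 for p in points if all(t >= base for t in p))
        if n_i >= k:
            return list(range(1, i + 1))
    return list(range(1, m + 1))     # assumed fallback: search everything

if __name__ == "__main__":
    pts = [(0.9, 0.9), (0.85, 0.95), (0.6, 0.7), (0.1, 0.2)]
    print(choose_search_regions(pts, [1.0, 0.8, 0.5], 2))
```

The loop stops at the first judgment area that already contains k objects, because every object inside it scores at least as high as anything in the regions further in.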
Further, in the present invention, the innermost region of the search region, numbered i, is subdivided as follows:
Divide the region into d+1 blocks, of which d blocks are candidate retrieval areas and the remainder is the must-retrieve area;
Number the candidate retrieval areas n = 1, 2, ..., d. Any data point $T_{nj}(t_{n1},t_{n2},\dots,t_{nd})$ in the n-th candidate retrieval area, where $t_{nj}$ is the attribute value of the point on the j-th axis, satisfies the following two inequalities:
$0 \le t_{nj} \le 2a_{i+1}-a_i$, for $1\le j\le d$ and $j\ne n$  (1)
$a_i-a_{i+1} \le t_{nj} \le a_i$, for $j=n$  (2)
In the n-th candidate retrieval area, the data point whose attribute value on one axis is $a_i$ and whose attribute value on every other axis is $2a_{i+1}-a_i$ is taken as the maximum boundary point of that area;
For the maximum boundary point of each candidate retrieval area, check whether any attribute weight $w_j$ given by the user satisfies the following condition:
$$w_j > \tfrac{1}{2}$$
If an attribute weight $w_j$ satisfying this condition exists, the search range shrinks to the i-1 outermost regions, the must-retrieve area of region i, and the candidate retrieval area to which that maximum boundary point belongs;
If no attribute weight $w_j$ satisfies this condition, the search range shrinks to the i-1 outermost regions and the must-retrieve area of region i.
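The selection rule can be sketched in a few lines. The sketch assumes the condition is $w_j > 1/2$, which is what the score comparison derived later in the text reduces to when the weights sum to 1; the function name `candidate_block_to_search` is hypothetical.

```python
from typing import Optional, Sequence

def candidate_block_to_search(weights: Sequence[float]) -> Optional[int]:
    """Return the 1-based number of the candidate area to add, or None.

    Assumption: weights are normalized to sum to 1, so at most one weight
    can exceed 1/2 and the first match is the only match.
    """
    for j, w in enumerate(weights, start=1):
        if w > 0.5:
            return j
    return None

if __name__ == "__main__":
    print(candidate_block_to_search((0.06, 0.06, 0.07, 0.8)))   # strongly biased weights
    print(candidate_block_to_search((0.25, 0.25, 0.25, 0.25)))  # equal weights
```

With equal weights no candidate area is added, so only the must-retrieve area of the innermost search region is scanned.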
Provided that $N_i \ge k$ holds for a judgment area, the subdivision splits the innermost search region into a must-retrieve area and candidate retrieval areas, and the judgment principle then selects the appropriate part of the candidate areas to retrieve, narrowing the retrieval range further. By the argument above, the score at the maximum boundary point of each candidate retrieval area is necessarily at least the score of every other data point in that area. Therefore, if the score at the maximum boundary point of a candidate retrieval area is less than the score at the basic point of judgment area $N_i$, that candidate area need not be retrieved; otherwise it must be retrieved.
For convenience of explanation, take one candidate retrieval area whose maximum boundary point has coordinates $T(a_i, 2a_{i+1}-a_i, 2a_{i+1}-a_i, \dots, 2a_{i+1}-a_i)$; the basic point of the corresponding judgment area $N_i$ has coordinates $T(a_{i+1}, a_{i+1}, a_{i+1}, \dots, a_{i+1})$. Substitute these coordinates into the scoring function. If $(a_i, 2a_{i+1}-a_i, \dots, 2a_{i+1}-a_i)\cdot W > (a_{i+1}, a_{i+1}, \dots, a_{i+1})\cdot W$ with $W=(w_1,w_2,\dots,w_d)$, then, because the weights sum to 1, the inequality becomes $a_i w_1 + (2a_{i+1}-a_i)(1-w_1) > a_{i+1}$, from which $w_1 > \tfrac{1}{2}$, and the candidate retrieval area corresponding to this maximum boundary point must be retrieved. Testing every candidate retrieval area in the same way gives the unified expression $w_j > \tfrac{1}{2}$. It follows that if a candidate retrieval area satisfies the condition, the attribute weight $w_j$ that makes the inequality hold must be the weight of the attribute whose value at that area's maximum boundary point is $a_i$. So when traversing, it suffices to substitute, for each candidate area, the weight of the attribute whose value at the maximum boundary point is $a_i$ into $w_j > \tfrac{1}{2}$; the area must be retrieved if the inequality holds. Once one attribute weight satisfying the inequality is found, the other candidate areas need not be checked: because the attribute weights are normalized, two or more weights cannot satisfy the inequality at the same time. When selecting the candidate area, one can therefore quickly test whether the largest attribute weight input by the user satisfies $w_j > \tfrac{1}{2}$; if it does, the candidate area whose maximum boundary point has $a_i$ as its j-th attribute is selected.
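The reduction of the score comparison to a single weight test can be worked through explicitly. The derivation below is a sketch under two stated assumptions: the weights are normalized to sum to 1 (assumption (4)), and $a_i > a_{i+1}$ so that dividing by $a_i - a_{i+1}$ keeps the direction of the inequality. It uses the candidate area whose maximum boundary point has the value $a_i$ on axis 1.

```latex
\begin{aligned}
a_i w_1 + (2a_{i+1}-a_i)\sum_{j=2}^{d} w_j &> a_{i+1}\sum_{j=1}^{d} w_j \\
a_i w_1 + (2a_{i+1}-a_i)(1-w_1) &> a_{i+1} \\
2(a_i-a_{i+1})\,w_1 &> a_i-a_{i+1} \\
w_1 &> \tfrac{1}{2}
\end{aligned}
```

The bounds $a_i$ and $a_{i+1}$ cancel out, which is why the test depends only on the user's weights and not on the region number.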
Beneficial effects:
The weighted top-k query method for massive data under a distributed computing framework provided by the invention proposes a new grid-like data partitioning scheme. Comparing the amount of data in the judgment areas with the result-set size k gives a preliminary search region and greatly narrows the search range; the search range within the innermost search region is then narrowed further, so that the final search region is smaller and query efficiency and speed improve.
According to statistics, when a data space holding 1 billion records is divided into m = 3 regions, the amount of data in the outermost judgment area varies with the dimension d (the number of attributes of each data object) as shown in the following table:
Table 1
As the table shows, when the dimension d is less than 8 the outermost judgment area still holds 18 data objects. In practice the required result set is usually small, for example 10 results, so in most cases querying only the data in the outermost region suffices. This cuts the search range by at least 2/3 and filters out a large amount of irrelevant data. The method of the invention therefore significantly improves top-k query performance on massive data and speeds up top-k queries over massive, higher-dimensional data sets.
Brief description of the drawings
Fig. 1 is a schematic diagram of the data partitioning method of the present invention;
Fig. 2 is a schematic diagram of the data subdivision method of the present invention;
Fig. 3 compares the query time of three data partitioning approaches across data dimensions, where DistImprove is the method of the invention, AngleDistTop_k is the angle- and distance-based data partitioning method, and BasicTop_k is the query on the raw data set without partitioning;
Fig. 4 compares query time for different user input weights at dimension 4;
Fig. 5 shows the speedup of the method of the invention on different numbers of cluster nodes;
Fig. 6 is a flow chart of the top-k query process in a specific implementation of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings.
The experiments were carried out on a Spark cluster of 7 nodes. Spark is deployed on Hadoop, using Hadoop's YARN resource manager and the HDFS file storage system. Of the 7 nodes, the master node serves as both the Driver node and a worker node, and the remaining 6 nodes are worker nodes. The basic configuration of the experimental environment is given in Table 2:
Table 2
A uniformly distributed data set is used. Each record has 8 attributes, each an integer in the range [0, 1000]; 1 billion records were generated in total, about 40 GB of data. 4-dimensional and 6-dimensional data sets were generated in the same way as the 8-dimensional one; all data sets are randomly generated and contain 1 billion records.
Unless otherwise specified, the experiments below use equal weights and k = 100, and each reported query time is the average of 10 runs. Data preprocessing is performed only once and is not repeated for each query, so the query times below exclude the substantial preprocessing time. For 8-dimensional data the preprocessing time of the method of the invention is about 42 minutes.
As shown in Fig. 6, this embodiment consists of two major steps:
Step 1: Data preprocessing. The raw data set is partitioned into different Blocks according to the data partitioning method proposed herein; each Block is labeled and then stored on HDFS. In Spark, HDFS storage is made up mainly of the disks of the worker nodes. Queries follow the data partitioning scheme and the query-data judgment scheme in the claims. The specific partitioning is as follows:
First step: overall partitioning from the inside out
Following the equal-area principle, divide the whole data space from the inside out into m = 3 large subregions. Fig. 1 illustrates the partitioning in two dimensions: the horizontal x-axis corresponds to attribute 1 and the vertical y-axis to attribute 2, and the solid lines divide the space into 3 equal parts, numbered 1, 2, 3 from the outside in. Except for the outermost region, for each remaining region take as the basic point the point that belongs to that region and whose attribute on every axis is the maximum, and mark off the part of the whole coordinate system in which every attribute value is greater than or equal to the corresponding value at the basic point; label these areas A and B from the outside in. Test in turn whether $n_A \ge k$ or $n_B \ge k$ holds: if the amount of data in square A satisfies $n_A \ge k$, only the data in region 1 need be queried; otherwise check whether the amount of data in square B satisfies $n_B \ge k$. For queries over massive data, the amount of data in square A is generally much larger than k, so the top-k result can be obtained from the data in region 1 alone. To reduce the amount of queried data further, each large subregion is subdivided.
Second step: subdivision of each large region
Each of subregions 1 and 2 is divided further. For example, large subregion 1 can be divided into the three areas (ABC), D and E shown in Fig. 2, where the areas A, B and C have equal area; A, B and C form the must-retrieve area, and D and E are candidate retrieval areas.
The following fact also holds: for a data object $T_1$ with score $f_W(T_1)=\sum_{i=1}^{d} w_i t_i$, $f_W(T_1)$ is proportional to the length of the projection of the spatial data point $T_1$ onto the straight line through the origin in the direction of $W$, so the projection length can be used to measure the score function.
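The proportionality between score and projection length follows directly from the definition of the dot product. The block below is a short worked check; the two-dimensional example uses equal weights and the points $d=(2a_2-a_1,\,a_1)$ and $a=(a_2,\,a_2)$, which are the maximum boundary point of a candidate area and the basic point of the outermost judgment area in the notation of the earlier sections.

```latex
\begin{aligned}
f_W(T) &= W\cdot T = \lVert W\rVert \cdot \frac{W\cdot T}{\lVert W\rVert}
        = \lVert W\rVert \cdot \operatorname{proj}_W(T) \\[4pt]
\text{with } W=(\tfrac12,\tfrac12):\quad
f_W(d) &= \tfrac12(2a_2-a_1)+\tfrac12 a_1 = a_2, \qquad
f_W(a) = \tfrac12 a_2+\tfrac12 a_2 = a_2
\end{aligned}
```

Since $\lVert W\rVert$ is the same for every data point in one query, ranking by projection length and ranking by score give the same order, and in the equal-weight example the two points tie.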
Thus, assuming the amount of data in square A is greater than or equal to k, whether D and E need to be retrieved can be verified as follows:
If the weights input by the user satisfy $w_1+w_2=1$ and $w_1=w_2=0.5$, compare in the figure the distance from the origin to the projection of the maximum boundary point d of region D onto the straight line through the origin in the direction of W (here $y=x$) with the distance from the origin to the projection of the basic point a of square A onto the same line: the two are equal. Similarly, the distance from the origin to the projection of the maximum boundary point e of region E onto this line equals the corresponding distance for the basic point a of square A. Therefore regions D and E need not be queried; the top-k result is obtained by querying regions A, B and C only, and by the same reasoning the top-k results found in regions A, B and C necessarily lie in the part above and to the right of the dotted line in Fig. 2;
By the same principle, if $w_1 > w_2$, region D need not be queried; if $w_1 < w_2$, region E need not be queried. The proofs are omitted here;
In summary, the areas to query within large subregion 1 are: A, B and C when $w_1 = w_2$; A, B, C and E when $w_1 > w_2$; A, B, C and D when $w_1 < w_2$.
Generalizing to a d-dimensional data space, let $S_i$, $1\le i\le 3$, denote one of the 3 large subregions divided from the inside out; let $S_{ij}$, $1\le j\le d$, denote the j-th sub-area of $S_i$ analogous to D or E; and let $S_{i(d+1)}$ denote the sub-area of $S_i$ analogous to A, B, C. When the weight of the j-th attribute satisfies $w_j > \tfrac{1}{2}$, only 2 areas of $S_i$ need to be queried, namely $S_{i(d+1)}$ and $S_{ij}$; otherwise only the one area $S_{i(d+1)}$ of $S_i$ needs to be queried.
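A sketch of how a record inside one large subregion could be assigned to its sub-area during preprocessing follows. It assumes the point is already known to lie in region i (its largest coordinate is in $(a_{i+1}, a_i]$) and treats a point as belonging to candidate area n when every coordinate other than n is at most $2a_{i+1}-a_i$; the function name `sub_block` and this exact membership test are illustrative assumptions, not a statement of the claimed inequalities.

```python
from typing import Sequence

def sub_block(point: Sequence[float], a_i: float, a_next: float) -> int:
    """Return j in 1..d for a candidate area S_ij, or d+1 for the must-retrieve area.

    Assumes `point` lies in region i, i.e. a_next < max(point) <= a_i.
    """
    d = len(point)
    limit = 2 * a_next - a_i                   # upper bound on the "low" coordinates
    n = max(range(d), key=lambda j: point[j])  # axis carrying the largest value
    if all(point[j] <= limit for j in range(d) if j != n):
        return n + 1                           # candidate area numbered by its high axis
    return d + 1                               # must-retrieve area

if __name__ == "__main__":
    print(sub_block((0.9, 0.3), 1.0, 0.8))  # high on axis 1, low elsewhere
    print(sub_block((0.9, 0.7), 1.0, 0.8))  # second coordinate too large
```

For the innermost region, where the next bound is 0, the limit is negative and every point falls into the must-retrieve area, so that region is effectively not subdivided.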
Step 2: Query processing. For a user query f(W, k), the Driver node selects part of the data set to query according to the query input. In this embodiment k = 100 and every attribute weight takes the same value, so only the data area $S_{1(d+1)}$ needs to be queried.
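The query step can be sketched on a single machine as follows. This is not the Spark implementation of the embodiment: the blocks are modeled as a plain dictionary keyed by `(region, sub_area)`, and the names `blocks_to_scan` and `query` are hypothetical. The sketch assumes the search regions have already been chosen by the judgment-area test and that the weights are normalized.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
BlockKey = Tuple[int, int]  # (region number, sub-area number)

def blocks_to_scan(search_regions: Sequence[int],
                   weights: Sequence[float], d: int) -> List[BlockKey]:
    """List the blocks a query must read, given the chosen search regions."""
    innermost = max(search_regions)
    keys: List[BlockKey] = []
    for i in search_regions:
        if i < innermost:
            # Outer regions are read in full: all d candidate areas plus the rest.
            keys.extend((i, j) for j in range(1, d + 2))
        else:
            keys.append((i, d + 1))  # must-retrieve area
            # At most one weight exceeds 1/2 when the weights sum to 1.
            keys.extend((i, j) for j, w in enumerate(weights, start=1) if w > 0.5)
    return keys

def query(blocks: Dict[BlockKey, List[Point]], search_regions: Sequence[int],
          weights: Sequence[float], k: int) -> List[Point]:
    """Top-k by weighted sum over the selected blocks only."""
    candidates: List[Point] = []
    for key in blocks_to_scan(search_regions, weights, len(weights)):
        candidates.extend(blocks.get(key, []))
    return heapq.nlargest(
        k, candidates, key=lambda p: sum(w * t for w, t in zip(weights, p)))

if __name__ == "__main__":
    stored = {(1, 1): [(0.9, 0.3)], (1, 2): [(0.3, 0.9)],
              (1, 3): [(0.9, 0.9), (0.7, 0.95)]}
    print(query(stored, [1], (0.5, 0.5), 2))
```

With equal weights and one search region, only block `(1, d+1)` is read, which corresponds to the single data area queried in this embodiment.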
Fig. 3 compares the query time of three data partitioning approaches across data dimensions, where DistImprove is the method of the invention, AngleDistTop_k is the angle- and distance-based data partitioning method, and BasicTop_k is the query on the raw data set without partitioning. The experiments show that the method of the invention is faster than the angle- and distance-based partitioning method, improving query speed by about 15%, and that query time grows steadily with dimension without large fluctuations.
Because the user's weight input $W=(w_1,w_2,\dots,w_d)$ affects the size of the queried data range, Fig. 4 shows the query time for different weight values in 4 dimensions. The first class is strongly biased toward one attribute weight and comprises $W_2=(0.06,0.06,0.07,0.8)$ and $W_4=(0.56,0.14,0.25,0.03)$; the second class, $W_1=(0.25,0.25,0.25,0.25)$, has equal weights; the third class, $W_3=(0.16,0.32,0.34,0.18)$, has no pronounced bias. In the figure, the query times for $W_1$ and $W_3$ are roughly equal, as are those for $W_2$ and $W_4$, and the query times for $W_1$ and $W_3$ are shorter than those for $W_2$ and $W_4$. The main reason is that on low-dimensional data sets different weights require different data blocks to be queried: $W_2$ and $W_4$ are strongly biased toward one attribute weight, so more data blocks must be queried, which makes their query time longer than that of $W_1$ and $W_3$.
The scalability of the method of the invention is shown in Fig. 5, which gives the speedup on the 8-dimensional data set for different numbers of nodes. The speedup is close to the ideal: as the number of processors (workers) doubles, execution speed also roughly doubles, i.e. the data partitioning method of the invention scales well.
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (2)
1. A weighted top-k query method for massive data under a distributed computing framework, characterized in that it comprises the following steps performed in order:
Step 1, establish the data space
First convert the attribute values of all data objects, each comprising d attributes, to non-negative values and normalize the attribute values; establish a d-dimensional coordinate system whose axes correspond one-to-one with the attributes of the data objects, and place all data objects in the coordinate system to form the data space;
Step 2, data partitioning
Taking the origin of the coordinate system as the starting point, divide the whole coordinate system into m regions from the inside out and number the regions 1, 2, ..., m from the outside in, such that the boundary of region 1 together with the coordinate axes encloses all data objects; within any one region the maximum value of every attribute is the same, and every point on the outer boundary of a region has at least one coordinate equal to the maximum attribute value of that region; with the maximum attribute value of region 1 set to $a_1$, the maximum attribute value of the i-th region is $a_i = a_1\left(\frac{m-i+1}{m}\right)^{1/d}$, $i=1,2,\dots,m$; except for the outermost region, for each remaining region take as the basic point the point that belongs to the region and whose attribute on every axis is the maximum, and mark off the part of the whole coordinate system in which every attribute value is greater than or equal to the corresponding attribute value at the basic point; number the marked-off areas 1, 2, ..., M from the outside in, where M = m-1, and use them as judgment areas on which the following test is performed:
In ascending order of number, test in turn whether $N_i \ge k$ holds, where $N_i$ is the number of data objects in judgment area i and k is the number of data objects in the result set returned by the algorithm; as soon as judgment area i satisfies the inequality, stop testing and take the i outermost regions as the search region.
2. The weighted top-k query method for massive data under a distributed computing framework according to claim 1, characterized in that the innermost region of the search region, numbered i, is subdivided as follows:
Divide the region into d+1 blocks, of which d blocks are candidate retrieval areas and the remainder is the must-retrieve area;
Number the candidate retrieval areas n = 1, 2, ..., d, where any data point $T_{nj}(t_{n1},t_{n2},\dots,t_{nd})$ in the n-th candidate retrieval area, with $t_{nj}$ the attribute value of the point on the j-th axis, satisfies the following 2 inequalities:
$0 \le t_{nj} \le 2a_{i+1}-a_i$, for $1\le j\le d$ and $j\ne n$  (1)
$a_i-a_{i+1} \le t_{nj} \le a_i$, for $j=n$  (2)
In the n-th candidate retrieval area, the data point whose attribute value on one axis is $a_i$ and whose attribute value on every other axis is $2a_{i+1}-a_i$ is taken as the maximum boundary point of the n-th candidate retrieval area;
For the maximum boundary point of each candidate retrieval area, check whether any attribute weight $w_j$ given by the user satisfies the following condition:
$$w_j > \tfrac{1}{2}$$
If an attribute weight $w_j$ satisfying this condition exists, the search range shrinks to the i-1 outermost regions, the must-retrieve area of region i, and the candidate retrieval area to which that maximum boundary point belongs;
If no attribute weight $w_j$ satisfies this condition, the search range shrinks to the i-1 outermost regions and the must-retrieve area of region i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510209691.0A CN104809210B (en) | 2015-04-28 | 2015-04-28 | Weighted top-k query method for massive data under a distributed computing framework |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104809210A CN104809210A (en) | 2015-07-29 |
CN104809210B true CN104809210B (en) | 2017-12-26 |
Family
ID=53694032
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510209691.0A Active CN104809210B (en) | Weighted top-k query method for massive data under a distributed computing framework | 2015-04-28 | 2015-04-28 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104809210B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN104809210A (en) | 2015-07-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |