WO2019056570A1 - Fast indexing method and system for location-based top-k keyword queries over a sliding window - Google Patents

Fast indexing method and system for location-based top-k keyword queries over a sliding window

Info

Publication number
WO2019056570A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
node
score
quadtree
data
Prior art date
Application number
PCT/CN2017/113483
Other languages
English (en)
French (fr)
Inventor
毛睿
李荣华
陆敏华
王毅
罗秋明
商烁
刘刚
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学
Publication of WO2019056570A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures

Definitions

  • The invention belongs to the field of computers and relates to indexing methods, in particular to a fast indexing method for location-based top-k keyword queries over a sliding window.
  • The present invention also relates to a fast indexing system for location-based top-k keyword queries over a sliding window.
  • Such messages can be modeled as geotext data streams, which typically provide first-hand information about local events of different types and scales, including regional news stories, urban disasters, local business promotions, and hot topics of public interest in a city.
  • Location-based social media data streams have the following properties: (1) burstiness: if users do not discover data quickly enough, messages about a particular topic are soon buried deep in the data stream [M. G. Ozsoy, K. D. Onal, and I. S. Altingovde. Result diversification for tweet search. In WISE, 2014.]; (2) locality of intent: users in different locations may post messages related to different topics [K. Zhao, L. Chen, and G. Cong. Topic exploration in spatio-temporal document collections. In SIGMOD, 2016.]. Location-based social media generates thousands of messages every second, so it is important to maintain a summary of what occupies users' attention.
  • The present invention considers a new type of top-k query, the location-based top-k keyword query (LkTQ), which returns the k most popular keywords by considering both the word frequency and the location proximity of the geotext data in the sliding window.
  • Figure 1 provides a simple example of LkTQ.
  • the points with square labels represent the query location.
  • the point with the circle label is the address location of the tweet, which is a geotext message.
  • the result of LkTQ is the top k most popular keywords based on location-aware frequency scores, as shown in Figure 1(b).
  • the score of a word is calculated by a linear combination of the keyword frequency and the proximity of the distance between the message containing the word and the query point.
  • Top-k spatial keyword queries, e.g. [G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval of the top-k most relevant spatial web objects. PVLDB, 2009.], [I. D. Felipe, V. Hristidis, and N. Rishe. Keyword search on spatial databases. In ICDE, 2008.], etc.
  • A hybrid index is used to store the location and text information of the objects, and both kinds of information are used to prune the search space during the query. Most such indexes combine a spatial index (for example, an R-tree or a quadtree) with an inverted file to store location and text information.
  • these studies are all aimed at retrieving top-k spatial text objects, which is different from the problem of retrieving top-k keywords.
  • Skovsgaard et al. [A. Skovsgaard, D. Sidlauskas, and C. S. Jensen. Scalable top-k spatio-temporal term querying. In ICDE, 2014.]
  • AFIA: adaptable frequent item aggregator
  • This system divides space at multiple granularities through a multi-layer grid, and a pre-computed summary is kept in each grid cell. The system also uses checkpoints to prevent a counter from entering the top-k counters together with its error. It is a standalone system that relies on a spatio-temporal index.
  • BlogScope [N. Bansal and N. Koudas. Blogscope: a system for online analysis of high volume text streams. In VLDB, 2007.] is a system for mobile news, mailing lists, blogs, and other information. It supports the discovery and tracking of real-world entities (stories, events, etc.) and monitors the most popular keywords as well as temporal or spatial bursts. The biggest flaw of BlogScope is that it cannot aggregate keywords over user-specified spatio-temporal regions. In addition, its timeliness is weak, and it usually only supports searches within a few minutes.
  • NewsStand [B. E. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. Newsstand: a new view on news. In GIS, 2008.] and TwitterStand [J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. Twitterstand: news in tweets. In GIS, 2009.] are two similar systems.
  • NewsStand is a news aggregator for spatial text data that extracts geographic content from RSS feeds into a collection of stories. Users can search for stories related to the query keywords within a geographic area.
  • TwitterStand uses tweets rather than RSS feeds as its data source. Both systems use a spatial text search engine that supports spatio-temporal searches on a small ProMED dataset over a short time span. However, neither system has a good update rate.
  • The technical problem to be solved by the present invention is to provide a fast indexing method for location-based top-k keyword queries over a sliding window that can effectively reduce cost and improve query speed, can prune the search space effectively according to both word frequency and location proximity, and can handle geotext data streams with high arrival rates.
  • the present invention also provides a fast indexing system based on location top-k keyword query under the sliding window.
  • The invention provides a fast indexing method for location-based top-k keyword queries over a sliding window, comprising a phase of constructing a data index model and a query phase;
  • the phase of constructing the data index model specifically includes the following steps:
  • Step 1: determine the geographic range covered by the quadtree and the node splitting rule;
  • Step 2: accept the data stream and insert data into the nodes;
  • Step 3: nodes that meet the splitting rule of step 1 are split, and continued data insertion produces a complete quadtree;
  • Step 4: for each leaf node, count the word frequencies and store an inverted index;
  • Step 5: for each non-leaf node, store the MG aggregation summary of all its child nodes;
  • Step 6: throughout the data insertion of steps 4 and 5, the size of the sliding window is maintained: the data item with the oldest timestamp is deleted, the latest data is added, and the index structure of the quadtree is adjusted;
  • the query phase specifically includes the following steps:
  • the constructed quadtree, the query point, and k are taken as input; a list is created as the result set and initialized to empty; k is the number of result keywords, which the user can specify;
  • a pruning operation is performed according to k and the MG summary of the root node of the constructed quadtree, yielding the candidate result set;
  • a max-heap C is used to store each word in the candidate result set together with its score;
  • C is a priority queue in which all candidate words are stored;
  • the word at the head of C is extracted, and the tree is traversed from the root node toward the leaf nodes; at each level the word's score is replaced with a value smaller than the previous one, until the leaf nodes are reached and the exact score of the word is found, which is then put back into the queue;
  • the previous step is repeated; when the score of the word at the head of the queue equals its exact score computed at the leaf nodes, the word is placed in the result set;
  • the result set is returned.
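The query loop described above can be sketched as a best-first search over a max-heap. This is only an illustrative sketch, not the patented Algorithm 2: the function name `top_k_by_refinement` and the `refine` callback (which stands in for descending one or more levels of the quadtree and returning a tighter score plus a flag saying whether it is exact) are hypothetical, and the sketch assumes that refinement never increases a score.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def top_k_by_refinement(
    upper_bounds: Dict[str, float],
    refine: Callable[[str, float], Tuple[float, bool]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k best (word, exact score) pairs.

    upper_bounds: initial (over-estimated) score per candidate word.
    refine: hypothetical callback returning (tighter score, is_exact).
    """
    # heapq is a min-heap, so scores are negated to pop the largest first.
    heap = [(-score, word, False) for word, score in upper_bounds.items()]
    heapq.heapify(heap)
    result: List[Tuple[str, float]] = []
    while heap and len(result) < k:
        neg_score, word, exact = heapq.heappop(heap)
        if exact:
            # An exact score at the head beats every remaining upper bound.
            result.append((word, -neg_score))
        else:
            new_score, is_exact = refine(word, -neg_score)
            heapq.heappush(heap, (-new_score, word, is_exact))
    return result
```

Because a word is only accepted once its exact score sits at the head of the queue, candidates whose upper bound never reaches the head are never refined at all.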
  • The geographic range covered by the quadtree is determined by the latitude coordinates of its upper-left and upper-right corners.
  • The node splitting rule is: the number of data items in each leaf node must not exceed a threshold M; a leaf node that exceeds it is split into four leaf nodes. Alternatively, the depth of the tree is limited directly.
  • Each leaf node stores a summary of all text information in the messages it contains; the algorithm for computing the MG summary in this step is:
  • k is the number of result keywords, which the user can specify.
  • An MG summary stores k-1 <item, count> pairs; each new item i in the data stream is handled according to one of the following three cases:
  • In step 5, the aggregation process of the MG aggregation summary is:
  • In step 6, if the sliding window is not full, a new message is inserted into a leaf node of the quadtree when it arrives, and the summary of that node is updated. Its parent node then updates its merged summary, and this process repeats until the root node of the quadtree holds the latest aggregate summary. If the sliding window is full, a new message arriving from the data stream is still inserted, and the message with the oldest timestamp is deleted; the index is then updated in the same way as when the window is not full.
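The window maintenance in step 6 can be illustrated with a count-based window backed by a double-ended queue. This is a minimal sketch under stated assumptions: the class name `SlidingWindow`, the message layout, and the assumption that messages arrive in timestamp order are all illustrative and not taken from the patent. The caller would use the returned evicted message to update the affected leaf and then refresh the summaries on the path up to the root.

```python
from collections import deque
from typing import Deque, Optional, Tuple

# (timestamp, (x, y) position, words) -- an assumed message layout
Message = Tuple[float, Tuple[float, float], Tuple[str, ...]]


class SlidingWindow:
    """Count-based sliding window; messages are assumed to arrive in time order."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.messages: Deque[Message] = deque()

    def push(self, message: Message) -> Optional[Message]:
        """Add a message; return the evicted oldest message if the window was full."""
        evicted = None
        if len(self.messages) == self.capacity:
            evicted = self.messages.popleft()  # oldest timestamp sits at the left
        self.messages.append(message)
        return evicted
```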
  • The pruning operation proceeds as follows: after the exact value of k is obtained from the user input, the score of the k-th word is recalculated with the "distance part" of the score set to 0, and the result is used as a lower bound; then, starting from the (k+1)-th word in the root node summary, the "distance part" of each word is recalculated using the largest distance, and the result is used as an upper bound. When the upper-bound score of the i-th (i>k) word is still smaller than the lower-bound score of the k-th word, the words after the i-th cannot reach the top of the priority queue in the following k operations.
  • The score is calculated as follows:
  • Equation (1), which defines the score, is evaluated using the summary stored in each node.
  • Let D be a two-dimensional Euclidean space,
  • W be a sliding window, and
  • S be a set of geotext messages located in D and falling within W.
  • Each message o = (pos, text) consists of a point location pos in D and text information text.
  • The location-aware word frequency score of a word t in the sliding window W is defined as:
  • freq(t) is the number of messages containing the word t
  • |W| is the total number of messages in the sliding window
  • d(q, W_t) is the distance between the query point q and the messages containing t in the window W
  • d_diag is the diagonal length of the rectangular region R
  • |W_t| is the number of messages containing the word t in W
  • α is a parameter that balances the weight between word frequency and location proximity.
  • The score is essentially a linear combination of the frequency of the word in W and the distance between the word and the query point q. The scoring formula is divided into a "frequency part" and a "distance part". Since the MG summary estimates the frequency of any term with a maximum error of n/(k+1), where n is the number of all messages, this maximum error is added to freq when computing the "frequency part". d(q, W_t) is the sum of the distances between the query point and the messages containing the word t; the minimum distance from the query point to the four points of the node containing the word is used as an upper bound. To account for repeated occurrences of the same word, the "distance part" includes a division by the number of messages in the node that contain the word. The two parts are summed using a linear weight parameter α, and the result is normalized to the interval [0, 1];
  • The scores of a word must then be combined to obtain its score over the whole tree: the scores of the word in a set of nodes are added together so that the total is as large as possible. One rule must be observed in this process: the chosen nodes must cover the entire quadtree.
  • the words of the queue header in the C are words that currently have the largest score.
  • The present invention also provides a fast indexing system for location-based top-k keyword queries over a sliding window, comprising a data index model construction module and a query module;
  • the data index model construction module includes a unit for determining the quadtree geographic range and the splitting rule, a data insertion unit, and a quadtree adjustment unit;
  • in the data insertion unit, leaf nodes store inverted indexes, and non-leaf nodes store the MG aggregation summaries of their child nodes;
  • the quadtree adjustment unit inserts new data into the sliding window and deletes the data with the oldest timestamp;
  • the query module includes a result set initialization unit, a pruning operation unit, and a priority queue result storage unit;
  • the result set initialization unit takes the constructed quadtree, the query point, and k as input, creates a list as the result set, and initializes it to empty; k is the number of result keywords, which the user can specify;
  • the pruning operation unit performs a pruning operation according to k and the MG summary of the root node of the constructed quadtree to obtain the candidate result set; the pruning operation substitutes an upper bound for the distance part of the score, narrows the calculation range, and guarantees that k keywords can be returned;
  • the priority queue result storage unit takes the word with the largest score from the priority queue and traverses from the root node until the exact score of the word is found in the leaf nodes; the exact value is put into the queue, and this is repeated until the first k words in the priority queue no longer change.
  • the present invention has the following beneficial effects:
  • the present invention defines a new problem of processing LkTQ by looking up the word frequency and location proximity of a geotext data set to find the top-k local most popular keywords.
  • the present invention proposes a hybrid quadtree index structure with low storage and update cost and a search algorithm with an effective pruning strategy, enabling fast and accurate top-k keyword search.
  • the present invention adds a summary file to each node of the quadtree to store a summary of word frequencies.
  • a non-leaf node maintains an upper bound error by storing a merged summary of its child nodes.
  • The present invention performs a large number of merge operations on quadtree nodes; merging MG summaries is lightweight and preserves the guarantee on frequency accuracy.
  • The invention can effectively reduce cost and improve query speed, can prune the search space effectively according to both word frequency and location proximity, and can process geotext data streams with high arrival rates.
  • The method of the present invention is more accurate than the existing reference method.
  • When the target k is set to a small value, the algorithm produces very accurate results and can guarantee 80% accuracy.
  • FIG. 1 is a schematic diagram of a query example of a location-based top-k keyword query (LkTQ) in a Chinese region; wherein FIG. 1(a) represents information and distance; FIG. 1(b) shows a tag cloud.
  • LkTQ location-based top-k keyword query
  • FIG. 2 is a flow chart of a fast indexing method based on a position top-k keyword query under the sliding window of the present invention.
  • FIG. 3 is a schematic diagram showing the basic structure of an index model of a quadtree according to the present invention.
  • FIG. 4 is a schematic diagram of a framework of a fast indexing system based on a location top-k keyword query in a sliding window of the present invention
  • FIG. 5 is a schematic diagram showing a comparison of time consumption results of updating an index under different data amounts in the experiment of the present invention
  • FIG. 6 is a schematic diagram comparing results when the message capacity of a quadtree node is varied in the experiment of the present invention; FIG. 6(a) compares time cost when the data set contains 10,000 items; FIG. 6(b) compares time cost when the amount of data in the sliding window is varied.
  • FIG. 7 is a schematic diagram comparing results when the target k value is varied in the experiment of the present invention;
  • FIG. 7(a) compares the time cost of the reference algorithm and the algorithm of the present invention;
  • FIG. 7(b) compares time cost as k is varied for sliding windows of different data volumes;
  • FIG. 7(c) compares the number of candidate words before and after pruning as k is varied for sliding windows of different data volumes;
  • Figure 8 is a graph showing the results of the comparison between the algorithm of the present invention and the benchmark algorithm in the experiment of the present invention.
  • Let D be a two-dimensional Euclidean space,
  • W be a sliding window, and
  • S be a set of geotext messages located in D and falling within W.
  • An LkTQ q is a tuple (loc, k), where loc is the query location point and k is the number of result keywords, which the user can specify.
  • The location-aware word frequency score of a word t in the sliding window W is defined as a linear combination of the frequency of the word in W and the distance between the word and the query point q:
  • freq(t) is the number of messages containing the word t
  • |W| is the total number of messages in the sliding window
  • d(q, W_t) is the distance between the query point q and the messages containing t in the sliding window W
  • d_diag is the diagonal length of the rectangular region R
  • |W_t| is the number of messages containing the word t in W
  • α is a parameter that balances the weight between word frequency and location proximity.
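The formula that this list describes did not survive in this text. One form that would be consistent with the listed quantities is sketched below: a frequency part and a distance part, each within [0, 1], combined linearly by a weight α. The score name FS(t), the notation |W| for the total number of messages in the window and |W_t| for the number of messages containing t, and the exact arrangement of terms are assumptions, not a quotation of Equation (1).

```latex
FS(t) \;=\; \alpha \cdot \frac{freq(t)}{|W|}
\;+\; (1-\alpha)\cdot\left(1 - \frac{d(q, W_t)}{d_{diag}\cdot |W_t|}\right)
```

Under this reading, d(q, W_t) is a sum of distances, so dividing by |W_t| gives an average distance, and dividing by d_diag scales it to [0, 1]; words that are frequent and posted close to q receive scores near 1.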
  • Counter-based methods use a fixed number of counters to track items; each monitored item is stored in its own counter, and the monitored items form a subset of S. When an item in the monitored set appears again, its counter is updated. If the item is not in the monitored set and all counters are in use, different algorithms handle the situation differently. For example, the Space-Saving algorithm finds the item with the smallest counter value, replaces it with the new item, and then increments the counter of the new item by one.
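The Space-Saving replacement rule mentioned above can be written in a few lines. This is a minimal sketch for illustration; the function name `space_saving_update` and the use of a plain dictionary (rather than the ordered structure an efficient implementation would use) are assumptions.

```python
from typing import Dict


def space_saving_update(counters: Dict[str, int], item: str, capacity: int) -> None:
    """Process one stream item with at most `capacity` monitored items."""
    if item in counters:
        counters[item] += 1                      # monitored item seen again
    elif len(counters) < capacity:
        counters[item] = 1                       # a free counter is available
    else:
        victim = min(counters, key=counters.get)  # item with the smallest count
        count = counters.pop(victim)
        counters[item] = count + 1               # new item inherits that count, plus one
```

For example, with `capacity=2` and the stream `a, a, b, c`, the arrival of `c` evicts `b` and leaves `{"a": 2, "c": 2}`.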
  • An MG summary stores k-1 (item, count) pairs, and each new item i in the data stream is handled according to one of the following three cases:
  • Sketch-based approaches track the entire set of items through hashing rather than only a subset of it.
  • Items are hashed into the counter space, and the corresponding counter is updated each time an item hashes to it.
  • The CountSketch algorithm [M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In ICALP, 2002.] solves the problem of finding the approximate top keywords with success probability 1-δ.
  • The GroupTest algorithm [G. Cormode and S. Muthukrishnan. What's hot and what's not: tracking most frequent items dynamically. TODS, 2005.] targets queries about frequent items and achieves a constant failure probability δ.
  • the fast indexing method based on the location top-k keyword query in the sliding window of the present invention includes the following steps:
  • the present invention uses a quadtree-based index structure to store geographic text information for searches in the stream.
  • the basic idea of a quadtree is to divide the underlying space into units of different levels. In addition, it iteratively divides the space into four congruent subspaces until the tree reaches a certain depth or reaches a certain stopping condition.
  • Quadtrees are widely used in image processing, spatial data indexing, fast collision detection in two-dimensional environments, sparse data, and so on.
  • The basic structure of the quadtree index model of the present invention is shown in FIG. 3.
  • Nodes drawn with different shapes correspond to the rectangles on the right; each rectangle is split at the center point marked with the same shape into four equal quadrants (each quadrant is a node). The root node (the triangle node in the figure) represents the entire rectangular area. Leaf nodes store inverted indexes, and non-leaf nodes store the merged summaries of their child nodes.
  • The quadtree has a very simple structure, and insertion and update are relatively efficient when the spatial text messages are relatively evenly distributed.
  • The black dots in Figure 3 are messages, shown at the exact locations where they were posted.
  • We set M to be the maximum number of messages stored in a leaf node. In other words, if the number of messages stored in a leaf node exceeds M, the node becomes a non-leaf node and is split into four leaf nodes of equal size.
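The splitting rule with threshold M can be sketched as follows. This is an illustrative sketch only: the class name `QuadNode`, its fields, the quadrant ordering, and the `max_depth` safety limit are assumptions, and the per-node inverted index and MG summary are left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class QuadNode:
    """One quadtree cell covering the rectangle from (x0, y0) to (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float
    capacity: int                 # M: maximum number of messages in a leaf
    depth: int = 0
    max_depth: int = 16           # assumed limit on the depth of the tree
    items: List[Tuple[Point, str]] = field(default_factory=list)
    children: Optional[List["QuadNode"]] = None

    def is_leaf(self) -> bool:
        return self.children is None

    def insert(self, pos: Point, text: str) -> None:
        if not self.is_leaf():
            self._child_for(pos).insert(pos, text)
            return
        self.items.append((pos, text))
        # Split when the leaf holds more than M messages.
        if len(self.items) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self) -> None:
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        d, m, md = self.depth + 1, self.capacity, self.max_depth
        self.children = [
            QuadNode(self.x0, self.y0, mx, my, m, d, md),
            QuadNode(mx, self.y0, self.x1, my, m, d, md),
            QuadNode(self.x0, my, mx, self.y1, m, d, md),
            QuadNode(mx, my, self.x1, self.y1, m, d, md),
        ]
        # Redistribute the stored messages to the four new leaves.
        pending, self.items = self.items, []
        for pos, text in pending:
            self._child_for(pos).insert(pos, text)

    def _child_for(self, pos: Point) -> "QuadNode":
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        index = (1 if pos[0] >= mx else 0) + (2 if pos[1] >= my else 0)
        return self.children[index]
```

The depth limit prevents unbounded splitting when more than M messages share exactly the same location.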
  • the construction of the data index model specifically includes the following steps:
  • The present invention sets a threshold M that the number of data items in each leaf node must not exceed; a leaf node that exceeds it is split into four leaf nodes. The depth of the tree can also be limited directly;
  • each leaf node of the quadtree stores a summary of all the text information in the messages it contains.
  • The algorithm for computing the MG summary (called Algorithm 1) is:
  • k is the number of result keywords, which the user can specify.
  • An MG summary stores k-1 <item, count> pairs; each new item i in the data stream is handled according to one of the following three cases:
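The three cases themselves are not reproduced in this text. In the standard Misra-Gries (MG) algorithm they are: the item is already monitored (increment its counter); the item is new and fewer than k-1 counters are in use (start a counter at 1); otherwise decrement every counter and drop those that reach zero. The sketch below follows that standard formulation as an assumption about Algorithm 1; the function names are illustrative.

```python
from typing import Dict, Hashable, Iterable


def mg_update(counters: Dict[Hashable, int], item: Hashable, k: int) -> None:
    """Process one item with a Misra-Gries summary of at most k-1 counters."""
    if item in counters:
        counters[item] += 1                  # case 1: item already monitored
    elif len(counters) < k - 1:
        counters[item] = 1                   # case 2: a free counter exists
    else:
        for key in list(counters):           # case 3: decrement all counters
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]


def mg_summary(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Build an MG summary of a whole stream."""
    counters: Dict[Hashable, int] = {}
    for item in stream:
        mg_update(counters, item, k)
    return counters
```

Each stored count underestimates the true frequency, which is why an error term is added back when the summary is used for scoring.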
  • both the leaf node and the non-leaf node store a summary of the message.
  • In leaf nodes the summary is computed by the procedure in Algorithm 1 above, whereas in non-leaf nodes the summary is obtained with the MG summary merge method.
  • the present invention employs the MG digest instead of the SS digest.
  • the process of merging MG summaries is also very simple.
  • the aggregation process of MG summary information is:
  • This step generates up to 2k counters. A pruning operation follows: the values of the 2k counters are arranged in ascending order, the (k+1)-th counter is taken, and its value is subtracted from all counters. Finally, all non-positive counters are removed. This is a very efficient process: the aggregation can be done with a constant number of sort operations and an O(k) scan of the summary.
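The merge step can be sketched as below. The sketch follows the usual mergeable-summaries formulation, in which the (k+1)-th largest combined count is subtracted; the sort direction in the passage above is hard to pin down after translation, so this choice, the function name `mg_merge`, and the convention that the merged summary keeps at most k counters are assumptions.

```python
from collections import Counter
from typing import Dict


def mg_merge(a: Dict[str, int], b: Dict[str, int], k: int) -> Dict[str, int]:
    """Merge two MG summaries, keeping at most k counters."""
    merged = Counter(a)
    merged.update(b)                      # add counts of matching items
    if len(merged) <= k:
        return dict(merged)
    # Value of the (k+1)-th largest counter.
    threshold = sorted(merged.values(), reverse=True)[k]
    # Subtract it everywhere and drop counters that become non-positive.
    return {w: c - threshold for w, c in merged.items() if c - threshold > 0}
```

For instance, merging `{"x": 5, "y": 3}` with `{"x": 2, "z": 4, "w": 1}` for k = 2 gives combined counts 7, 4, 3, 1; the third largest (3) is subtracted, leaving `{"x": 4, "z": 1}`.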
  • Let D be a two-dimensional Euclidean space,
  • W be a sliding window, and
  • S be a set of geotext messages located in D and falling within W.
  • freq(t) is the number of messages containing the word t
  • |W| is the total number of messages in the sliding window
  • d(q, W_t) is the distance between the query point q and the messages containing t in the window W
  • d_diag is the diagonal length of the rectangular region R
  • |W_t| is the number of messages containing the word t in W
  • α is a parameter that balances the weight between word frequency and location proximity.
  • the score is essentially a linear combination of the word frequency of the word in W and the distance between the word and the query point q.
  • Equation (1) defines the formula for calculating the score. For ease of calculation, we divide the formula into a "frequency part" and a "distance part"; essentially, the score is a linear combination of the two parts. Since the MG summary estimates the frequency of any term with a maximum error of n/(k+1) (n is the number of all messages), we add this maximum error to freq when computing the "frequency part". d(q, W_t) is the sum of the distances between the query point and the messages containing the word t. Here, we use the minimum distance from the query point to the four sides of the node containing the word as an upper bound.
  • The "distance part" contains a division by the number of messages in a node that contain the same word. Finally, we sum the two parts using a linear weight parameter α and normalize the result to the interval [0, 1].
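A per-node upper bound along the lines described above might be computed as follows. This is a sketch under explicit assumptions: the score is taken to have the form α · (frequency part) + (1 − α) · (1 − average distance / d_diag), which is a reading of the definitions rather than a quotation of Equation (1), and the function and parameter names are hypothetical. The frequency part adds the MG error n/(k+1) to the stored count, and the distance part uses the minimum distance from the query point to the node rectangle.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def min_dist_to_rect(q: Point, rect: Rect) -> float:
    """Smallest distance from q to the rectangle (0 if q lies inside it)."""
    x0, y0, x1, y1 = rect
    dx = max(x0 - q[0], 0.0, q[0] - x1)
    dy = max(y0 - q[1], 0.0, q[1] - y1)
    return math.hypot(dx, dy)


def node_score_upper_bound(count: int, n_window: int, k: int, q: Point,
                           node_rect: Rect, d_diag: float, alpha: float) -> float:
    """Upper bound on a word's score inside one node (assumed score form)."""
    # Frequency part: stored MG count plus the maximum error n/(k+1).
    freq_part = (count + n_window / (k + 1)) / n_window
    # Distance part: no message in the node is closer than the node itself.
    dist_part = 1.0 - min_dist_to_rect(q, node_rect) / d_diag
    return alpha * freq_part + (1.0 - alpha) * dist_part
```

Using the minimum distance makes the distance part as large as it can be for that node, so the returned value never underestimates the score under the assumed formula.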
  • α is a parameter used to balance location proximity and word frequency.
  • C is a priority queue that stores all candidate words.
  • To get the candidate words, we extract the summary of the root node of the quadtree. However, if candidate words are stored in many nodes and are numerous while the user-specified k is small, computing scores for words that cannot appear in the result wastes a large amount of time. Therefore, we devised a pruning strategy that avoids unnecessary calculations while guaranteeing that no candidate word is lost.
  • The pruning operation is as follows: after we get the exact value of k from the user input, we recalculate the score of the k-th word with the "distance part" set to 0 and use the result as a lower bound. Next, starting from the (k+1)-th word of the root node summary (the summary is already sorted), we recalculate the "distance part" of these words, using the largest distance, to obtain an upper bound. When the upper-bound score of the i-th (i>k) word is still smaller than the lower-bound score of the k-th word, we conclude that the words after the i-th cannot reach the top of the priority queue during the following k operations in lines 4-13 of Algorithm 2.
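The pruning rule can be sketched as a single pass over the sorted root summary. The sketch makes several assumptions that go beyond the text: the input is a list of (word, normalized frequency part) pairs sorted in descending order, the lower bound of the k-th word is α times its frequency part (distance part set to 0), and the upper bound of a later word uses the most favourable distance part, 1. The function name is hypothetical.

```python
from typing import List, Tuple


def prune_candidates(sorted_freqs: List[Tuple[str, float]], k: int,
                     alpha: float) -> List[str]:
    """Keep the first k words plus every later word that could still beat the k-th."""
    if len(sorted_freqs) <= k:
        return [w for w, _ in sorted_freqs]
    # Lower bound of the k-th word: distance part set to 0.
    lower = alpha * sorted_freqs[k - 1][1]
    keep = [w for w, _ in sorted_freqs[:k]]
    for word, freq_part in sorted_freqs[k:]:
        # Upper bound: most favourable distance part (assumed to be 1).
        upper = alpha * freq_part + (1.0 - alpha) * 1.0
        if upper < lower:
            break  # the list is sorted, so all later words fail too
        keep.append(word)
    return keep
```

With a frequency-heavy weight such as α = 0.9, low-frequency words are cut early; with α = 0.5 the distance part can compensate, so fewer words are pruned.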
  • The present invention provides a fast indexing system for location-based top-k keyword queries over a sliding window, comprising a data index model construction module and a query module;
  • the data index model construction module includes a unit for determining the quadtree geographic range and the splitting rule, a data insertion unit, and a quadtree adjustment unit;
  • in the data insertion unit, leaf nodes store inverted indexes, and non-leaf nodes store the MG aggregation summaries of their child nodes;
  • the quadtree adjustment unit inserts new data into the sliding window and deletes the data with the oldest timestamp;
  • the query module includes a result set initialization unit, a pruning operation unit, and a priority queue result storage unit;
  • the result set initialization unit takes the constructed quadtree, the query point, and k as input, creates a list as the result set, and initializes it to empty; k is the number of result keywords, which the user can specify;
  • the pruning operation unit performs a pruning operation according to k and the MG summary of the root node of the constructed quadtree to obtain the candidate result set; the pruning operation substitutes an upper bound for the distance part of the score, narrows the calculation range, and guarantees that k keywords can be returned;
  • the priority queue result storage unit takes the word with the largest score from the priority queue and traverses from the root node until the exact score of the word is found in the leaf nodes; the exact value is put into the queue, and this is repeated until the first k words in the priority queue no longer change.
  • the data set containing the tweets was collected in the US region. There are a total of 20,000,000 pieces of data, each of which contains a timestamp, a list of words, and the longitude and latitude of the tweet posting (ie, the user-set geotag). Note that the results of each experiment were averaged over 10 different experiments performed on different query inputs.
  • the index structure used in the benchmark method is also based on a quadtree.
  • In each leaf node of the quadtree we store the exact frequency of each word.
  • To obtain the frequency table of a given node, we need to traverse the subtree of that node down to the leaf nodes. This method returns an exact LkTQ result, so it was used in our follow-up experiments as the measure of the accuracy of query results.
  • constructing a quadtree involves computing and merging all word frequencies.
  • the build process involves computing MG digests for all nodes in the quadtree.
  • the process time consumption of constructing a quadtree of the method of the present invention is much greater than that of the benchmark method.
  • Figure 6 shows the results.
  • Figure 6(a) is a comparison result when the number of data sets is 10,000.
  • M ranges from 100 to 2000.
  • The method of the invention (LkTQ) is much faster than the baseline method. Changing M causes only slight fluctuation.
  • the information capacity of the leaf nodes of the quadtree does not have a significant impact on performance.
  • Once M is fixed, the tree is fixed and all scores can be calculated. In our algorithm, however, M does affect performance. In theory, the larger M is, the smaller the depth of the quadtree. When calculating the score of each node, we use the edge nearest to the query point for the "distance part", so if the tree is deeper, the distance is smaller and the number of leaf nodes is larger.
  • In Fig. 6(b), time consumption increases as M increases: as M grows, the cost of splitting increases. When M is in the range of 300 to 500, time consumption drops slightly, and performance is best in this range.
  • Figure 7 shows the results.
  • the range of the target k is set according to the general needs of the user.
  • The performance of the algorithm of the present invention is significantly better than that of the benchmark method (see Figure 7(a)).
  • The data set size in Figure 7(a) is 10,000; even so, the benchmark method takes approximately 7 minutes to return the result.
  • The time consumption of the benchmark method stays at a stable but inefficient level of approximately 400,000 ms.
  • The benchmark method is slower. For example, processing 5,000 messages takes nearly 12 million milliseconds, and processing 100,000 messages takes nearly 60 million milliseconds, which is very inefficient. We therefore do not compare further results that are not comparable.
  • α is the weight parameter that balances the two parts of the scoring formula. Changing the value of α adjusts the relative influence of distance and word frequency, which is determined by the user's preferences. Experiments show that the results of our algorithm are sensitive to α in the range (0.9, 1.0). When α is set to 0 or 1, the result reflects the influence of distance or word frequency alone. The sensitive range of α is affected by the distribution of the data set. Nevertheless, our experimental results show that the results respond to changes in α, so the algorithm can accommodate users' preferences.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

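The invention discloses a fast indexing method and system for location-based top-k keyword queries over a sliding window, comprising building a data index model and querying. Building the data index model comprises: determining the geographic range covered by the quadtree and the node splitting rule; receiving the data stream and inserting data into nodes; splitting the nodes that meet the splitting rule, so that data insertion produces a complete quadtree; storing an inverted index in each leaf node; storing in each non-leaf node the MG aggregated summary of its child nodes; and adjusting the quadtree structure. Querying comprises: initializing the result set; pruning to obtain the candidate result set; and taking the word with the highest score from the priority queue, traversing from the root node until its exact score is found at the leaf nodes, putting it back into the queue, and repeating until the top k words of the priority queue no longer change. The invention effectively reduces cost and speeds up queries, prunes the search space using both word frequency and location proximity, and can handle geo-textual data streams with high arrival rates.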

Description

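Fast indexing method and system for location-based top-k keyword queries over a sliding window
Technical Field
The invention belongs to the field of computers and relates to indexing methods, in particular to a fast indexing method suitable for location-based top-k keyword queries over a sliding window. The invention also relates to a fast indexing system for location-based top-k keyword queries over a sliding window.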
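Background
With the proliferation of social media, cloud storage, and location-based services, the number of messages containing both text and geographic information (for example, geotagged tweets) has soared. Such messages, which can be modeled as a geo-textual data stream, often provide first-hand information about local events of various types and scales, including news stories in an area, urban disasters, local business promotions, and trending topics of public concern in a city.
Data streams from location-based social media have the following properties: (1) bursty nature: if users do not discover the data quickly enough, messages about a particular topic are soon buried deep in the stream [Ozsoy, Makbule Gulcin, Kezban Dilek Onal, and Ismail Sengor Altingovde. Result diversification for tweet search. In WISE, 2014.]; (2) local-intended nature: users from different locations may post messages related to different topics [Kaiqi Zhao, Lisi Chen, and Gao Cong. Topic exploration in spatio-temporal document collections. In SIGMOD, 2016.]. Location-based social media generate thousands of messages every second, so it is important to maintain a summary of what occupies users' minds.
To address this problem, an existing proposal [A. Skovsgaard, D. Sidlauskas, C. S. Jensen. Scalable top-k spatio-temporal term querying. In ICDE, 2014.] aims to find the k locally most popular keywords in the content within a user-specified spatio-temporal region. In most cases, however, it is difficult for a user to specify a rectangular region in the spatial domain. A user may instead prefer a ranked list that considers both word frequency and location proximity.
Based on this user need, the invention considers a new kind of top-k query, the location-based top-k keyword query (LkTQ), which returns the top-k locally most popular keywords by considering the word frequency and location proximity of geo-textual data over a sliding window.
Figure 1 gives a simple example of an LkTQ. We consider 10 geotagged tweets on a map of China. As shown in Figure 1(a), the point marked with a square is the query location. The points marked with circles are the locations of the tweets, that is, the geo-textual messages. For each geo-textual message, we show its text and its distance to the query point. The result of the LkTQ is the k locally most popular keywords ranked by location-aware frequency score, as shown in Figure 1(b). The score of a word is computed as a linear combination of the keyword frequency and the distance proximity between the messages containing the word and the query point.
A straightforward way to answer an LkTQ is to evaluate every word of the messages in the current sliding window. Specifically, for each such word, we compute its location-aware frequency score with respect to the query point. This approach, however, is very expensive for a large number of geo-textual messages. To process LkTQ efficiently, the following challenges must be addressed. First, returning the exact result of an LkTQ is computationally very expensive, so a highly accurate approximate solution is needed. Second, the location-aware frequency score measures word frequency and location proximity in a continuous fashion, so it is worthwhile to propose a hybrid index structure and a corresponding algorithm that can prune the search space using word frequency and location proximity at the same time. Third, because of the sliding-window setting of LkTQ, the indexing mechanism must be able to handle geo-textual data streams with high arrival rates.
Existing top-k spatial keyword queries (for example [G. Cong, C. S. Jensen, D. Wu. Efficient retrieval of the top-k most relevant spatial web objects. PVLDB, 2009.] and [I. D. Felipe, V. Hristidis, and N. Rishe. Keyword search on spatial databases. In ICDE, 2008.]) return the k most relevant spatio-textual objects by considering location proximity (to the query location) and text similarity (to the query keywords). Hybrid indexes are used to store the location and text information of objects, and both kinds of information are used to prune the search space during querying. Most such indexes combine a spatial index (for example, an R-tree or a quadtree) with an inverted file, which store the location and the text information respectively. These studies, however, all aim to retrieve the top-k spatio-textual objects, which differs from the problem of retrieving the top-k keywords.
Some systems built with related techniques have also appeared. Skovsgaard et al. [A. Skovsgaard, D. Sidlauskas, C. S. Jensen. Scalable top-k spatio-temporal term querying. In ICDE, 2014.] designed a framework that supports indexing, update, and query operations and can return the top-k keywords within a user-defined spatio-temporal region. The system is called the Adaptive Frequent Item Aggregator (AFIA). It is implemented by partitioning space at multiple granularities with multi-layer grids. A precomputed summary is kept in each grid cell. The system also uses checkpoints to prevent a counter, together with its error, from entering the top-k counters. It is a stand-alone system that makes use of spatio-temporal indexing.
BlogScope [N. Bansal and N. Koudas. Blogscope: a system for online analysis of high volume text streams. In VLDB, 2007.] is a system that collects news, mailing lists, blogs, and similar information. It supports discovering and tracking real-world entities (stories, events, and so on), and monitors the most popular keywords as well as temporal or spatial bursts. The biggest drawback of BlogScope is that it cannot aggregate keywords for a user-specified spatio-temporal region. In addition, its timeliness is weak: it typically supports only searches within a few minutes.
NewsStand [B. E. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. Newsstand: a new view on news. In GIS, 2008.] and TwitterStand [J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. Twitterstand: news in tweets. In GIS, 2009.] are two similar systems. NewsStand is a news aggregator for spatio-textual data that extracts geographic content from RSS feeds into story clusters. Users are expected to search for stories related to the query keywords within a geographic region. The difference between the two is that TwitterStand uses tweets rather than RSS feeds as its data source. Both adopt a spatio-textual search engine that supports spatio-temporal search over a short time span on a small ProMED data set. Neither system, however, has a good update rate.
Therefore, there is an urgent need to develop a fast indexing method and system for location-based top-k keyword queries over a sliding window that solve the above technical problems.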
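Summary of the Invention
The technical problem to be solved by the invention is to provide a fast indexing method for location-based top-k keyword queries over a sliding window that effectively reduces cost, speeds up queries, prunes the search space using both word frequency and location proximity, and can handle geo-textual data streams with high arrival rates. To this end, the invention also provides a fast indexing system for location-based top-k keyword queries over a sliding window.
To solve the above technical problem, the invention adopts the following technical solution:
The invention provides a fast indexing method for location-based top-k keyword queries over a sliding window, comprising a stage of building a data index model and a query stage;
The stage of building the data index model comprises the following steps:
Step 1: determine the geographic range covered by the quadtree and the node splitting rule;
Step 2: receive the data stream and insert data into the nodes;
Step 3: split the nodes that meet the node splitting rule of Step 1, so that continued data insertion produces a complete quadtree;
Step 4: for each leaf node, count its word frequencies and store an inverted index;
Step 5: for each non-leaf node, store the MG aggregated summary of all its child nodes;
Step 6: during the data insertion of Steps 4 and 5, maintain the size of the sliding window: delete the data item with the oldest timestamp, add the newest data, and adjust the index structure of the quadtree;
The query stage comprises the following steps:
First step: take the constructed quadtree, the query point, and k as input, and create a list as the result set, initialized to empty; k is the number of result keywords, which the user can specify;
Second step: prune according to the MG summary of the root node of the constructed quadtree and k to obtain the candidate result set;
Third step: use a max-heap C to store each word of the candidate result set together with its score; C is a priority queue holding all candidate words;
Fourth step: while the size of the result set is less than k, repeatedly take the word at the head of queue C and traverse from the root node to the leaf nodes; whenever a level yields a score smaller than the current one, replace the current value, until the leaf nodes are reached and the exact score of the word is found, and then put the word back into the queue;
Fifth step: repeat the fourth step; when the score of the word at the head of the queue equals the exact score of that word at the leaf nodes, put the word into the result set;
Sixth step: when the size of the result set equals k, return the result set.
As a preferred technical solution of the invention, in Step 1, the geographic range covered by the quadtree is determined by the given latitude and longitude coordinates of the upper-left and upper-right corners.
As a preferred technical solution of the invention, in Step 1, the node splitting rule is: the number of data items in each leaf node may not exceed a set threshold M, and a node that exceeds it is split into four leaf nodes; alternatively, the depth of the tree is limited directly.
As a preferred technical solution of the invention, in Step 4, each leaf node stores a summary of all the text information in the messages it contains; the MG summary is computed by the following algorithm:
Given a parameter k, where k is the number of result keywords the user can specify, an MG summary stores k-1 <item, count> pairs, and each new item i in the data stream is handled according to one of three cases:
1) if i is already stored in the current counters, its counter value is increased by 1;
2) if i is not in the monitored set and the number of counters has not yet reached k, i is inserted into the summary with its counter value set to 1;
3) if i is not in the monitored set and the summary already holds k counters, the counter values of all monitored items are decreased by 1, and all items whose counter value is 0 are removed.
As a preferred technical solution of the invention, in Step 5, the MG aggregated summary is obtained by the following merge procedure:
First, at most 2k counters are produced; next comes a pruning operation: the values of these 2k counters are sorted in ascending order, the (k+1)-th counter is taken, and its value is subtracted from all counters; finally, all non-positive counters are removed; the merge completes with a constant number of sorting operations and a scan of the summaries of O(k) complexity.
As a preferred technical solution of the invention, in Step 6, if the sliding window is not yet full, when a new message arrives and is inserted into a leaf node of the quadtree, the summary of that node is updated accordingly; next, its parent node also updates its merged summary; this process iterates upward until the root node of the quadtree holds the latest aggregated summary; if the sliding window is already full, when a new message arrives from the data stream and is inserted, the message with the oldest timestamp is deleted; the index is then updated in the same way as when the sliding window is not full.
As a preferred technical solution of the invention, in the second step, the pruning proceeds as follows: after the exact value of k is obtained from the user input, the score of the k-th word is recomputed with its "distance part" set to 0, and the resulting score serves as a lower bound; next, starting from the (k+1)-th word in the root node summary, the "distance part" of these words is recomputed using the maximum distance, giving an upper bound; when the upper-bound score of the i-th word (i>k) is still smaller than the lower-bound score of the k-th word, the words after the i-th are deemed unable to reach the top of the priority queue within the next k operations.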
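As a preferred technical solution of the invention, in the third step, the score is computed as follows:
(1) The summary stored in each node is used to compute the score; equation (1) defines the score formula.
Let D be a two-dimensional Euclidean space, W a sliding window, and S the set of geo-textual messages located in D and W; each geo-textual message is denoted o=(pos,text), where pos is a location point in D and text is the text information; the location-aware frequency score of a word t in the sliding window W is defined as:
Figure PCTCN2017113483-appb-000001
where freq(t) is the number of messages containing word t, |W| is the total number of messages in the sliding window, d(q,Wt) is the sum of the distances between the query point q and the messages in window W that contain t, ddiag is the diagonal length of the rectangular region R, |Wt| is the number of messages in W that contain word t, and α is the parameter that balances the weight between word frequency and location proximity; the score is in essence a linear combination of the frequency of the word in W and the distance between the word and the query point q; the score formula is divided into a "frequency part"
Figure PCTCN2017113483-appb-000002
and a "distance part"
Figure PCTCN2017113483-appb-000003
Since an MG summary estimates the frequency of any item with an error of at most n/(k+1), where n is the total number of messages, this maximum error is added to freq when computing the "frequency part"; d(q,Wt) is the sum of the distances between the query point and the messages containing word t, and the minimum distance from the query point to the four edges of the node containing the word is used to obtain an upper bound; the computation of the "distance part" must account for redundant computation for the same word, and includes a division by the number of messages in a node in which the same word appears; the two parts are summed with the linear weight parameter α, and the result is normalized to the interval [0,1];
(2) After the score of each word in each node has been obtained, the scores of a word are combined to compute its score over the whole tree; this is done by summing the scores of the word in certain nodes so that the resulting score is as large as possible, subject to the rule that these nodes must cover the entire quadtree.
As a preferred technical solution of the invention, in the fourth step, the word at the head of queue C is the word that currently has the highest score.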
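In addition, the invention provides a fast indexing system for location-based top-k keyword queries over a sliding window, comprising a module for building the data index model and a query module;
The module for building the data index model comprises a unit for determining the quadtree geographic range and splitting rule, a data insertion unit, and a quadtree adjustment unit; the data insertion unit covers storing inverted indexes in leaf nodes and storing in non-leaf nodes the MG aggregated summaries of their child nodes; the quadtree adjustment unit covers inserting new data into the sliding window and deleting the data with the oldest timestamp;
The query module comprises a result-set initialization unit, a pruning unit, and a priority-queue result storage unit; the result-set initialization unit takes the constructed quadtree, the query point, and k as input and creates a list as the result set, initialized to empty, where k is the number of result keywords the user can specify; the pruning unit prunes according to the MG summary of the root node of the constructed quadtree and k to obtain the candidate result set, where the pruning substitutes the upper limit of the distance part in the score computation, narrowing the computation range while ensuring that k keywords can be returned; the priority-queue result storage unit takes the word with the highest score from the priority queue, traverses from the root node until the exact score is found at the leaf nodes, puts the exact value into the queue, and repeats until the top k words of the priority queue no longer change.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention defines a new problem, the processing of LkTQ, which finds the top-k locally most popular keywords by considering the word frequency and location proximity of a geo-textual data set.
2. The invention proposes a hybrid quadtree index structure with low storage and update costs and a search algorithm with an effective pruning strategy, enabling fast and accurate top-k keyword search. In particular, since it is impossible to store every message of a huge data stream, the invention adds a summary file to each node of the quadtree to store a summary of the word frequencies. Non-leaf nodes maintain an upper error bound by storing the merged summary of their child nodes. In addition, the quadtree nodes undergo a large number of merge operations; the merge operation of the MG summary is lightweight and comes with a guarantee on frequency accuracy. The invention effectively reduces cost and speeds up queries, prunes the search space using both word frequency and location proximity, and can handle geo-textual data streams with high arrival rates.
3. Experiments show that the method of the invention is more efficient and answers queries faster than the existing baseline method. With the node capacity M ranging from 100 to 2000, our method is much faster than the baseline method. When M is in the range of 300 to 500, the time consumption drops slightly; performance is best in this range.
4. Experiments show that when the target k is set to a small value, the algorithm returns highly accurate results relative to the exact baseline method and can guarantee 80% accuracy.
5. Experiments show that the method of the invention can accommodate user preferences.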
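Brief Description of the Drawings
The invention is further described below with reference to the drawings and embodiments.
Figure 1 is a schematic diagram of an example location-based top-k keyword query (LkTQ) in the region of China; Figure 1(a) shows the messages and distances; Figure 1(b) shows the tag cloud.
Figure 2 is a flowchart of the fast indexing method for location-based top-k keyword queries over a sliding window according to the invention.
Figure 3 is a schematic diagram of the basic structure of the quadtree index model of the invention.
Figure 4 is a schematic framework diagram of the fast indexing system for location-based top-k keyword queries over a sliding window according to the invention.
Figure 5 compares the time consumed by index updates for different data volumes in the experiments.
Figure 6 compares the results of varying the message capacity of the quadtree leaf nodes in the experiments; Figure 6(a) compares the time cost when the data set contains 10,000 messages; Figure 6(b) compares the time cost for different data volumes in the sliding window.
Figure 7 compares the results of varying the target k in the experiments; Figure 7(a) compares the time cost of the baseline algorithm and the algorithm of the invention; Figure 7(b) compares the time cost for different values of k under different data volumes in the sliding window; Figure 7(c) compares the number of candidate words before and after pruning by k under different data volumes in the sliding window.
Figure 8 compares the accuracy of the algorithm of the invention with that of the baseline algorithm in the experiments.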
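Detailed Description
The invention is now described in further detail with reference to the drawings. The drawings are simplified schematic diagrams that illustrate only the basic structure of the invention, and therefore show only the components relevant to the invention.
I. Problem Definition
Let D be a two-dimensional Euclidean space, W a sliding window, and S the set of geo-textual messages located in D and W. Each geo-textual message is denoted o=(pos,text), where pos is a location point in D and text is the text information. An LkTQ q is defined by a tuple (loc,k), where loc is the query location point and k is the number of result keywords the user can specify. The query returns the k keywords with the highest location-aware frequency scores among the messages in W.
The location-aware frequency score of a word t in the sliding window W is defined as a linear combination of the frequency of the word in W and the distance between the word and the query point q:
Figure PCTCN2017113483-appb-000004
where freq(t) is the number of messages containing word t, |W| is the total number of messages in the sliding window, d(q,Wt) is the sum of the distances between the query point q and the messages in the sliding window W that contain t, ddiag is the diagonal length of the rectangular region R, |Wt| is the number of messages in W that contain word t, and α is the parameter that balances the weight between word frequency and location proximity.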
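The score formula survives in this text only as an image reference. One form that is consistent with the symbol definitions above, and with the later statements that the score is a linear combination normalized to [0,1], is sketched below. It is an assumption reconstructed from those definitions, not a transcription of the original figure.

```latex
\mathrm{score}(t) \;=\; \alpha \cdot \frac{freq(t)}{|W|}
\;+\; (1-\alpha)\cdot\left(1 - \frac{d(q, W_t)}{d_{diag}\cdot |W_t|}\right)
```

Under this reading, the first term is the "frequency part" and the second the "distance part". For example, with α = 0.7, freq(t) = 20, |W| = 100, and an average distance d(q,Wt)/|Wt| equal to 0.1·ddiag, the score would be 0.7·0.2 + 0.3·0.9 = 0.41.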
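II. Frequent Item Computation
Aggregation is a widely studied problem in data stream processing. Existing aggregation techniques can be divided into counter-based methods and sketch-based methods.
Counter-based methods use a fixed number of counters to store items; each monitored item is kept in its own counter, and the monitored set is a subset of S. When an item in the monitored set appears again, its counter is updated. If the item is not in the monitored set and the counters are full, different algorithms handle the situation differently. For example, the Space-Saving algorithm finds the item with the smallest counter value, replaces it with the new item, and then increases the counter of the new item by 1.
Another popular algorithm, the MG summary, is also very simple to implement. Given a parameter k, an MG summary stores k-1 (item, count) pairs, and each new item i in the data stream is handled according to one of three cases:
(1) if i is already stored in the current counters, its counter value is increased by 1;
(2) if i is not in the monitored set and the number of counters has not yet reached k, i is inserted into the summary with its counter value set to 1;
(3) if i is not in the monitored set and the summary already holds k counters, the counter values of all monitored items are decreased by 1, and all items whose counter value is 0 are removed.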
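The three cases above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function name `mg_update` and the `capacity` parameter are hypothetical, and because the text mentions both k-1 and k counters, the sketch simply takes the number of counters as an argument.

```python
def mg_update(counters, item, capacity):
    """Apply one MG (Misra-Gries) update to `counters` in place.

    counters: dict mapping item -> counter value (the monitored set)
    capacity: maximum number of counters kept (an assumed parameter)
    """
    if item in counters:
        # Case 1: the item is already monitored.
        counters[item] += 1
    elif len(counters) < capacity:
        # Case 2: there is still room for a new counter.
        counters[item] = 1
    else:
        # Case 3: summary is full; decrement everything, drop zeros.
        for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]
    return counters


summary = {}
for term in ["a", "b", "a", "c", "a", "b", "d"]:
    mg_update(summary, term, capacity=2)
print(summary)
```

In this toy stream the word "a" occurs three times but ends with a counter of 1, which illustrates that MG counters underestimate true frequencies by a bounded amount.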
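Other notable counter-based algorithms include LossyCounting [G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In VLDB, 2002.] and Frequent [E. D. Demaine, A. López-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Algorithms - ESA, 2002.; R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. TODS, 2003.].
Sketch-based methods use hashing to monitor the whole set of items rather than only a subset. Items are hashed into a space of counters, and the hashed counters are updated each time the corresponding item is hit. The CountSketch algorithm [M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In ICALP, 2002.] addresses the problem of finding approximate top keywords with success probability 1-δ. The GroupTest algorithm [G. Cormode and S. Muthukrishnan. What's hot and what's not: tracking most frequent items dynamically. TODS, 2005.] aims to answer queries about frequent items with a constant failure probability δ; in practice it is generally accurate. Count-Min Sketch [G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 2005.] is another representative sketch-based method.
Because of hash collisions, sketch-based methods are less accurate than counter-based methods and cannot provide reliable guarantees for frequency estimation. Nor can they guarantee that relative order is preserved over a continuous stream. We therefore use a counter-based method in this work.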
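III. Detailed Flow of the Method
As shown in Figure 2, the fast indexing method for location-based top-k keyword queries over a sliding window comprises the following steps:
1. Stage of building the data index model (the quadtree index model)
For faster indexing, the invention uses a quadtree-based index structure to store the geo-textual messages of the stream that are to be searched. The basic idea of a quadtree is to divide the underlying space into cells at different levels. Specifically, it recursively divides the space into four congruent subspaces until the tree reaches a certain depth or a stopping condition is met. Quadtrees are widely used in image processing, spatial data indexing, fast collision detection in two-dimensional environments, sparse data, and so on. The basic structure of the quadtree index model of the invention is shown in Figure 3. Note that each node shape in the tree corresponds to the cell in the rectangle on the right that is split into four equal quadrants around the center point marked with the same shape (each quadrant is a node); the root node (the triangular node in the figure) represents the entire rectangular region. Leaf nodes store inverted indexes, and non-leaf nodes store merged summaries.
A quadtree has a very simple structure, and when the spatio-textual messages are distributed fairly evenly, it offers relatively high insertion and update efficiency. The black dots in Figure 3 are messages at the exact locations where they were posted. In our algorithm, M is the maximum number of messages stored in a leaf node. In other words, if the number of messages stored in a leaf node exceeds M, the node becomes a non-leaf node and is split into four leaf cells of equal size.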
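The splitting rule can be illustrated with a small Python sketch. The class `QuadNode`, its field names, and the `max_depth` guard are hypothetical choices made for this illustration; the sketch stores raw messages in the leaves and omits the inverted index and summaries.

```python
class QuadNode:
    """Quadtree node: leaves hold messages, inner nodes hold four children."""

    def __init__(self, x0, y0, x1, y1, capacity, depth=0, max_depth=16):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity      # M in the text
        self.depth = depth
        self.max_depth = max_depth    # assumed guard against endless splits
        self.messages = []            # (x, y, text) tuples, leaves only
        self.children = None

    def is_leaf(self):
        return self.children is None

    def insert(self, x, y, text):
        if not self.is_leaf():
            self._child_for(x, y).insert(x, y, text)
            return
        self.messages.append((x, y, text))
        # Split once the leaf holds more than M messages.
        if len(self.messages) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        quadrants = [(x0, y0, mx, my), (mx, y0, x1, my),
                     (x0, my, mx, y1), (mx, my, x1, y1)]
        self.children = [
            QuadNode(*q, self.capacity, self.depth + 1, self.max_depth)
            for q in quadrants
        ]
        # Redistribute the stored messages among the four new leaves.
        pending, self.messages = self.messages, []
        for px, py, text in pending:
            self._child_for(px, py).insert(px, py, text)

    def _child_for(self, x, y):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]


root = QuadNode(0, 0, 100, 100, capacity=2)
for point in [(10, 10, "rain"), (20, 20, "storm"), (80, 80, "sale")]:
    root.insert(*point)
```

The depth guard reflects the alternative stopping condition mentioned in the text (limiting the depth of the tree directly) and keeps many messages at one identical location from splitting forever.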
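Building the data index model comprises the following steps:
(1) First, determine the geographic range covered by the quadtree (generally given by the latitude and longitude coordinates of the upper-left and upper-right corners) and the node splitting rule, whose purpose is to control the depth of the whole quadtree. For example, the invention requires that the number of data items in each leaf node not exceed a set threshold M, and splits a node that exceeds it into four leaf nodes; the depth of the tree can also be limited directly;
(2) Receive the data stream and insert data into the nodes;
(3) Split the nodes that reach the threshold, so that continued data insertion produces a complete quadtree;
(4) For each leaf node, count its word frequencies and store an inverted index; each leaf node of the quadtree stores a summary of all the text information in the messages it contains. The algorithm for computing the MG summary (referred to as Algorithm 1) is:
Given a parameter k, where k is the number of result keywords the user can specify, an MG summary stores k-1 <item, count> pairs, and each new item i in the data stream is handled according to one of three cases:
1) if i is already stored in the current counters, its counter value is increased by 1;
2) if i is not in the monitored set and the number of counters has not yet reached k, i is inserted into the summary with its counter value set to 1;
3) if i is not in the monitored set and the summary already holds k counters, the counter values of all monitored items are decreased by 1, and all items whose counter value is 0 are removed.
(5) For each non-leaf node, store the MG aggregated summary of all its child nodes;
In this MG summary scheme, both leaf nodes and non-leaf nodes store summaries of the messages. In a leaf node, the summary is computed by the procedure of Algorithm 1 above, whereas in a non-leaf node the summary is obtained by merging MG summaries. [P. K. Agarwal, G. Cormode, Z. Huang, J. Phillips, Z. Wei, and K. Yi. Mergeable summaries. In PODS, 2012.] proved that the MG summary and the SS summary are isomorphic and that an SS summary can be derived from an MG summary. Because merging MG summaries is simple and efficient, and the quadtree involves many merge operations, the invention uses the MG summary rather than the SS summary. The merge procedure for MG summaries is as follows:
First, the counters of the summaries being merged are combined, adding the values of counters that belong to the same item; this step produces at most 2k counters. Next comes a pruning operation: the values of these 2k counters are sorted in ascending order, the (k+1)-th counter is taken, and its value is subtracted from all counters. Finally, all non-positive counters are removed. This is clearly an efficient procedure: the merge completes with a constant number of sorting operations and a scan of the summaries of O(k) complexity.
(6) During the data insertion of steps (4) and (5), maintain the size of the sliding window: delete the data item with the oldest timestamp, add the newest data, and adjust the index structure of the quadtree.
Unlike region-based keyword queries [A. Skovsgaard, D. Sidlauskas, C. S. Jensen. Scalable top-k spatio-temporal term querying. In ICDE, 2014.], the location of an LkTQ is a point rather than a specific spatial region. We want to find the k locally most popular keywords by jointly considering location proximity and word frequency. If the sliding window is not yet full, when a new message arrives and is inserted into a leaf node of the quadtree, the summary of that node is updated accordingly. Next, its parent node also updates its merged summary. This process iterates upward until the root node of the quadtree holds the latest merged summary. If the sliding window is already full, when a new message arrives from the data stream and is inserted, the message with the oldest timestamp is deleted. The index is then updated in the same way as when the sliding window is not full.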
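The merge of two MG summaries described in step (5) can be sketched as follows. The function name `mg_merge` is hypothetical. The sketch assumes the variant from the cited mergeable-summaries paper, in which the value of the (k+1)-th largest combined counter is subtracted; this is one reading of the ordering described in the text, not a confirmed detail of the patent's implementation.

```python
def mg_merge(left, right, k):
    """Merge two MG summaries (dicts item -> count) into one of size <= k."""
    # Step 1: combine counters, adding counts of items present in both.
    merged = dict(left)
    for item, count in right.items():
        merged[item] = merged.get(item, 0) + count
    if len(merged) <= k:
        return merged
    # Step 2: take the (k+1)-th largest counter value as the threshold.
    threshold = sorted(merged.values(), reverse=True)[k]
    # Step 3: subtract it everywhere and keep only positive counters.
    return {
        item: count - threshold
        for item, count in merged.items()
        if count - threshold > 0
    }


parent = mg_merge({"a": 5, "b": 3}, {"a": 2, "c": 4, "d": 1}, k=2)
print(parent)
```

A parent node in the quadtree could apply such a merge across its four children, and repeat it upward after each insertion or deletion so that the root summary stays current.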
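2. Query stage (using a best-first query algorithm)
Let D be a two-dimensional Euclidean space, W a sliding window, and S the set of geo-textual messages located in D and W. Each geo-textual message is denoted o=(pos,text), where pos is a location point in D and text is the text information. We first define the location-aware frequency score of a word t in the sliding window W:
Figure PCTCN2017113483-appb-000005
where freq(t) is the number of messages containing word t, |W| is the total number of messages in the sliding window, d(q,Wt) is the sum of the distances between the query point q and the messages in window W that contain t, ddiag is the diagonal length of the rectangular region R, |Wt| is the number of messages in W that contain word t, and α is the parameter that balances the weight between word frequency and location proximity. The score is in essence a linear combination of the frequency of the word in W and the distance between the word and the query point q.
Given a word, two steps are needed to obtain its score:
(1) First, we use the summary stored in each node to compute a score. Equation (1) defines the score formula. To simplify the computation, we divide the formula into a "frequency part"
Figure PCTCN2017113483-appb-000006
and a "distance part"
Figure PCTCN2017113483-appb-000007
In essence, the score is a linear combination of these two parts. Since an MG summary estimates the frequency of any item with an error of at most n/(k+1) (where n is the total number of messages), we add this maximum error to freq when computing the "frequency part". d(q,Wt) is the sum of the distances between the query point and the messages containing word t; here, we use the minimum distance from the query point to the four edges of the node containing the word to obtain an upper bound.
Since a word may appear more than once within a node, the distance computation must account for redundant computation for the same word. The "distance part" therefore includes a division by the number of messages in a node in which the same word appears. Finally, we sum the two parts using the linear weight parameter α and normalize the result to the interval [0,1].
(2) After we obtain the score of each word in each node, the scores of a word are combined to compute its score over the whole tree. This is done by summing the scores of the word in certain nodes so that the resulting score is as large as possible. In this process, one rule must be observed: these nodes must cover the whole given region (the entire quadtree).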
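The per-node bound from step (1) can be sketched in Python under stated assumptions. The score is assumed to have the form α·freq/|W| + (1-α)·(1 - average distance/ddiag), which is a reading of the definitions and not confirmed by the formula images. The function names are hypothetical, and the sketch treats the distance as 0 when the query point lies inside the node.

```python
import math


def min_distance_to_node(qx, qy, bounds):
    """Smallest distance from the query point to a node rectangle."""
    x0, y0, x1, y1 = bounds
    dx = max(x0 - qx, 0.0, qx - x1)
    dy = max(y0 - qy, 0.0, qy - y1)
    return math.hypot(dx, dy)  # 0.0 when the point is inside the node


def node_score_upper_bound(count, mg_error, window_size,
                           min_dist, diag, alpha):
    """Upper bound on a word's score inside one node (assumed formula)."""
    # Frequency part: summary count plus the maximum MG error.
    freq_part = min(1.0, (count + mg_error) / window_size)
    # Distance part: every message is at least min_dist away, so the
    # average distance is at least min_dist as well.
    dist_part = 1.0 - min_dist / diag
    return alpha * freq_part + (1.0 - alpha) * dist_part


distance = min_distance_to_node(13, 14, (0, 0, 10, 10))
bound = node_score_upper_bound(count=8, mg_error=2, window_size=100,
                               min_dist=distance, diag=50, alpha=0.7)
```

Using the nearest point of the node keeps the bound optimistic, which is what a best-first search needs: the bound can only shrink as the traversal descends to smaller nodes.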
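The best-first query algorithm comprises the following steps:
(1) Take the constructed quadtree, the query point, and k as input, and create a list as the result set, initialized to empty;
(2) Prune according to the MG summary of the root node of the constructed quadtree and k to obtain the candidate result set;
α is the parameter that balances location proximity and word frequency. C is a priority queue holding all candidate words. To obtain the candidate words, we extract the summary of the root node of the quadtree. If many candidate words are stored, while the k specified by the user is small, computing the scores of words that cannot appear in the result incurs a large additional time cost. We therefore devised a pruning strategy that avoids unnecessary computation while guaranteeing that no candidate word is lost.
The pruning proceeds as follows: after the exact value of k is obtained from the user input, we recompute the score of the k-th word with its "distance part" set to 0, and use the resulting score as a lower bound. Next, starting from the (k+1)-th word in the root node summary (the summary is already sorted), we recompute the "distance part" of these words using the maximum distance, which gives an upper bound. When the upper-bound score of the i-th word (i>k) is still smaller than the lower-bound score of the k-th word, we conclude that the words after the i-th cannot reach the top of the priority queue during the next k executions of lines 4-13 of Algorithm 2.
(3) Use a max-heap C to store each word of the candidate result set together with its score; C is a priority queue holding all candidate words.
(4) While the size of the result set is less than k, repeatedly take the word at the top of queue C (the word that currently has the highest score) and traverse from the root node to the leaf nodes; whenever a level yields a score smaller than the current one, replace the current value, until the leaf nodes are reached and the exact score of the word is found (only the inverted indexes in the leaf nodes hold the actually counted word frequencies), and then put the word back into the queue;
(5) Repeat step (4); when the score of the word at the head of the queue equals the exact score of that word at the leaf nodes, put the word into the result set;
The process of finding the exact score of a word is as follows. For each candidate word popped from the top of the priority queue, we traverse the tree from the root to the leaf nodes. If we find a smaller score in a child node than in its parent node, we replace the current score with the smaller one and insert the new score into the priority queue, until we obtain a score small enough to equal that of the head element of the priority queue. The word with its exact score is then added to the result set.
(6) When the size of the result set equals k, return the result set.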
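The pruning rule of step (2) can be sketched as follows. This is one reading of the description, with hypothetical names: the lower bound of the k-th word keeps only its frequency part, and the upper bound of each later word adds the maximum MG error to its frequency and takes the largest possible distance part, assumed here to be 1.

```python
def prune_candidates(root_summary, k, window_size, mg_error, alpha):
    """Return candidate words from the root summary after pruning by k."""
    # The text notes the summary is kept sorted; sort here to be safe.
    ranked = sorted(root_summary.items(), key=lambda p: p[1], reverse=True)
    if len(ranked) <= k:
        return [term for term, _ in ranked]
    # Lower bound of the k-th word: distance part set to 0.
    kth_count = ranked[k - 1][1]
    lower_bound = alpha * kth_count / window_size
    kept = ranked[:k]
    for term, count in ranked[k:]:
        freq_part = min(1.0, (count + mg_error) / window_size)
        upper_bound = alpha * freq_part + (1.0 - alpha) * 1.0
        if upper_bound < lower_bound:
            break  # later words have smaller counts, so they fail too
        kept.append((term, count))
    return [term for term, _ in kept]


candidates = prune_candidates(
    {"a": 90, "b": 80, "c": 75, "d": 5},
    k=2, window_size=100, mg_error=0, alpha=0.9,
)
print(candidates)
```

With these toy numbers the third word survives because its optimistic score could still beat the pessimistic score of the second word, while the fourth word cannot and is discarded together with everything after it.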
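IV. System of the Invention
As shown in Figure 4, the fast indexing system for location-based top-k keyword queries over a sliding window comprises a module for building the data index model and a query module;
The module for building the data index model comprises a unit for determining the quadtree geographic range and splitting rule, a data insertion unit, and a quadtree adjustment unit; the data insertion unit covers storing inverted indexes in leaf nodes and storing in non-leaf nodes the MG aggregated summaries of their child nodes; the quadtree adjustment unit covers inserting new data into the sliding window and deleting the data with the oldest timestamp;
The query module comprises a result-set initialization unit, a pruning unit, and a priority-queue result storage unit; the result-set initialization unit takes the constructed quadtree, the query point, and k as input and creates a list as the result set, initialized to empty, where k is the number of result keywords the user can specify; the pruning unit prunes according to the MG summary of the root node of the constructed quadtree and k to obtain the candidate result set, where the pruning substitutes the upper limit of the distance part in the score computation, narrowing the computation range while ensuring that k keywords can be returned; the priority-queue result storage unit takes the word with the highest score from the priority queue, traverses from the root node until the exact score is found at the leaf nodes, puts the exact value into the queue, and repeats until the top k words of the priority queue no longer change.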
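V. Experiments and Analysis
We validate our solution experimentally and compare it with another feasible method. All experiments were run on a workstation with an Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz, 64 GB of memory, and a 64-bit Windows operating system. The whole framework is implemented in Java.
The data set consists of tweets collected in the United States. It contains 20,000,000 records in total, each with a timestamp, a list of words, and the longitude and latitude at which the tweet was posted (that is, the geotag set by the user). Note that the result of each experiment is the average over more than 10 runs with different query inputs.
1. Baseline
As the baseline for comparison and validation, we use an algorithm that performs an exact computation every time new data enters the sliding window. The index structure used by the baseline method is also based on a quadtree. Specifically, each leaf node of the quadtree stores the exact frequency of every word. When a message arrives, we update the frequency table of the corresponding node. To obtain the frequency information of a non-leaf node, we must recursively traverse the node until the leaf nodes are reached. This method returns the exact result of an LkTQ, so it serves as the yardstick for the accuracy of query results in our subsequent experiments.
2. Index update of the quadtree
First, we ran an experiment to evaluate the performance of inserting a message into and deleting a message from the sliding window. Because we only look for the top-k keywords within one sliding window, once the window is full, each arrival of a new message requires an old message to be deleted.
We found that in both the baseline method and the method of the invention these two operations consume almost no time, because they operate on an already constructed quadtree. We therefore ran another experiment to measure the time consumed by constructing a quadtree, including word-frequency computation and index updating. The results are shown in Figure 5, where baseline denotes the baseline method and LkTQ denotes the method of the invention.
For the baseline method, constructing the quadtree involves computing and merging all word frequencies; for the method of the invention, the build process involves computing the MG summaries for all nodes in the quadtree. As can be seen, constructing the quadtree takes much longer with the method of the invention than with the baseline method. We ran further experiments, however, to show that even so the method of the invention is still more efficient than the baseline method.
3. Varying the message capacity of the quadtree leaf nodes
As mentioned earlier, when we build a quadtree to index all messages, a condition determines when a node is split and new child nodes are generated: when the number of messages in a node reaches M, the node becomes a parent node and splits. We varied the maximum number of messages stored in a leaf node to find which M yields better performance and whether M affects the results. The other parameters were set to k=20, α=0.7, and 500 counters in the MG summary. The number of counters was set to 500 mainly to reduce the summary error on large data sets.
Figure 6 shows the results. Figure 6(a) compares the two methods when the data set contains 10,000 messages. M ranges from 100 to 2000. The method of the invention (LkTQ) is much faster than the baseline method (baseline), and varying M causes only slight fluctuation. In the baseline method, the message capacity of the quadtree leaf nodes has no significant impact on performance: once M is fixed, the tree is fixed and all scores can be computed. In our algorithm, however, M affects performance. In theory, the larger M is, the shallower the quadtree. When computing the score of each node, the "distance part" uses the node edge nearest to the query point, so the deeper the tree, the smaller this distance and the larger the number of leaf nodes. As Figure 6(b) shows, time consumption increases as M increases, because the cost of splitting grows with M. When M is in the range of 300 to 500, the time consumption drops slightly; performance is best in this range.
4. Varying k
In this experiment we vary the target k, which is in practice specified by the user. The other parameters were fixed at α=0.7, a maximum of M=1000 messages per leaf node, and 100 counters in the MG summary. Although performance is best for M between 300 and 500, 1000 was chosen to control the depth of the quadtree and obtain more accurate results, because the experiments show that when M is close to 1000 the results remain consistent as the other parameters change.
Figure 7 shows the results. The range of the target k is set according to the typical needs of users. The algorithm of the invention performs significantly better than the baseline method, which computes the scores one by one (see Figure 7(a)). The data set in Figure 7(a) contains 10,000 messages, yet the baseline method takes approximately 7 minutes to return the result. The time consumption of the baseline method stays at a stable but inefficient level of approximately 400,000 ms. For larger data sets the baseline method runs even more slowly; for example, processing 5,000 messages takes nearly 12 million milliseconds and processing 100,000 messages takes nearly 60 million milliseconds, which is very inefficient. We therefore no longer compare results that are not comparable.
As expected, the time consumption of the algorithm of the invention grows as the target k increases. The large difference in time cost is not clearly visible at the scale of the axis in Figure 7(a), so we ran another experiment to demonstrate it; the results are shown in Figure 7(b). Moreover, the trend becomes more pronounced as the data set grows. To find the source of the speed, we ran a further experiment, which shows that after applying our pruning by k the actual number of candidates is very close to k. The result is shown in Figure 7(c). Figure 7(c) shows that pruning greatly reduces the computation for candidate words: after pruning by k, only slightly more than k candidate words need to be computed, whereas without this pruning step all candidate words in the root node would have to be computed, and that number is usually in the thousands even when the window is not very large. If the specified query k is small, the cost of the unnecessary computation is considerable. The effective pruning step of the method of the invention thus avoids unnecessary computation while guaranteeing that no candidate word is lost, which greatly reduces the computation cost.
5. Accuracy compared with the baseline method
Accuracy is an important factor for users. Figure 8 compares the accuracy of the algorithm of the invention with that of the baseline method. For data sets of different sizes we measured the proportion of correct top-k keywords returned by our algorithm. Because the baseline method runs very slowly, we chose relatively small data sets; this does not affect the high performance of the algorithm of the invention. When the target k is set to a small value, the algorithm of the invention returns highly accurate results and can guarantee 80% accuracy. As the target k grows, the accuracy drops slightly. Even so, the lowest accuracy is above 0.39, and it occurs when the target k is 100, which satisfies the needs of the vast majority of users.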
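The accuracy measure described above, the proportion of correct top-k keywords, can be written as a small helper. The function name `topk_accuracy` is hypothetical, and the sketch assumes that "correct" means membership in the exact top-k list returned by the baseline, regardless of rank order.

```python
def topk_accuracy(approximate, exact):
    """Fraction of the exact top-k keywords that the approximation found."""
    if not exact:
        raise ValueError("exact top-k list must not be empty")
    return len(set(approximate) & set(exact)) / len(exact)


accuracy = topk_accuracy(["rain", "storm", "sale", "news"],
                         ["rain", "storm", "flood", "game"])
print(accuracy)
```

In this toy case two of the four exact keywords are recovered, so the helper returns 0.5.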
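6. Varying the parameter α
α is the weight parameter that balances the two parts of the score formula. Changing the value of α adjusts the relative influence of distance and word frequency, and is determined by the user's preference. Experiments show that the results of our algorithm are sensitive to α in the range (0.9, 1.0). When α is set to 0 or 1, the result reflects the influence of distance alone or of word frequency alone. The sensitive range of α depends on the distribution of the data set. Our experimental results nevertheless show that the results respond to changes in α, so the algorithm can accommodate user preferences.
Taking the above preferred embodiments of the invention as a guide, those skilled in the art can make various changes and modifications based on the above description without departing from the technical idea of the invention. The technical scope of the invention is not limited to the content of the description and must be determined according to the scope of the claims.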

Claims (10)

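  1. A fast indexing method for location-based top-k keyword queries over a sliding window, characterized by comprising a stage of building a data index model and a query stage;
    the stage of building the data index model comprises the following steps:
    Step 1: determine the geographic range covered by the quadtree and the node splitting rule;
    Step 2: receive the data stream and insert data into the nodes;
    Step 3: split the nodes that meet the node splitting rule of Step 1, so that continued data insertion produces a complete quadtree;
    Step 4: for each leaf node, count its word frequencies and store an inverted index;
    Step 5: for each non-leaf node, store the MG aggregated summary of all its child nodes;
    Step 6: during the data insertion of Steps 4 and 5, maintain the size of the sliding window: delete the data item with the oldest timestamp, add the newest data, and adjust the index structure of the quadtree;
    the query stage comprises the following steps:
    First step: take the constructed quadtree, the query point, and k as input, and create a list as the result set, initialized to empty; k is the number of result keywords, which the user can specify;
    Second step: prune according to the MG summary of the root node of the constructed quadtree and k to obtain the candidate result set;
    Third step: use a max-heap C to store each word of the candidate result set together with its score; C is a priority queue holding all candidate words;
    Fourth step: while the size of the result set is less than k, repeatedly take the word at the head of queue C and traverse from the root node to the leaf nodes; whenever a level yields a score smaller than the current one, replace the current value, until the leaf nodes are reached and the exact score of the word is found, and then put the word back into the queue;
    Fifth step: repeat the fourth step; when the score of the word at the head of the queue equals the exact score of that word at the leaf nodes, put the word into the result set;
    Sixth step: when the size of the result set equals k, return the result set.
  2. The method of claim 1, characterized in that, in Step 1, the geographic range covered by the quadtree is determined by the given latitude and longitude coordinates of the upper-left and upper-right corners.
  3. The method of claim 1, characterized in that, in Step 1, the node splitting rule is: the number of data items in each leaf node may not exceed a set threshold M, and a node that exceeds it is split into four leaf nodes; alternatively, the depth of the tree is limited directly.
  4. The method of claim 1, characterized in that, in Step 4, each leaf node stores a summary of all the text information in the messages it contains; the MG summary is computed by the following algorithm:
    given a parameter k, where k is the number of result keywords the user can specify, an MG summary stores k-1 <item, count> pairs, and each new item i in the data stream is handled according to one of three cases:
    1) if i is already stored in the current counters, its counter value is increased by 1;
    2) if i is not in the monitored set and the number of counters has not yet reached k, i is inserted into the summary with its counter value set to 1;
    3) if i is not in the monitored set and the summary already holds k counters, the counter values of all monitored items are decreased by 1, and all items whose counter value is 0 are removed.
  5. The method of claim 1, characterized in that, in Step 5, the MG aggregated summary is obtained by the following merge procedure:
    first, at most 2k counters are produced; next comes a pruning operation: the values of these 2k counters are sorted in ascending order, the (k+1)-th counter is taken, and its value is subtracted from all counters; finally, all non-positive counters are removed; the merge completes with a constant number of sorting operations and a scan of the summaries of O(k) complexity.
  6. The method of claim 1, characterized in that, in Step 6, if the sliding window is not yet full, when a new message arrives and is inserted into a leaf node of the quadtree, the summary of that node is updated accordingly; next, its parent node also updates its merged summary; this process iterates upward until the root node of the quadtree holds the latest aggregated summary; if the sliding window is already full, when a new message arrives from the data stream and is inserted, the message with the oldest timestamp is deleted; the index is then updated in the same way as when the sliding window is not full.
  7. The method of claim 1, characterized in that, in the second step, the pruning proceeds as follows: after the exact value of k is obtained from the user input, the score of the k-th word is recomputed with its "distance part" set to 0, and the resulting score serves as a lower bound; next, starting from the (k+1)-th word in the root node summary, the "distance part" of these words is recomputed using the maximum distance, giving an upper bound; when the upper-bound score of the i-th word (i>k) is still smaller than the lower-bound score of the k-th word, the words after the i-th are deemed unable to reach the top of the priority queue within the next k operations.
  8. The method of claim 1, characterized in that, in the third step, the score is computed as follows:
    (1) the summary stored in each node is used to compute the score; equation (1) defines the score formula;
    let D be a two-dimensional Euclidean space, W a sliding window, and S the set of geo-textual messages located in D and W; each geo-textual message is denoted o=(pos,text), where pos is a location point in D and text is the text information; the location-aware frequency score of a word t in the sliding window W is defined as:
    Figure PCTCN2017113483-appb-100001
    where freq(t) is the number of messages containing word t, |W| is the total number of messages in the sliding window, d(q,Wt) is the sum of the distances between the query point q and the messages in window W that contain t, ddiag is the diagonal length of the rectangular region R, |Wt| is the number of messages in W that contain word t, and α is the parameter that balances the weight between word frequency and location proximity; the score is in essence a linear combination of the frequency of the word in W and the distance between the word and the query point q; the score formula is divided into a "frequency part"
    Figure PCTCN2017113483-appb-100002
    and a "distance part"
    Figure PCTCN2017113483-appb-100003
    since an MG summary estimates the frequency of any item with an error of at most n/(k+1), where n is the total number of messages, this maximum error is added to freq when computing the "frequency part"; d(q,Wt) is the sum of the distances between the query point and the messages containing word t, and the minimum distance from the query point to the four edges of the node containing the word is used to obtain an upper bound; the computation of the "distance part" must account for redundant computation for the same word, and includes a division by the number of messages in a node in which the same word appears; the two parts are summed with the linear weight parameter α, and the result is normalized to the interval [0,1];
    (2) after the score of each word in each node has been obtained, the scores of a word are combined to compute its score over the whole tree; this is done by summing the scores of the word in certain nodes so that the resulting score is as large as possible, subject to the rule that these nodes must cover the entire quadtree.
  9. The method of claim 1, characterized in that, in the fourth step, the word at the head of queue C is the word that currently has the highest score.
  10. A fast indexing system for location-based top-k keyword queries over a sliding window, characterized by comprising a module for building the data index model and a query module;
    the module for building the data index model comprises a unit for determining the quadtree geographic range and splitting rule, a data insertion unit, and a quadtree adjustment unit; the data insertion unit covers storing inverted indexes in leaf nodes and storing in non-leaf nodes the MG aggregated summaries of their child nodes; the quadtree adjustment unit covers inserting new data into the sliding window and deleting the data with the oldest timestamp;
    the query module comprises a result-set initialization unit, a pruning unit, and a priority-queue result storage unit; the result-set initialization unit takes the constructed quadtree, the query point, and k as input and creates a list as the result set, initialized to empty, where k is the number of result keywords the user can specify; the pruning unit prunes according to the MG summary of the root node of the constructed quadtree and k to obtain the candidate result set, where the pruning substitutes the upper limit of the distance part in the score computation, narrowing the computation range while ensuring that k keywords can be returned; the priority-queue result storage unit takes the word with the highest score from the priority queue, traverses from the root node until the exact score is found at the leaf nodes, puts the exact value into the queue, and repeats until the top k words of the priority queue no longer change.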
PCT/CN2017/113483 2017-09-22 2017-11-29 Fast indexing method and system for location-based top-k keyword queries over a sliding window WO2019056570A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710864358.2 2017-09-22
CN201710864358.2A CN107633068B (zh) 2017-09-22 2017-09-22 Fast indexing method and system for location-based top-k keyword queries over a sliding window

Publications (1)

Publication Number Publication Date
WO2019056570A1 true WO2019056570A1 (zh) 2019-03-28

Family

ID=61102510

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/113483 WO2019056570A1 (zh) 2017-09-22 2017-11-29 Fast indexing method and system for location-based top-k keyword queries over a sliding window

Country Status (2)

Country Link
CN (1) CN107633068B (zh)
WO (1) WO2019056570A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
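CN110866003B (zh) * 2018-08-27 2023-09-26 阿里云计算有限公司 Method and device for estimating the number of index values, and electronic device
CN109635106A (zh) * 2018-11-01 2019-04-16 九江学院 Top-k frequency calculation method for spatio-temporal data
CN110389965B (zh) * 2018-11-30 2023-03-14 上海德拓信息技术股份有限公司 Optimization method for multi-dimensional data query and caching
CN112527953B (zh) * 2020-11-20 2023-06-20 出门问问创新科技有限公司 Rule matching method and device
CN113407669B (zh) * 2021-06-18 2022-11-11 北京理工大学 Semantic trajectory query method based on activity influence
CN114756544A (zh) * 2022-03-22 2022-07-15 阿里云计算有限公司 Hotspot identification method and rate limiting method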

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
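CN101174273A (zh) * 2007-12-04 2008-05-07 清华大学 News event detection method based on metadata analysis
CN101789028A (zh) * 2010-03-19 2010-07-28 苏州广达友讯技术有限公司 Geographic location search engine and construction method thereof
CN102306183A (zh) * 2011-08-30 2012-01-04 王洁 Method for mining closed weighted frequent patterns from transaction data streams
US20170069123A1 (en) * 2013-02-05 2017-03-09 Facebook, Inc. Displaying clusters of media items on a map using representative media items
CN107506490A (zh) * 2017-09-22 2017-12-22 深圳大学 Priority query algorithm and system for location-based top-k keyword queries over a sliding window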

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102289507B (zh) * 2011-08-30 2015-05-27 王洁 Sliding-window-based method for mining weighted frequent patterns from data streams

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101174273A (zh) * 2007-12-04 2008-05-07 清华大学 基于元数据分析的新闻事件检测方法
CN101789028A (zh) * 2010-03-19 2010-07-28 苏州广达友讯技术有限公司 地理位置搜索引擎及其构建方法
CN102306183A (zh) * 2011-08-30 2012-01-04 王洁 一种对事务数据流进行闭合加权频繁模式挖掘的方法
US20170069123A1 (en) * 2013-02-05 2017-03-09 Facebook, Inc. Displaying clusters of media items on a map using representative media items
CN107506490A (zh) * 2017-09-22 2017-12-22 深圳大学 滑动窗口下基于位置top‑k关键词查询的优先查询算法及系统

Also Published As

Publication number Publication date
CN107633068A (zh) 2018-01-26
CN107633068B (zh) 2020-04-07

Similar Documents

Publication Publication Date Title
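WO2019056569A1 (zh) Priority query algorithm and system for location-based top-k keyword queries over a sliding window
WO2019056570A1 (zh) Fast indexing method and system for location-based top-k keyword queries over a sliding window
WO2019056568A1 (zh) Modeling method and system for location-based top-k keyword queries over a sliding window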
Cao et al. Keyword-aware optimal route search
Deng et al. Best keyword cover search
Li et al. Spatial approximate string search
Choudhury et al. Batch processing of top-k spatial-textual queries
Xu et al. Location-based top-k term querying over sliding window
JP2006072985A (ja) あいまいな重複に強い検出器
Balasubramanian et al. A state-of-art in R-tree variants for spatial indexing
Zhong et al. Location-aware instant search
Chen et al. Spatio-temporal top-k term search over sliding window
Abbasifard et al. Efficient indexing for past and current position of moving objects on road networks
CN110334290B (zh) Fast spatio-temporal data retrieval method based on MF-Octree
US8370363B2 (en) Hybrid neighborhood graph search for scalable visual indexing
Zheng et al. Searching activity trajectory with keywords
Dam et al. Efficient top-k recently-frequent term querying over spatio-temporal textual streams
Shin et al. An investigation of grid-enabled tree indexes for spatial query processing
Li et al. A parametric approximation algorithm for spatial group keyword queries
Tao et al. Range aggregation with set selection
Almaslukh et al. Temporal geo-social personalized search over streaming data
Wang et al. Efficient top/bottom-k fraction estimation in spatial databases using bounded main memory
Arseneau et al. STILT: Unifying spatial, temporal and textual search using a generalized multi-dimensional index
Yan et al. RDF knowledge graph keyword type search using frequent patterns
Lin et al. Efficient general spatial skyline computation

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17925789

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.10.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17925789

Country of ref document: EP

Kind code of ref document: A1